Datrics Classification Model JSON
Trained Datrics models are serialised into JSON format with the following structure:
- meta - the name of the Datrics model
- model_init_parameters - hyperparameters of the Datrics model that are applied at the model initialization stage
- model_fit_parameters - parameters that affect the model fitting (e.g. the source of the observation weights)
- model_predict_parameters - the main configuration of the model's output (for instance, the predict_probability binary flag that indicates whether the model returns class probabilities)
- additional_parameters - additional prediction parameters, e.g. the type of the model's output (class, probability, etc.)
- grouping_columns - the list of columns used for data stratification (see For-Loop)
- supported_category_values - for models that support categorical predictors: the list of categories of each categorical variable that are known to the trained model
- required_arguments - parameters that are mandatory for model fitting, for instance the target variable
- keep_columns - the list of columns selected for model fitting, including the predictors, the target, and the fit-parameters source
- transformed_data_columns - descriptor of the training data
- sample_data - small sample of training data
- sample_output - small sample of expected output
- train_plots_true, train_plots_pred, train_plots_proba - data for the model performance report
- model_quality - model quality metrics (accuracy, f1-score, roc-auc score, etc.)
- coefficients_summary - for logistic regression only - the coefficient significance report
- trained_models - the detailed description of the trained model (or models in the case of For-Loop) with metrics and additional parameters:
- model - the core of the Datrics model - JSON serialization of the fitted sklearn or dask implementation of the Data Mining / Machine Learning model
- Accuracy, Precision, Recall, F1 score, ROC AUC, Gini - metrics for each fitted model
- predictors - the list of predictors accepted by the specific fitted model (useful in For-Loop mode)
- trained_models_index - the type of the grouping variable
JSON Example
{ "meta": "Logistic_regression", "model_init_parameters": { "solver": "saga", "penalty": "l2", "class_weight": "balanced" }, "model_fit_parameters": {}, "model_predict_parameters": { "predict_probability": false }, "additional_parameters": { "predict_proba": "class" }, "grouping_columns": [], "supported_category_values": {}, "required_arguments": { "target_variable": "Survived" }, "light_run": false, "keep_columns": [ "Age", "Pclass", "Survived" ], "columns": [ "Age", "Pclass", "Survived" ], "coefficients_summary": { "Varibale": { "0": "Constant", "1": "Age", "2": "Pclass" }, "Coefficients": { "0": 0.8769659488693399, "1": -0.0013320311474405886, "2": -0.42406367332139877 }, "Standard Errors": { "0": 0.083, "1": 0.002, "2": 0.024 }, "t values": { "0": 10.537, "1": -0.879, "2": -17.992 }, "Probabilities": { "0": 0.0, "1": 0.38, "2": 0.0 } }, "transformed_data_columns": [ "Age", "Pclass" ], "model_quality": { "Accuracy": "0.7", "Precision": "0.69", "Recall": "0.7", "F1 score": "0.68", "ROC AUC": "0.71", "Gini": "0.42" }, "sample_data": { "Age": { "0": 22.0 }, "Pclass": { "0": 3.0 } }, "sample_output": { "Survived": { "0": 0.0 } }, "train_plots_true": [...], "train_plots_pred": [...], "train_plots_proba": [...], "trained_models": { "model": { "1": {...} }, "Accuracy": { "1": "0.7" }, "Precision": { "1": "0.69" }, "Recall": { "1": "0.7" }, "F1 score": { "1": "0.68" }, "ROC AUC": { "1": "0.71" }, "Gini": { "1": "0.42" }, "predictors": { "1": [ "Age", "Pclass", "Survived" ] } } }
Datrics Regression Model JSON
Trained Datrics models are serialised into JSON format with the following structure:
- meta - the name of the Datrics model
- model_init_parameters - hyperparameters of the Datrics model that are applied at the model initialization stage
- model_train_parameters - parameters specific to model fitting (e.g. the categorical feature processing strategy)
- required_arguments - parameters that are mandatory for model fitting, for instance the target variable
- required_arguments_types - the supported types of the arguments that are required for model training
- supported_column_types - the predictor types that are supported by the model
- dtypes - the expected types of the input DataFrame columns
- grouping_columns - the list of columns used for data stratification (see For-Loop)
- supported_category_values - for models that support categorical predictors: the list of categories of each categorical variable that are known to the trained model
- keep_columns - the list of columns selected for model fitting, including the predictors, the target, and the fit-parameters source
- transformed_data_columns - descriptor of the training data
- sample_data - small sample of training data
- sample_output - small sample of expected output
- model_quality - model quality metrics
- trained_models - the detailed description of the trained model (or models in case of For-Loop) with metrics and additional parameters:
- model - the core of the Datrics model - JSON serialization of the fitted sklearn or dask implementation of the Data Mining / Machine Learning model
- R2, RMSE, MAPE - metrics for each fitted model
- trained_models_index - the type of the grouping variable
- regularization - for linear regression only - the regularisation type (lasso, ridge, elastic, or None)
JSON Example
{ "meta": "LightGBM_regressor", "model_init_parameters": { "objective": "regression", "importance_type": "gain", "boosting": "gbdt", "learning_rate": 0.05634966830778477, "num_iterations": 122, "num_leaves": 39, "reg_alpha": 0, "reg_lambda": 1, "random_state": 878479377 }, "model_train_parameters": { "categorical_feature": "auto" }, "grouping_columns": [], "dtypes": { "Pclass": "float64", "Age": "float64", "Family": "float64", "Male": "float64", "Survived": "float64" }, "predictions_quality": { "R2": { "0": "0.29" }, "RMSE": { "0": "10.96" }, "MAPE": { "0": "55.9" } }, "required_arguments_types": { "target_variable": [ "int64", "float64" ] }, "optional_arguments_types": {}, "supported_column_types": [ "int64", "float64", "bool", "uint8", "int8", "int16", "category" ], "model_quality": { "R2": "0.29", "RMSE": "10.96", "MAPE": "91.78" }, "non_negative_predictions": true, "supports_nan": true, "keep_columns": [ "Pclass", "Age", "Family", "Male", "Survived" ], "light_run": false, "required_arguments": { "target_variable": "Age" }, "optional_arguments": {}, "columns": [ "Pclass", "Age", "Family", "Male", "Survived" ], "supported_category_values": {}, "sample_data": { "Pclass": { "178": 3.0 }, "Family": { "178": 0.0 }, "Male": { "178": 1.0 }, "Survived": { "178": 0.0 } }, "sample_output": { "Age": { "178": 36.0 } }, "transformed_data_columns": [ "Pclass", "Family", "Male", "Survived" ], "trained_models": { "model": { "1": { "meta": "lgbm_regressor", "boosting": "lgbm", "model": { "name": "tree", "version": "v3", "num_class": 1, "num_tree_per_iteration": 1, "label_index": 0, "max_feature_idx": 3, "average_output": false, "objective": "regression", "feature_names": [ "Pclass", "Family", "Male", "Survived" ], "monotone_constraints": [], "tree_info": [...], "pandas_categorical": [] }, "booster": "...", } }, "R2": { "1": "0.29" }, "RMSE": { "1": "10.96" }, "MAPE": { "1": "91.78" } }, "trained_models_index": "int64" }
Datrics Clustering Model JSON
Trained Datrics models are serialised into JSON format with the following structure:
- meta - the name of the Datrics model
- model_init_parameters - hyperparameters of the Datrics model that are applied at the model initialization stage
- required_arguments - parameters that are mandatory for model fitting, for instance the target variable
- required_arguments_types - the supported types of the arguments that are required for model training
- supported_column_types - the predictor types that are supported by the model
- grouping_columns - the list of columns used for data stratification (see For-Loop)
- keep_columns - the list of columns selected for model fitting, including the predictors, the target, and the fit-parameters source
- sample_data - small sample of training data
- sample_output - small sample of expected output
- model_quality - model quality metrics
- trained_models - the detailed description of the trained model (or models in case of For-Loop) with metrics and additional parameters:
- model - the core of the Datrics model - JSON serialization of the fitted sklearn or dask implementation of the Data Mining / Machine Learning model
- trained_models_index - the type of the grouping variable
JSON Example
{ "meta": "KMeans_segmentation", "model_init_parameters": { "n_clusters": 5, "random_state": 302433120 }, "grouping_columns": [], "required_arguments_types": {}, "optional_arguments_types": {}, "supported_column_types": [ "int64", "float64", "uint8", "int8", "int16" ], "model_quality": { "Coming soon": "Metrics" }, "supports_nan": false, "keep_columns": [ "Age", "Survived", "Male" ], "light_run": false, "required_arguments": {}, "optional_arguments": {}, "columns": [ "Age", "Survived", "Male" ], "sample_data": { "Age": { "0": 22.0 }, "Survived": { "0": 0.0 }, "Male": { "0": 1.0 } }, "sample_output": { "cluster": { "0": 1 } }, "trained_models": { "model": { "1": { "meta": "kmeans_clustering", "cluster_centers_": [ [ 20.431034482758562, 0.3620689655172413, 0.6120689655172414 ], [ 43.56617647058819, 0.37499999999999994, 0.6397058823529412 ], [ 30.042929292929294, 0.3686868686868686, 0.6818181818181819 ], [ 4.695652173912997, 0.5797101449275361, 0.5362318840579711 ], [ 59.60714285714282, 0.33928571428571425, 0.7321428571428572 ] ], "labels_": [...], "inertia_": 10690.017542906045, "n_features_in_": 3, "n_iter_": 4, "_n_threads": 1, "_tol": 0.005624848721806276, "params": { "algorithm": "auto", "copy_x": true, "init": "k-means++", "max_iter": 300, "n_clusters": 5, "n_init": 10, "n_jobs": "deprecated", "precompute_distances": "deprecated", "random_state": 302433120, "tol": 0.0001, "verbose": 0 } } }, "Coming soon": { "1": "Metrics" } }, "trained_models_index": "int64" }
Deserialization
To make trained Datrics models usable outside the Datrics platform, the datrics_json library has been developed. The library deserializes Datrics models from their JSON representation.
Install
pip install datrics-json
Example Usage
import datrics_json as datjson

model_dict = datjson.from_json(file_name)  # file_name - path to the exported Datrics model JSON
deserialized_model = list(model_dict.get('trained_models').values())[0]['model']
sample_data = model_dict.get('sample_data')['input']
deserialized_model.predict(sample_data)
Features
datrics-json requires scikit-learn >= 0.22.2 and LightGBM >= 2.3.1.
Supported scikit-learn Models
- sklearn.linear_model.LogisticRegression
- sklearn.ensemble.IsolationForest
- sklearn.cluster.KMeans
- sklearn.cluster.DBSCAN
- sklearn.linear_model.LinearRegression
- sklearn.linear_model.Ridge
- sklearn.linear_model.Lasso
- sklearn.linear_model.ElasticNet
Supported lightGBM Models
- lightgbm.LGBMClassifier - binary - Gradient Boosting Trees
- lightgbm.LGBMClassifier - multiclass - Gradient Boosting Trees
- lightgbm.LGBMClassifier - binary - Random Forest
- lightgbm.LGBMClassifier - multiclass - Random Forest
- lightgbm.LGBMRegressor - Gradient Boosting Trees
- lightgbm.LGBMRegressor - Random Forest
Supported Models JSON Structure
The trained models can be deserialized outside the Datrics platform because their JSON representation is fully compatible with the internal structure of the corresponding sklearn and dask model implementations.
Logistic Regression
scikit-learn - 0.23.2
model_dict = { "meta": "lr", "classes_": < List of Classes >, "coef_": < N x M List > : N - number of classes (1 for the bibary case), M - number of predictors "intercept_": < N x 1 List > "n_iter_": < N x 1 List >, "params": { "C": float, "class_weight": dict or 'balanced', default=None "dual": boolean, "fit_intercept": true, "intercept_scaling": float, "l1_ratio": float, "max_iter": int, "multi_class": str, "n_jobs": int, "penalty": str, "random_state": int, "solver": str, "tol": float, "verbose": int, "warm_start": false } }
Linear Regression
scikit-learn - 0.23.2
model_dict = { "meta": "lr", "coef_": < N x M List > : N - number of classes (1 for the bibary case), M - number of predictors "intercept_": < N x 1 List > "n_iter_": < N x 1 List >, "params": { "fit_intercept": bool, default=True. Model intercept flag "normalize": bool, default=False. Normalize flag for the regressors "copy_X": bool, default=True "n_jobs": int, default=None. The number of jobs to use for the computation. } }
Lasso Regression
scikit-learn - 0.23.2
model_dict = { "meta": "lasso-regression", ... }
Ridge Regression
scikit-learn - 0.23.2
model_dict = { "meta": "ridge-regression", ... }
Elastic Regression
scikit-learn - 0.23.2
model_dict = { "meta": "elasticnet-regression", ... }
K-Means Clustering
scikit-learn - 0.23.2
model_dict = { "meta": "kmeans_clustering", ... }
Isolation Forest
scikit-learn - 0.23.2
model_dict = { "meta": "iforest_anomaly", ... }
LGBM Binary Classification
LightGBM >= 2.3.1
model_dict = { "meta": "lgbm_binary", ... }
Random Forest Binary Classification
LightGBM >= 2.3.1
model_dict = { "meta": "rf_binary", ... }
LGBM Multiclass Classification
LightGBM >= 2.3.1
model_dict = { "meta": "lgbm_multiclass", ... }
Random Forest Multiclass Classification
LightGBM >= 2.3.1
model_dict = { "meta": "rf_multiclass", ... }
LGBM Regression
LightGBM >= 2.3.1
model_dict = { "meta": "lgbm_regressor", ... }
Random Forest Regression
LightGBM >= 2.3.1
model_dict = { "meta": "rf_regressor", ... }