Datrics Model Deserialization from JSON

Datrics Classification Model JSON

Trained Datrics models are serialised into JSON format with the following structure:
  • meta - the name of the Datrics model
  • model_init_parameters - hyperparameters of the Datrics model that are applied at the model initialization stage
  • model_fit_parameters - parameters that affect model fitting (e.g. the source of the observation weights)
  • model_predict_parameters - the main configuration of the model's output (for instance, the predict_probability binary flag, which indicates whether the model returns class probabilities)
  • additional_parameters - additional prediction parameters, e.g. the type of the model's output (class, probability, etc.)
  • grouping_columns - the list of columns used for data stratification (see For-Loop)
  • supported_category_values - for models that support categorical predictors: the categories of each categorical variable that are known to the trained model
  • required_arguments - parameters that are required for model fitting, such as the target variable
  • keep_columns - the list of columns selected for model fitting, including predictors, target, and fit-parameter sources
  • transformed_data_columns - descriptor of the training data
  • sample_data - a small sample of the training data
  • sample_output - a small sample of the expected output
  • train_plots_true, train_plots_pred, train_plots_proba - data for the model performance report
  • model_quality - model quality metrics (accuracy, F1 score, ROC AUC, etc.)
  • coefficients_summary - for logistic regression only: the coefficient significance report
  • trained_models - the detailed description of the trained model (or models, in the case of For-Loop) with metrics and additional parameters:
    • model - the core of the Datrics model: the JSON serialization of the fitted sklearn or dask implementation of the Data Mining / Machine Learning model
    • Accuracy, Precision, Recall, F1 score, ROC AUC, Gini - metrics for each fitted model
    • predictors - the list of predictors that each fitted model accepts (useful in For-Loop mode)
  • trained_models_index - the type of the grouping variable
JSON Example
{
  "meta": "Logistic_regression",
  "model_init_parameters": { "solver": "saga", "penalty": "l2", "class_weight": "balanced" },
  "model_fit_parameters": {},
  "model_predict_parameters": { "predict_probability": false },
  "additional_parameters": { "predict_proba": "class" },
  "grouping_columns": [],
  "supported_category_values": {},
  "required_arguments": { "target_variable": "Survived" },
  "light_run": false,
  "keep_columns": [ "Age", "Pclass", "Survived" ],
  "columns": [ "Age", "Pclass", "Survived" ],
  "coefficients_summary": {
    "Varibale": { "0": "Constant", "1": "Age", "2": "Pclass" },
    "Coefficients": { "0": 0.8769659488693399, "1": -0.0013320311474405886, "2": -0.42406367332139877 },
    "Standard Errors": { "0": 0.083, "1": 0.002, "2": 0.024 },
    "t values": { "0": 10.537, "1": -0.879, "2": -17.992 },
    "Probabilities": { "0": 0.0, "1": 0.38, "2": 0.0 }
  },
  "transformed_data_columns": [ "Age", "Pclass" ],
  "model_quality": { "Accuracy": "0.7", "Precision": "0.69", "Recall": "0.7", "F1 score": "0.68", "ROC AUC": "0.71", "Gini": "0.42" },
  "sample_data": { "Age": { "0": 22.0 }, "Pclass": { "0": 3.0 } },
  "sample_output": { "Survived": { "0": 0.0 } },
  "train_plots_true": [...],
  "train_plots_pred": [...],
  "train_plots_proba": [...],
  "trained_models": {
    "model": { "1": {...} },
    "Accuracy": { "1": "0.7" },
    "Precision": { "1": "0.69" },
    "Recall": { "1": "0.7" },
    "F1 score": { "1": "0.68" },
    "ROC AUC": { "1": "0.71" },
    "Gini": { "1": "0.42" },
    "predictors": { "1": [ "Age", "Pclass", "Survived" ] }
  }
}
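The top-level fields can be read with any JSON parser before the model itself is deserialized. The sketch below uses only the Python standard library and a trimmed copy of the example above. It illustrates two details visible in that example: metric values are stored as strings, and coefficients_summary is column-oriented (column name, then row id, then value). The helper name columns_to_rows is hypothetical and is not part of any Datrics library.

```python
import json

# Trimmed copy of the classification example above; only the fields used here.
# The key "Varibale" is spelled exactly as it appears in the example.
raw = """
{
  "meta": "Logistic_regression",
  "model_quality": {"Accuracy": "0.7", "ROC AUC": "0.71", "Gini": "0.42"},
  "coefficients_summary": {
    "Varibale": {"0": "Constant", "1": "Age", "2": "Pclass"},
    "Coefficients": {"0": 0.8769659488693399,
                     "1": -0.0013320311474405886,
                     "2": -0.42406367332139877},
    "Probabilities": {"0": 0.0, "1": 0.38, "2": 0.0}
  }
}
"""
model_json = json.loads(raw)

# Metrics are stored as strings, so convert them before doing arithmetic.
quality = {name: float(value) for name, value in model_json["model_quality"].items()}


def columns_to_rows(table):
    """Turn {"column": {"row_id": value}} into a list of row dicts (hypothetical helper)."""
    row_ids = sorted(next(iter(table.values())), key=int)
    return [{column: cells[row_id] for column, cells in table.items()}
            for row_id in row_ids]


rows = columns_to_rows(model_json["coefficients_summary"])
for row in rows:
    print(row)

# In this example the reported Gini agrees with 2 * ROC AUC - 1 after rounding.
gini_from_auc = round(2 * quality["ROC AUC"] - 1, 2)
print(gini_from_auc, quality["Gini"])
```

The same column-oriented layout is used by sample_data and sample_output, so the helper can be applied to them as well.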
 

Datrics Regression Model JSON

Trained Datrics models are serialised into JSON format with the following structure:
  • meta - the name of the Datrics model
  • model_init_parameters - hyperparameters of the Datrics model that are applied at the model initialization stage
  • model_train_parameters - parameters specific to model fitting (e.g. the categorical feature processing strategy)
  • required_arguments - parameters that are required for model fitting, such as the target variable
  • required_arguments_types - supported types of the arguments that are required for model training
  • supported_column_types - the predictor types that the model supports
  • dtypes - expected types of the input DataFrame columns
  • grouping_columns - the list of columns used for data stratification (see For-Loop)
  • supported_category_values - for models that support categorical predictors: the categories of each categorical variable that are known to the trained model
  • keep_columns - the list of columns selected for model fitting, including predictors, target, and fit-parameter sources
  • transformed_data_columns - descriptor of the training data
  • sample_data - a small sample of the training data
  • sample_output - a small sample of the expected output
  • model_quality - model quality metrics
  • trained_models - the detailed description of the trained model (or models, in the case of For-Loop) with metrics and additional parameters:
    • model - the core of the Datrics model: the JSON serialization of the fitted sklearn or dask implementation of the Data Mining / Machine Learning model
    • R2, RMSE, MAPE - metrics for each fitted model
  • trained_models_index - the type of the grouping variable
  • regularization - for linear regression only: the regularization type (lasso, ridge, elastic or None)
JSON Example
{
  "meta": "LightGBM_regressor",
  "model_init_parameters": {
    "objective": "regression",
    "importance_type": "gain",
    "boosting": "gbdt",
    "learning_rate": 0.05634966830778477,
    "num_iterations": 122,
    "num_leaves": 39,
    "reg_alpha": 0,
    "reg_lambda": 1,
    "random_state": 878479377
  },
  "model_train_parameters": { "categorical_feature": "auto" },
  "grouping_columns": [],
  "dtypes": { "Pclass": "float64", "Age": "float64", "Family": "float64", "Male": "float64", "Survived": "float64" },
  "predictions_quality": { "R2": { "0": "0.29" }, "RMSE": { "0": "10.96" }, "MAPE": { "0": "55.9" } },
  "required_arguments_types": { "target_variable": [ "int64", "float64" ] },
  "optional_arguments_types": {},
  "supported_column_types": [ "int64", "float64", "bool", "uint8", "int8", "int16", "category" ],
  "model_quality": { "R2": "0.29", "RMSE": "10.96", "MAPE": "91.78" },
  "non_negative_predictions": true,
  "supports_nan": true,
  "keep_columns": [ "Pclass", "Age", "Family", "Male", "Survived" ],
  "light_run": false,
  "required_arguments": { "target_variable": "Age" },
  "optional_arguments": {},
  "columns": [ "Pclass", "Age", "Family", "Male", "Survived" ],
  "supported_category_values": {},
  "sample_data": { "Pclass": { "178": 3.0 }, "Family": { "178": 0.0 }, "Male": { "178": 1.0 }, "Survived": { "178": 0.0 } },
  "sample_output": { "Age": { "178": 36.0 } },
  "transformed_data_columns": [ "Pclass", "Family", "Male", "Survived" ],
  "trained_models": {
    "model": {
      "1": {
        "meta": "lgbm_regressor",
        "boosting": "lgbm",
        "model": {
          "name": "tree",
          "version": "v3",
          "num_class": 1,
          "num_tree_per_iteration": 1,
          "label_index": 0,
          "max_feature_idx": 3,
          "average_output": false,
          "objective": "regression",
          "feature_names": [ "Pclass", "Family", "Male", "Survived" ],
          "monotone_constraints": [],
          "tree_info": [...],
          "pandas_categorical": []
        },
        "booster": "..."
      }
    },
    "R2": { "1": "0.29" },
    "RMSE": { "1": "10.96" },
    "MAPE": { "1": "91.78" }
  },
  "trained_models_index": "int64"
}
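The descriptive fields can be used to check new input before it is passed to a deserialized model. The sketch below is a minimal illustration in plain Python. It assumes that transformed_data_columns lists the predictor columns in the order used for training, as it does in the example above (where it matches feature_names of the serialized booster). The functions check_record and order_record are hypothetical helpers, not part of any Datrics library.

```python
# Trimmed copy of the regression example above; only the fields used here.
model_json = {
    "dtypes": {"Pclass": "float64", "Age": "float64", "Family": "float64",
               "Male": "float64", "Survived": "float64"},
    "required_arguments": {"target_variable": "Age"},
    "transformed_data_columns": ["Pclass", "Family", "Male", "Survived"],
}


def check_record(model_json, record):
    """Return (missing, unexpected) column names for one input record."""
    expected = model_json["transformed_data_columns"]
    missing = [c for c in expected if c not in record]
    unexpected = [c for c in record if c not in expected]
    return missing, unexpected


def order_record(model_json, record):
    """Return the record's values in the column order listed in the JSON."""
    return [record[c] for c in model_json["transformed_data_columns"]]


# The keys are deliberately out of order to show the reordering step.
record = {"Male": 1.0, "Pclass": 3.0, "Family": 0.0, "Survived": 0.0}

missing, unexpected = check_record(model_json, record)
if missing:
    raise ValueError(f"missing predictor columns: {missing}")

features = order_record(model_json, record)
print(features)
```

Note that the target variable (Age in this example) is not among the predictors, so a record that contains it is reported under unexpected.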
 

Datrics Clustering Model JSON

Trained Datrics models are serialised into JSON format with the following structure:
  • meta - the name of the Datrics model
  • model_init_parameters - hyperparameters of the Datrics model that are applied at the model initialization stage
  • required_arguments - parameters that are required for model fitting, such as the target variable
  • required_arguments_types - supported types of the arguments that are required for model training
  • supported_column_types - the predictor types that the model supports
  • grouping_columns - the list of columns used for data stratification (see For-Loop)
  • keep_columns - the list of columns selected for model fitting, including predictors, target, and fit-parameter sources
  • sample_data - a small sample of the training data
  • sample_output - a small sample of the expected output
  • model_quality - model quality metrics
  • trained_models - the detailed description of the trained model (or models, in the case of For-Loop) with metrics and additional parameters:
    • model - the core of the Datrics model: the JSON serialization of the fitted sklearn or dask implementation of the Data Mining / Machine Learning model
  • trained_models_index - the type of the grouping variable
JSON Example
{
  "meta": "KMeans_segmentation",
  "model_init_parameters": { "n_clusters": 5, "random_state": 302433120 },
  "grouping_columns": [],
  "required_arguments_types": {},
  "optional_arguments_types": {},
  "supported_column_types": [ "int64", "float64", "uint8", "int8", "int16" ],
  "model_quality": { "Coming soon": "Metrics" },
  "supports_nan": false,
  "keep_columns": [ "Age", "Survived", "Male" ],
  "light_run": false,
  "required_arguments": {},
  "optional_arguments": {},
  "columns": [ "Age", "Survived", "Male" ],
  "sample_data": { "Age": { "0": 22.0 }, "Survived": { "0": 0.0 }, "Male": { "0": 1.0 } },
  "sample_output": { "cluster": { "0": 1 } },
  "trained_models": {
    "model": {
      "1": {
        "meta": "kmeans_clustering",
        "cluster_centers_": [
          [ 20.431034482758562, 0.3620689655172413, 0.6120689655172414 ],
          [ 43.56617647058819, 0.37499999999999994, 0.6397058823529412 ],
          [ 30.042929292929294, 0.3686868686868686, 0.6818181818181819 ],
          [ 4.695652173912997, 0.5797101449275361, 0.5362318840579711 ],
          [ 59.60714285714282, 0.33928571428571425, 0.7321428571428572 ]
        ],
        "labels_": [...],
        "inertia_": 10690.017542906045,
        "n_features_in_": 3,
        "n_iter_": 4,
        "_n_threads": 1,
        "_tol": 0.005624848721806276,
        "params": {
          "algorithm": "auto",
          "copy_x": true,
          "init": "k-means++",
          "max_iter": 300,
          "n_clusters": 5,
          "n_init": 10,
          "n_jobs": "deprecated",
          "precompute_distances": "deprecated",
          "random_state": 302433120,
          "tol": 0.0001,
          "verbose": 0
        }
      }
    },
    "Coming soon": { "1": "Metrics" }
  },
  "trained_models_index": "int64"
}
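Because the serialized K-Means model stores its cluster_centers_, a cluster can be assigned to a new point without scikit-learn by picking the nearest center. The sketch below does this in plain Python with the centers from the example above. It assumes the coordinates follow the column order Age, Survived, Male given in keep_columns. The function nearest_center is a hypothetical helper. The returned value is the zero-based position in cluster_centers_; how Datrics numbers clusters in its output column is not described here, so the two numberings may differ.

```python
import math

# cluster_centers_ copied from the K-Means example above
# (assumed column order: Age, Survived, Male).
cluster_centers = [
    [20.431034482758562, 0.3620689655172413, 0.6120689655172414],
    [43.56617647058819, 0.37499999999999994, 0.6397058823529412],
    [30.042929292929294, 0.3686868686868686, 0.6818181818181819],
    [4.695652173912997, 0.5797101449275361, 0.5362318840579711],
    [59.60714285714282, 0.33928571428571425, 0.7321428571428572],
]


def nearest_center(point, centers):
    """Return the zero-based index of the center closest to point (Euclidean)."""
    distances = [math.dist(point, center) for center in centers]
    return distances.index(min(distances))


# The sample_data row from the example: Age=22, Survived=0, Male=1.
sample = [22.0, 0.0, 1.0]
index = nearest_center(sample, cluster_centers)
print(index)
```

math.dist requires Python 3.8 or later.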
 

Deserialization

The datrics_json library was developed so that trained Datrics models can be used outside the Datrics platform. The library deserializes Datrics models from their JSON representation.

Install

pip install datrics-json

Example Usage

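import datrics_json as datjson

# file_name: path to the JSON file of the trained Datrics model
model_dict = datjson.from_json(file_name)

# Take the first (or only) fitted model from trained_models
deserialized_model = list(model_dict.get('trained_models').values())[0]['model']

# Run the model on the sample input stored alongside it
sample_data = model_dict.get('sample_data')['input']
deserialized_model.predict(sample_data)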

Features

datrics-json requires scikit-learn >= 0.22.2 and LightGBM >= 2.3.1.

Supported scikit-learn Models

  • sklearn.linear_model.LogisticRegression
  • sklearn.ensemble.IsolationForest
  • sklearn.cluster.KMeans
  • sklearn.cluster.DBSCAN
  • sklearn.linear_model.LinearRegression
  • sklearn.linear_model.Ridge
  • sklearn.linear_model.Lasso
  • sklearn.linear_model.ElasticNet

Supported lightGBM Models

  • lightgbm.LGBMClassifier - binary - Gradient Boosting Trees
  • lightgbm.LGBMClassifier - multiclass - Gradient Boosting Trees
  • lightgbm.LGBMClassifier - binary - Random Forest
  • lightgbm.LGBMClassifier - multiclass - Random Forest
  • lightgbm.LGBMRegressor - Gradient Boosting Trees
  • lightgbm.LGBMRegressor - Random Forest

Test data

Supported Models JSON Structure

Trained models can be deserialized outside the Datrics platform because their JSON representation is fully compatible with the internal structure of the corresponding sklearn and dask model implementations.

Logistic Regression

scikit-learn - 0.23.2
model_dict = {
  "meta": "lr",
  "classes_": < List of Classes >,
  "coef_": < N x M List >,       # N - number of classes (1 for the binary case), M - number of predictors
  "intercept_": < N x 1 List >,
  "n_iter_": < N x 1 List >,
  "params": {
    "C": float,
    "class_weight": dict or 'balanced', default=None,
    "dual": boolean,
    "fit_intercept": true,
    "intercept_scaling": float,
    "l1_ratio": float,
    "max_iter": int,
    "multi_class": str,
    "n_jobs": int,
    "penalty": str,
    "random_state": int,
    "solver": str,
    "tol": float,
    "verbose": int,
    "warm_start": false
  }
}
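To show how the stored arrays relate to a prediction, the sketch below evaluates a binary model of this layout by hand with the textbook logistic formula, p = 1 / (1 + exp(-(intercept + coef · x))). It is an illustration in plain Python, not the library's implementation. The numbers are illustrative values borrowed from the coefficients_summary of the classification example (predictors Age and Pclass); it is an assumption that they correspond to coef_ and intercept_ of that model. The functions predict_proba_binary and predict_binary are hypothetical helpers, and models fitted in a multinomial configuration compute probabilities differently.

```python
import math

# Hypothetical binary model in the layout described above (N = 1, M = 2).
model_dict = {
    "meta": "lr",
    "classes_": [0, 1],
    "coef_": [[-0.0013320311474405886, -0.42406367332139877]],  # Age, Pclass
    "intercept_": [0.8769659488693399],
}


def predict_proba_binary(model_dict, x):
    """Return [P(classes_[0]), P(classes_[1])] for one observation x."""
    z = model_dict["intercept_"][0] + sum(
        weight * value for weight, value in zip(model_dict["coef_"][0], x)
    )
    p1 = 1.0 / (1.0 + math.exp(-z))
    return [1.0 - p1, p1]


def predict_binary(model_dict, x):
    """Return the class label with the higher probability."""
    proba = predict_proba_binary(model_dict, x)
    return model_dict["classes_"][proba.index(max(proba))]


x = [22.0, 3.0]  # Age=22, Pclass=3
proba = predict_proba_binary(model_dict, x)
label = predict_binary(model_dict, x)
print(proba, label)
```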

Linear Regression

scikit-learn - 0.23.2
model_dict = {
  "meta": "lr",
  "coef_": < N x M List >,       # N - number of classes (1 for the binary case), M - number of predictors
  "intercept_": < N x 1 List >,
  "n_iter_": < N x 1 List >,
  "params": {
    "fit_intercept": bool, default=True. Model intercept flag
    "normalize": bool, default=False. Normalize flag for the regressors
    "copy_X": bool, default=True
    "n_jobs": int, default=None. The number of jobs to use for the computation.
  }
}

Lasso Regression

scikit-learn - 0.23.2
model_dict = { "meta": "lasso-regression", ... }

Ridge Regression

scikit-learn - 0.23.2
model_dict = { "meta": "ridge-regression", ... }

Elastic Regression

scikit-learn - 0.23.2
model_dict = { "meta": "elasticnet-regression", ... }

K-Means Clustering

scikit-learn - 0.23.2
model_dict = { "meta": "kmeans_clustering", ... }

Isolation Forest

scikit-learn - 0.23.2
model_dict = { "meta": "iforest_anomaly", ... }

LGBM Binary Classification

LightGBM >= 2.3.1
model_dict = { "meta": "lgbm_binary", ... }

Random Forest Binary Classification

LightGBM >= 2.3.1
model_dict = { "meta": "rf_binary", ... }

LGBM Multiclass Classification

LightGBM >= 2.3.1
model_dict = { "meta": "lgbm_multiclass", ... }

Random Forest Multiclass Classification

LightGBM >= 2.3.1
model_dict = { "meta": "rf_multiclass", ... }

LGBM Regression

LightGBM >= 2.3.1
model_dict = { "meta": "lgbm_regressor", ... }

Random Forest Regression

LightGBM >= 2.3.1
model_dict = { "meta": "rf_regressor", ... }