Datrics Model Deserialization from JSON

Datrics Classification Model JSON

Trained Datrics models are serialised into JSON format with the following structure:

meta - the name of the Datrics model

model_init_parameters - hyperparameters of the Datrics model that are applied at model initialization

model_fit_parameters - parameters that affect model fitting (e.g. the source of the observation weights)

model_predict_parameters - main configuration of the model's outcome (for instance, the predict_probability binary flag, which indicates whether the model returns class probabilities)

additional_parameters - additional prediction parameters, such as the type of the model's outcome (class, probability, etc.)

grouping_columns - the list of columns that are used for the data stratification (see For-Loop)

supported_category_values - for models that support categorical predictors: the list of categories of each categorical variable that are known to the trained model.

required_arguments - parameters that are required for the model fitting, for instance the target variable.

keep_columns - the list of columns that were selected for the model fitting, including predictors, target, and fit-parameters source.

transformed_data_columns - descriptor of training data

sample_data - small sample of training data

sample_output - small sample of expected output

train_plots_true, train_plots_pred, train_plots_proba - data for the model performance report

model_quality - model quality metrics (accuracy, f1-score, roc-auc score, etc.)

coefficients_summary - for logistic regression only - coefficient significance report

trained_models - the detailed description of the trained model (or models in case of For-Loop) with metrics and additional parameters:

model - the core of the Datrics model - JSON serialization of the fitted sklearn or dask implementation of the Data Mining / Machine Learning model
Accuracy, Precision, Recall, F1 score, ROC AUC, Gini - metrics for each fitted model
predictors - the list of predictors that a specific fitted model accepts (useful in For-Loop mode)

trained_models_index - the type of the grouping variable

JSON Example


{
  "meta": "Logistic_regression",
  "model_init_parameters": {
    "solver": "saga",
    "penalty": "l2",
    "class_weight": "balanced"
  },
  "model_fit_parameters": {},
  "model_predict_parameters": {
    "predict_probability": false
  },
  "additional_parameters": {
    "predict_proba": "class"
  },
  "grouping_columns": [],
  "supported_category_values": {},
  "required_arguments": {
    "target_variable": "Survived"
  },
  "light_run": false,
  "keep_columns": [
    "Age",
    "Pclass",
    "Survived"
  ],
  "columns": [
    "Age",
    "Pclass",
    "Survived"
  ],
  "coefficients_summary": {
    "Varibale": {
      "0": "Constant",
      "1": "Age",
      "2": "Pclass"
    },
    "Coefficients": {
      "0": 0.8769659488693399,
      "1": -0.0013320311474405886,
      "2": -0.42406367332139877
    },
    "Standard Errors": {
      "0": 0.083,
      "1": 0.002,
      "2": 0.024
    },
    "t values": {
      "0": 10.537,
      "1": -0.879,
      "2": -17.992
    },
    "Probabilities": {
      "0": 0.0,
      "1": 0.38,
      "2": 0.0
    }
  },
  "transformed_data_columns": [
    "Age",
    "Pclass"
  ],
  "model_quality": {
    "Accuracy": "0.7",
    "Precision": "0.69",
    "Recall": "0.7",
    "F1 score": "0.68",
    "ROC AUC": "0.71",
    "Gini": "0.42"
  },
  "sample_data": {
    "Age": {
      "0": 22.0
    },
    "Pclass": {
      "0": 3.0
    }
  },
  "sample_output": {
    "Survived": {
      "0": 0.0
    }
  },
  "train_plots_true": [...],
  "train_plots_pred": [...],
  "train_plots_proba": [...],
  "trained_models": {
    "model": {
      "1": {...}
    },
    "Accuracy": {
      "1": "0.7"
    },
    "Precision": {
      "1": "0.69"
    },
    "Recall": {
      "1": "0.7"
    },
    "F1 score": {
      "1": "0.68"
    },
    "ROC AUC": {
      "1": "0.71"
    },
    "Gini": {
      "1": "0.42"
    },
    "predictors": {
      "1": [
        "Age",
        "Pclass",
        "Survived"
      ]
    }
  }
}
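The fields under trained_models are keyed first by field name and then by group key ("1" in the example above). A minimal sketch, using only the standard library, of pivoting that layout into one record per fitted model. The helper name per_model_records is hypothetical, and the fragment is copied from the example above rather than loaded from a real export; the serialized model body is elided in the example, so an empty placeholder stands in for it.

```python
import json

# Fragment of the classification example above; the serialized model body is
# elided in the documentation, so an empty placeholder object is used here.
raw = """
{
  "trained_models": {
    "model": {"1": {}},
    "Accuracy": {"1": "0.7"},
    "F1 score": {"1": "0.68"},
    "ROC AUC": {"1": "0.71"},
    "predictors": {"1": ["Age", "Pclass", "Survived"]}
  }
}
"""


def per_model_records(trained_models):
    """Pivot {field: {group_key: value}} into {group_key: {field: value}}."""
    records = {}
    for field, by_group in trained_models.items():
        for group_key, value in by_group.items():
            records.setdefault(group_key, {})[field] = value
    return records


model_json = json.loads(raw)
records = per_model_records(model_json["trained_models"])

for group_key, record in records.items():
    # Metrics are stored as strings in the example, so convert before comparing.
    accuracy = float(record["Accuracy"])
    print(group_key, accuracy, record["predictors"])
```

In For-Loop mode the same pivot would presumably yield one record per group key, which makes it easy to compare the metrics of the individual fitted models.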

Datrics Regression Model JSON

Trained Datrics models are serialised into JSON format with the following structure:

meta - the name of the Datrics model

model_init_parameters - hyperparameters of the Datrics model that are applied at model initialization

model_train_parameters - model fitting specific parameters (e.g. categorical features processing strategy)

required_arguments - parameters that are required for the model fitting, for instance the target variable.

required_arguments_types - supported types of the arguments that are required for the model training

supported_column_types - the predictors types that are supported in the model

dtypes - expected types of the input Dataframe columns

grouping_columns - the list of columns that are used for the data stratification (see For-Loop)

supported_category_values - for models that support categorical predictors: the list of categories of each categorical variable that are known to the trained model.

keep_columns - the list of columns that were selected for the model fitting, including predictors, target, and fit-parameters source.

transformed_data_columns - descriptor of training data

sample_data - small sample of training data

sample_output - small sample of expected output

model_quality - model quality metrics

trained_models - the detailed description of the trained model (or models in case of For-Loop) with metrics and additional parameters:

model - the core of the Datrics model - JSON serialization of the fitted sklearn or dask implementation of the Data Mining / Machine Learning model
R2, RMSE, MAPE - metrics for each fitted model

trained_models_index - the type of the grouping variable

regularization - the regularization type, for linear regression only (lasso, ridge, elastic or None)

JSON Example


{
  "meta": "LightGBM_regressor",
  "model_init_parameters": {
    "objective": "regression",
    "importance_type": "gain",
    "boosting": "gbdt",
    "learning_rate": 0.05634966830778477,
    "num_iterations": 122,
    "num_leaves": 39,
    "reg_alpha": 0,
    "reg_lambda": 1,
    "random_state": 878479377
  },
  "model_train_parameters": {
    "categorical_feature": "auto"
  },
  "grouping_columns": [],
  "dtypes": {
    "Pclass": "float64",
    "Age": "float64",
    "Family": "float64",
    "Male": "float64",
    "Survived": "float64"
  },
  "predictions_quality": {
    "R2": {
      "0": "0.29"
    },
    "RMSE": {
      "0": "10.96"
    },
    "MAPE": {
      "0": "55.9"
    }
  },
  "required_arguments_types": {
    "target_variable": [
      "int64",
      "float64"
    ]
  },
  "optional_arguments_types": {},
  "supported_column_types": [
    "int64",
    "float64",
    "bool",
    "uint8",
    "int8",
    "int16",
    "category"
  ],
  "model_quality": {
    "R2": "0.29",
    "RMSE": "10.96",
    "MAPE": "91.78"
  },
  "non_negative_predictions": true,
  "supports_nan": true,
  "keep_columns": [
    "Pclass",
    "Age",
    "Family",
    "Male",
    "Survived"
  ],
  "light_run": false,
  "required_arguments": {
    "target_variable": "Age"
  },
  "optional_arguments": {},
  "columns": [
    "Pclass",
    "Age",
    "Family",
    "Male",
    "Survived"
  ],
  "supported_category_values": {},
  "sample_data": {
    "Pclass": {
      "178": 3.0
    },
    "Family": {
      "178": 0.0
    },
    "Male": {
      "178": 1.0
    },
    "Survived": {
      "178": 0.0
    }
  },
  "sample_output": {
    "Age": {
      "178": 36.0
    }
  },
  "transformed_data_columns": [
    "Pclass",
    "Family",
    "Male",
    "Survived"
  ],
  "trained_models": {
    "model": {
      "1": {
        "meta": "lgbm_regressor",
        "boosting": "lgbm",
        "model": {
          "name": "tree",
          "version": "v3",
          "num_class": 1,
          "num_tree_per_iteration": 1,
          "label_index": 0,
          "max_feature_idx": 3,
          "average_output": false,
          "objective": "regression",
          "feature_names": [
            "Pclass",
            "Family",
            "Male",
            "Survived"
          ],
          "monotone_constraints": [],
          "tree_info": [...],
          "pandas_categorical": []
        },
        "booster": "..."
      }
    },
    "R2": {
      "1": "0.29"
    },
    "RMSE": {
      "1": "10.96"
    },
    "MAPE": {
      "1": "91.78"
    }
  },
  "trained_models_index": "int64"
}
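The dtypes, supported_column_types and transformed_data_columns fields describe what input the model expects, so they can be used to check a dataset before prediction. The following is a minimal sketch under stated assumptions: check_input_columns is a hypothetical helper, not part of any Datrics library, and treating transformed_data_columns as the list of predictors is inferred from the example above. Whether Datrics itself enforces such checks is not stated here.

```python
def check_input_columns(model_json, input_dtypes):
    """Return a list of human-readable problems; an empty list means none found.

    input_dtypes maps column name -> dtype string. For a pandas DataFrame `df`
    it could be built as {c: str(t) for c, t in df.dtypes.items()}.
    """
    problems = []
    supported = set(model_json.get("supported_column_types", []))
    expected = model_json.get("dtypes", {})
    for column in model_json.get("transformed_data_columns", []):
        if column not in input_dtypes:
            problems.append(f"missing column: {column}")
            continue
        actual = input_dtypes[column]
        if supported and actual not in supported:
            problems.append(f"unsupported type for {column}: {actual}")
        elif column in expected and actual != expected[column]:
            problems.append(f"{column}: expected {expected[column]}, got {actual}")
    return problems


# Fragment of the regression example above.
model_json = {
    "dtypes": {"Pclass": "float64", "Age": "float64", "Family": "float64",
               "Male": "float64", "Survived": "float64"},
    "supported_column_types": ["int64", "float64", "bool", "uint8",
                               "int8", "int16", "category"],
    "transformed_data_columns": ["Pclass", "Family", "Male", "Survived"],
}

ok = check_input_columns(model_json, {"Pclass": "float64", "Family": "float64",
                                      "Male": "float64", "Survived": "float64"})
bad = check_input_columns(model_json, {"Pclass": "object", "Family": "int64",
                                       "Male": "float64"})
print(ok)
print(bad)
```

The sketch reports a type mismatch even when the actual type is in supported_column_types (for example int64 instead of float64); depending on the model, such a column may still be usable after a cast.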

Datrics Clustering Model JSON

Trained Datrics models are serialised into JSON format with the following structure:

meta - the name of the Datrics model

model_init_parameters - hyperparameters of the Datrics model that are applied at model initialization

required_arguments - parameters that are required for the model fitting, for instance the target variable.

required_arguments_types - supported types of the arguments that are required for the model training

supported_column_types - the predictors types that are supported in the model

grouping_columns - the list of columns that are used for the data stratification (see For-Loop)

keep_columns - the list of columns that were selected for the model fitting, including predictors, target, and fit-parameters source.

sample_data - small sample of training data

sample_output - small sample of expected output

model_quality - model quality metrics

trained_models - the detailed description of the trained model (or models in case of For-Loop) with metrics and additional parameters:

model - the core of the Datrics model - JSON serialization of the fitted sklearn or dask implementation of the Data Mining / Machine Learning model

trained_models_index - the type of the grouping variable

JSON Example


{
  "meta": "KMeans_segmentation",
  "model_init_parameters": {
    "n_clusters": 5,
    "random_state": 302433120
  },
  "grouping_columns": [],
  "required_arguments_types": {},
  "optional_arguments_types": {},
  "supported_column_types": [
    "int64",
    "float64",
    "uint8",
    "int8",
    "int16"
  ],
  "model_quality": {
    "Coming soon": "Metrics"
  },
  "supports_nan": false,
  "keep_columns": [
    "Age",
    "Survived",
    "Male"
  ],
  "light_run": false,
  "required_arguments": {},
  "optional_arguments": {},
  "columns": [
    "Age",
    "Survived",
    "Male"
  ],
  "sample_data": {
    "Age": {
      "0": 22.0
    },
    "Survived": {
      "0": 0.0
    },
    "Male": {
      "0": 1.0
    }
  },
  "sample_output": {
    "cluster": {
      "0": 1
    }
  },
  "trained_models": {
    "model": {
      "1": {
        "meta": "kmeans_clustering",
        "cluster_centers_": [
          [
            20.431034482758562,
            0.3620689655172413,
            0.6120689655172414
          ],
          [
            43.56617647058819,
            0.37499999999999994,
            0.6397058823529412
          ],
          [
            30.042929292929294,
            0.3686868686868686,
            0.6818181818181819
          ],
          [
            4.695652173912997,
            0.5797101449275361,
            0.5362318840579711
          ],
          [
            59.60714285714282,
            0.33928571428571425,
            0.7321428571428572
          ]
        ],
        "labels_": [...],
        "inertia_": 10690.017542906045,
        "n_features_in_": 3,
        "n_iter_": 4,
        "_n_threads": 1,
        "_tol": 0.005624848721806276,
        "params": {
          "algorithm": "auto",
          "copy_x": true,
          "init": "k-means++",
          "max_iter": 300,
          "n_clusters": 5,
          "n_init": 10,
          "n_jobs": "deprecated",
          "precompute_distances": "deprecated",
          "random_state": 302433120,
          "tol": 0.0001,
          "verbose": 0
        }
      }
    },
    "Coming soon": {
      "1": "Metrics"
    }
  },
  "trained_models_index": "int64"
}
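Because the serialized KMeans model stores cluster_centers_ explicitly, a point can be assigned to a cluster by finding the nearest center. The sketch below does this with the standard library only. It assumes that the coordinates of each center follow the order of keep_columns (Age, Survived, Male); nearest_center is a hypothetical helper, and the center values are rounded copies of the example above.

```python
import math

# Values copied from the KMeans example above (rounded for readability).
cluster_centers = [
    [20.431, 0.362, 0.612],
    [43.566, 0.375, 0.640],
    [30.043, 0.369, 0.682],
    [4.696, 0.580, 0.536],
    [59.607, 0.339, 0.732],
]
columns = ["Age", "Survived", "Male"]  # assumed order, taken from keep_columns
sample = {"Age": 22.0, "Survived": 0.0, "Male": 1.0}  # row "0" of sample_data


def nearest_center(point, centers):
    """Return the zero-based index of the center with the smallest Euclidean distance."""
    distances = [math.dist(point, center) for center in centers]
    return distances.index(min(distances))


point = [sample[name] for name in columns]
index = nearest_center(point, cluster_centers)
print(index)
```

The result is a zero-based position in cluster_centers_. How Datrics maps that position to the label reported in sample_output is not specified in this document.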

Deserialization

The datrics_json library has been developed so that trained Datrics models can be used outside the Datrics platform. The library deserializes Datrics models from their JSON representation.

Install


pip install datrics-json

Example Usage


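import datrics_json as datjson

# Path to the JSON file exported from Datrics (placeholder; use your own file).
file_name = "model.json"

# Deserialize the file and take the first trained model.
model_dict = datjson.from_json(file_name)
deserialized_model = list(model_dict.get('trained_models').values())[0]['model']

# Run the model on the sample data stored with it.
sample_data = model_dict.get('sample_data')['input']
deserialized_model.predict(sample_data)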

Features

sklearn-json requires scikit-learn >= 0.22.2 and LightGBM >= 2.3.1.

Supported scikit-learn Models

sklearn.linear_model.LogisticRegression

sklearn.ensemble.IsolationForest

sklearn.cluster.KMeans

sklearn.cluster.DBSCAN

sklearn.linear_model.LinearRegression

sklearn.linear_model.Ridge

sklearn.linear_model.Lasso

sklearn.linear_model.ElasticNet

Supported lightGBM Models

lightgbm.LGBMClassifier - binary - Gradient Boosting Trees

lightgbm.LGBMClassifier - multiclass - Gradient Boosting Trees

lightgbm.LGBMClassifier - binary - Random Forest

lightgbm.LGBMClassifier - multiclass - Random Forest

lightgbm.LGBMRegressor - Gradient Boosting Trees

lightgbm.LGBMRegressor - Random Forest

Test data

Examples of the JSON representation of Datrics models

Supported Models JSON Structure

The trained models can be deserialized outside the Datrics platform because their JSON structure mirrors the internal structure of the corresponding sklearn and dask model implementations.

Logistic Regression

scikit-learn - 0.23.2


model_dict = 
{
  "meta": "lr",
  "classes_": < List of Classes >,
  "coef_": < N x M List >,  (N - number of classes, 1 for the binary case; M - number of predictors)
  "intercept_": < N x 1 List >,
  "n_iter_": < N x 1 List >,
  "params": {
    "C": float,
    "class_weight": dict or 'balanced', default=None
    "dual": boolean,
    "fit_intercept": true,
    "intercept_scaling": float,
    "l1_ratio": float,
    "max_iter": int,
    "multi_class": str,
    "n_jobs": int,
    "penalty": str,
    "random_state": int,
    "solver": str,
    "tol": float,
    "verbose": int,
    "warm_start": false
  }
}
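The coef_ and intercept_ fields are enough to reproduce a binary logistic regression prediction by hand: the linear score is passed through the sigmoid function. The following sketch illustrates this with the standard library only. The numbers are the "Constant", "Age" and "Pclass" entries of coefficients_summary from the classification example; treating them as intercept_ and coef_ is an assumption made for illustration, as are the classes_ values and the helper name predict_proba_binary.

```python
import math

# Illustrative binary model in the layout described above. The numbers come
# from coefficients_summary in the classification example; using them as
# intercept_ and coef_ is an assumption, not a confirmed mapping.
model_dict = {
    "meta": "lr",
    "classes_": [0, 1],
    "coef_": [[-0.0013320311474405886, -0.42406367332139877]],
    "intercept_": [0.8769659488693399],
}


def predict_proba_binary(model, row):
    """Probability of classes_[1] for one row of predictors (binary case, N = 1)."""
    z = model["intercept_"][0] + sum(w * x for w, x in zip(model["coef_"][0], row))
    return 1.0 / (1.0 + math.exp(-z))


row = [22.0, 3.0]  # Age, Pclass from sample_data of the classification example
probability = predict_proba_binary(model_dict, row)
classes = model_dict["classes_"]
predicted = classes[1] if probability >= 0.5 else classes[0]
print(round(probability, 3), predicted)
```

With these values the probability falls below 0.5, so the first class is predicted, which agrees with the sample_output of the classification example.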

Linear Regression

scikit-learn - 0.23.2


model_dict = 
{
  "meta": "lr",
  "coef_": < N x M List >,  (N - number of classes, 1 for the binary case; M - number of predictors)
  "intercept_": < N x 1 List >,
  "n_iter_": < N x 1 List >,
  "params": {
    "fit_intercept": bool, default=True. Model intercept flag
    "normalize": bool, default=False. Normalize flag for the regressors
    "copy_X": bool, default=True
    "n_jobs": int, default=None. The number of jobs to use for the computation.
  }
}

Lasso Regression

scikit-learn - 0.23.2


model_dict =
{
	"meta": "lasso-regression",
	...
}

Ridge Regression

scikit-learn - 0.23.2


model_dict = 
{
	"meta": "ridge-regression",
	...
}

Elastic Net Regression

scikit-learn - 0.23.2


model_dict = 
{
	"meta": "elasticnet-regression",
	...
}

K-Means Clustering

scikit-learn - 0.23.2


model_dict = 
{
	"meta": "kmeans_clustering",
	...
}

Isolation Forest

scikit-learn - 0.23.2


model_dict = 
{
	"meta": "iforest_anomaly",
	...
}

LGBM Binary Classification

LightGBM >= 2.3.1


model_dict = 
{
	"meta": "lgbm_binary",
	...
}

Random Forest Binary Classification

LightGBM >= 2.3.1


model_dict = 
{
	"meta": "rf_binary",
	...
}

LGBM Multiclass Classification

LightGBM >= 2.3.1


model_dict = 
{
	"meta": "lgbm_multiclass",
	...
}

Random Forest Multiclass Classification

LightGBM >= 2.3.1


model_dict = 
{
	"meta": "rf_multiclass",
	...
}

LGBM Regression

LightGBM >= 2.3.1


model_dict = 
{
	"meta": "lgbm_regressor",
	...
}

Random Forest Regression

LightGBM >= 2.3.1


model_dict = 
{
	"meta": "rf_regressor",
	...
}