Predictive Model

General Information

This component is a universal solution for supervised machine learning problems. Supervised machine learning can be defined as learning a function that maps the independent characteristics describing objects or phenomena to the expected outcomes (categories, values, etc.). The function is inferred from training examples: model fitting consists of feeding the data into the model and iteratively adjusting the model's weights until the model fits the data appropriately.
Supervised learning requires a training set that represents the relationship between the input parameters (features) and the desired output (target). The type of the target variable defines the type of the corresponding data mining problem (classification or regression); a minimal code sketch of both cases follows the list below:
  • Classification is the supervised learning problem in which the target variable is discrete. Classification consists in finding a function that recognizes an object's class based on its features.
    • Examples: email spam detection, image recognition, credit scoring, user compliance categorization.
  • Regression is the supervised learning problem in which the target variable is continuous. Regression is used to understand the relationship between the dependent and independent variables.
    • Examples: market trends prediction, time-series forecasting, risk assessment
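The distinction can be illustrated with a short, self-contained sketch. It uses scikit-learn and toy data purely as an example of the two problem types; it is not the model actually selected by the brick.

```python
from sklearn.linear_model import LogisticRegression, LinearRegression

# Classification: the target is discrete (e.g. spam = 1 / not spam = 0)
X_cls = [[0.1], [0.4], [0.8], [0.9]]
y_cls = [0, 0, 1, 1]
clf = LogisticRegression().fit(X_cls, y_cls)
print(clf.predict([[0.7]]))    # returns a class label

# Regression: the target is continuous (e.g. a price or a risk value)
X_reg = [[1.0], [2.0], [3.0], [4.0]]
y_reg = [1.9, 4.1, 6.2, 7.8]
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict([[2.5]]))    # returns a continuous value
```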
The Predictive Model brick automatically detects the type of the supervised learning problem based on the selected target variable and selects the input features that are appropriate for modeling (a conceptual sketch of such detection is given after the list below). There are two modes of the Predictive Model brick settings:
  • Simple mode - the user defines the target variable only, and the rest is done automatically: the component determines the data mining problem, chooses the list of predictors, selects the appropriate model, and tunes it.
  • Advanced mode - the user not only defines the target variable but also selects the type of data mining problem and composes the list of predictors.
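The brick's internal detection rules are not described in this document. Conceptually, the problem type can be guessed from the data type and cardinality of the target column; the heuristic below is an assumption for illustration, not the Datrics implementation.

```python
import pandas as pd

def guess_problem_type(target: pd.Series, max_classes: int = 20) -> str:
    """Illustrative heuristic: the actual rules used by the brick may differ."""
    if not pd.api.types.is_numeric_dtype(target):
        return "classification"      # strings / categories -> classification
    if target.nunique() <= max_classes:
        return "classification"      # few distinct numeric values -> classification
    return "regression"              # many distinct numeric values -> regression

print(guess_problem_type(pd.Series([0, 1, 1, 0])))                    # classification
print(guess_problem_type(pd.Series([x * 0.37 for x in range(100)])))  # regression
```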

Description

Brick Location

Bricks → Analytics → AutoML → Predictive Model
Bricks → Analytics → Data Mining / ML → Classification Models → Predictive Model
Bricks → Analytics → Data Mining / ML → Regression Models → Predictive Model

Brick Parameters

  • Target Variable
    • The column that we want the model to predict. This variable can be either continuous or discrete (categorical), which defines the type of the data mining problem (regression or classification, respectively).
  • Quick run
    • A binary flag that determines the model fitting scenario: if True, the model is fitted with the default parameters, without hyper-parameter tuning; if False, computational performance is sacrificed in favor of model precision.
  • Select Problem
    • Advanced option. A drop-down menu that allows selecting the desired data mining problem.
  • Filter Columns Settings
    • Advanced option, aimed at composing the list of predictors.
      Columns
      The list of columns available for selection. Several columns can be chosen for filtering by clicking the '+' button in the brick settings, and the way they are processed can be specified:
      • remove all selected columns from the dataset and use the remaining ones as predictors
      • use the selected columns as predictors and discard the rest
      Remove all except selected
      A binary flag that determines the behavior with respect to the selected columns (see the sketch after this list).
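The two filtering behaviors map onto a simple column selection. A rough pandas sketch of the assumed semantics (function and column names are illustrative, not the brick's code):

```python
import pandas as pd

def filter_columns(df, selected, remove_all_except_selected):
    """Illustrative column filtering mirroring the assumed brick semantics."""
    if remove_all_except_selected:
        # keep only the selected columns as predictors
        return df[selected]
    # drop the selected columns and keep the rest as predictors
    return df.drop(columns=selected)

df = pd.DataFrame({"age": [22, 38], "fare": [7.25, 71.28], "ticket": ["A/5", "PC"]})
print(filter_columns(df, ["ticket"], remove_all_except_selected=False).columns.tolist())
# ['age', 'fare']
```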

Brick Inputs/Outputs

  • Inputs
    • The brick takes a data set with a target column that meets the requirements of supervised machine learning.
  • Outputs
    • The brick produces two outputs:
    • Data - the modified input data set with added columns for the predicted classes or class probabilities (see the sketch after this list)
    • Model - the trained model, which can be used as an input to other bricks
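To make the shape of the Data output more concrete, the following sketch reproduces the idea with scikit-learn; the appended column names here are illustrative, not the brick's actual naming.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.DataFrame({"age": [22, 38, 26, 35], "fare": [7.25, 71.28, 7.92, 53.10],
                   "survived": [0, 1, 1, 1]})
X, y = df[["age", "fare"]], df["survived"]

model = RandomForestClassifier(random_state=0).fit(X, y)

# Data output: the input frame plus the predicted class and class probability
out = df.copy()
out["prediction"] = model.predict(X)                     # predicted class
out["prediction_proba"] = model.predict_proba(X)[:, 1]   # probability of class 1
print(out)
```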

Outcomes

  • Model Performance
    • This button (located in the Deployment section) lets you check the model's performance metrics so that you can adjust your pipeline if needed.
      Model performance metrics depend on the data mining problem being solved (how such metrics can be computed is sketched after this list):
      Classification
    • Supported metrics: accuracy, precision, recall, F1-score, ROC AUC, Gini
    • Plots: feature importance, classification report, class error report, ROC curve, precision-recall curve, discrimination threshold, confusion matrix
      Regression
    • Supported metrics: RMSE, MAPE, R2
  • Save model asset
    • This option provides a mechanism for saving your trained models so that they can be reused in other projects. To do this, specify the model's name, or create a new version of an already existing model (in this case, specify the new version's name).
  • Download model asset
    • Use this feature if you want to download the model asset and use it outside the Datrics platform (a general persistence pattern is also sketched after this list).
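The metrics listed above are standard. As an illustration of how they can be computed outside the platform with scikit-learn (this is not how the brick computes them internally), note that the Gini coefficient can be derived from ROC AUC as Gini = 2 * AUC - 1:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             roc_auc_score, mean_squared_error,
                             mean_absolute_percentage_error, r2_score)

# Classification metrics on toy labels and scores (illustration only)
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])
y_proba = np.array([0.2, 0.9, 0.4, 0.3, 0.8])      # probability of class 1

auc = roc_auc_score(y_true, y_proba)
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
print("ROC AUC  :", auc)
print("Gini     :", 2 * auc - 1)

# Regression metrics on toy values
y_true_r = np.array([3.0, 5.0, 2.5, 7.0])
y_pred_r = np.array([2.8, 5.4, 2.9, 6.5])
print("RMSE:", np.sqrt(mean_squared_error(y_true_r, y_pred_r)))
print("MAPE:", mean_absolute_percentage_error(y_true_r, y_pred_r))
print("R2  :", r2_score(y_true_r, y_pred_r))
```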
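The format of the downloaded asset is not specified in this section. Purely as a general illustration of reusing a persisted model outside a platform, a common Python pattern with joblib looks like the sketch below; it is an assumption, not a description of the Datrics asset format.

```python
import joblib
from sklearn.linear_model import LogisticRegression

# Train and persist a toy model (stands in for any trained model object)
model = LogisticRegression().fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])
joblib.dump(model, "model.joblib")

# Later, in another script or environment, reload and reuse it
restored = joblib.load("model.joblib")
print(restored.predict([[1.5]]))
```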

Example of usage

Let's try to predict which Titanic passengers were more likely to survive the shipwreck. For this purpose, we can take the titanic.csv dataset and connect it to the Predictive Model brick.

Dataset description

  • passengerid (category/int) - ID of passenger
  • name (category/string) - Passenger's name
  • pclass (category/int) - Ticket class
  • sex (category/string) - Gender
  • age (numeric) - Age in years
  • sibsp (numeric) - Number of siblings / spouses aboard the Titanic
  • parch (category/int) - Number of parents / children aboard the Titanic
  • ticket (category/string) - Ticket number (contains letters)
  • fare (numeric) - Passenger fare
  • cabin (category/string) - Cabin number (contains letters)
  • embarked (category/string) - Port of Embarkation
  • survived (category/binary) - indicator of whether the passenger survived

Executing simple-mode pipeline

The following steps build a simple test pipeline:
  • First, drag and drop the titanic.csv file from the Storage → Samples folder and the Predictive Model brick from Bricks → Analytics → AutoML
  • Connect the data set to the Predictive Model brick and choose the target variable "survived"
  • Run the pipeline
Some of the columns of the input dataset cannot be used as model predictors, so they are excluded from the feature list, and the user gets the corresponding notification.
 
If we open the "What-If" dashboard, we can see the list of predictors and check the model results for different sets of input parameters. As we can see, the model's feature vector contains numerical parameters such as age, fare, parch, pclass, and sibsp; a rough approximation of this run outside the platform is sketched below.
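For readers who want to reproduce the idea outside the platform, an approximate scikit-learn equivalent of this simple-mode run (not the brick's actual model, preprocessing, or tuning) could look like this:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# titanic.csv is assumed to have the columns described above
df = pd.read_csv("titanic.csv")

# Numeric predictors kept by the brick in this example
predictors = ["age", "fare", "parch", "pclass", "sibsp"]
X = df[predictors].fillna(df[predictors].median())
y = df["survived"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```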
 
As expected, the data mining problem was treated as a classification problem, so the Model Performance report shows the metrics and plots related to classification performance.
The prediction results are displayed in the output data table.

Executing advanced-mode pipeline

  • Drag and drop the titanic.csv file from the Storage → Samples folder and the Predictive Model brick from Bricks → Analytics → AutoML
  • Connect the data set to the Predictive Model brick and choose the target variable "survived"
  • Switch to the Advanced Mode
  • Select the Regression problem
  • Select age and fare as predictors
  • Run the pipeline
Now we get the solution of a regression problem, and the model produces a value that can be interpreted as a survival score (approximated in the sketch below).
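A rough outside-the-platform approximation of this advanced-mode setup, treating "survived" as a continuous target with age and fare as predictors (the model type chosen by the brick may differ), might look like this:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("titanic.csv")

# Advanced-mode setup: "survived" treated as continuous, age and fare as predictors
X = df[["age", "fare"]].fillna(df[["age", "fare"]].median())
y = df["survived"]

reg = LinearRegression().fit(X, y)

# The continuous prediction can be read as a survival score rather than a hard class
df["survival_score"] = reg.predict(X)
print(df[["age", "fare", "survival_score"]].head())
```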

Model Save and Deployment