Predictive Model

General Information

This component is a universal solution for "supervised machine learning problem" solving. Supervised machine learning can be defined as learning a function that performs the mapping of the independent characteristics that describe objects or phenomena to the expected outcomes (categories, values, etc.). The inference of the analytical function is performed based on the trained examples - the model fitting process lies in feeding the data into the model and iterative adjustment of the model's weights until the model has been fitted appropriately.

Supervised learning requires the training set, which represents the connections between input parameters (features) and the desired output (target) - the type of the target variable defines the type of the corresponded Data Mining problem (classification or regression):

Classification is the supervised learning problem when the target variable is discrete. Classification is connected with finding a function, which performs the recognition of the object's class based on its features.

Examples: email spam detection, Image recognition, credit scoring, user compliance categorization.

Regression is the supervised learning problem when the target variable is continuous. Regression is used to understand the relationship between dependent and independent variables.

Examples: market trends prediction, time-series forecasting, risk assessment

Predictive Model brick automatically performs the detecting of the supervised learning problem's type, based on the selected target variable, as well as the selection of the input features that are appropriate for the modeling. There are two modes of Predictive Model brick settings:

Simple mode - the user should define the target variable only, and the rest will be made automatically - the component defines the data mining problem, choose the list of predictors, select the appropriate model, and makes it tunning

Advanced mode - the user may not only define the target variable but select the type of data mining problem and compound the list of predictors.

Description

Brick Location

Bricks → Analytics → AutoML → Predictive Model

Bricks → Analytics → Data Mining / ML → Classification Models → Predictive Model

Bricks → Analytics → Data Mining / ML → Regression Models → Predictive Model

Brick Parameters

Target Variable

The column that we want the model to predict. This variable can be both continuous and discrete (categorical) - this defines the type of the data mining problem (classification or regression).

Quick run

The binary flag, which determines the model fitting scenario - if True, the model will be tuned with the default parameters without the hyper-parameters tuning, if False - we sacrifice the computational performance in favor of model precision.

Select Problem

Advances option. A drop-down menu that allows selecting the desired data mining problem.

Filter Columns Settings

Advances option. This is aimed at the predictor's list composition.

Columns

List of possible columns for selection. It is possible to choose several columns for filtering by clicking on the '+' button in the brick settings and specify the way of their processing:

remove all mentioned columns from the dataset and proceed with the rest ones as with predictors

use the selected columns as predictors and proceed with them

Remove all except selected

The binary flag, which determines the behavior in the context of the selected columns

Brick Inputs/Outputs

Inputs

Brick takes the data set with a target column that meets the requirements to the supervised machine learning problem solving.

Outputs

Brick produces two outputs as the result:

Data - modified input data set with added columns for predicted classes or classes' probability
Model - trained model that can be used in other bricks as an input

Outcomes

Model Performance

This button (located in the Deployment section) gives you a possibility to check the model's performance (a.k.a. metrics) to then adjust your pipeline if needed.

Model performance metrics depend on the solved data mining problem:

Classification

Supported metrics: accuracy, precision, recall, f1-score, ROC AUC, Gini
Plots: feature importance, classification report, Class Error report, ROC curve, Precision-Recall curve, Discrimination threshold, confusion matrix

Regression

Supported metrics: RMSE, MAPE, R2

Save model asset

This option provides a mechanism to save your trained models to use them in other projects. For this, you will need to specify the model's name or you can create a new version of an already existing model (you will need to specify the new version's name).

Download model asset

Use this feature, if you want to download the model's asset to use it outside the Datrics platform.

Example of usage

Let's try to predict which Titanic passengers are more likely to survive the Titanic shipwreck. For this purpose, we may take the titanic.csv dataset and connect it with the Predictive Model brick.

Dataset description

passengerid (category/int) - ID of passenger

name (category/string) - Passenger's nhame

pclass (category/int) - Ticket class

sex (category/string) - Gender

age (numeric) - Age in years

sibsp (numeric) - Number of siblings / spouses aboard the Titanic

parch (category/int) - Number of parents / children aboard the Titanic

ticket (category/string) - Ticket number (contains letters)

fare (numeric) - Passenger fare

cabin (category/string) - Cabin number (contains letters)

embarked (category/string) - Port of Embarkation

survived (category/binary) - indicator of the passenger surviving

Executing simple-mode pipeline

Next steps would be made to build simple test pipeline:

First, drag-drop titanic.csv file from Storage→Samples folder and Predictive Model from Bricks → Analytics → AutoML

Connect the data set to Predictive Model brick, and choose the target variable "survived"

Run pipeline

Some of the columns from the input dataset can't be used as a model's predictors, so they were excluded from the features list, and the user gets the corresponded notification.

If we open the "What-If" dashboard, we may see the list of predictors and check the model results for the different sets of the input parameters. As we can see, the model's feature vector contains numerical parameters such as age, fare, parch, pclass, and silbsp.

As was expected, the data mining problem was considered as a Classification problem, so the Model Performance report depicts the metrics and plots that are related to the classification performance.

The results are depicted in the table:

Executing advanced-mode pipeline

Drag-drop titanic.csv file from Storage→Samples folder and Predictive Model from Bricks → Analytics → AutoML

Connect the data set to Predictive Model brick, and choose the target variable "survived"

Switch to the Advanced Mode

Select Regression problem

Select age and fare as predictors

Run pipeline

Now, we get the solution of the regression problem and the model produces the value that can be considered as a survival score.

Model Save and Deployment