Logistic Regression

General information

The brick provides a possibility to create your own logistic regression model to solve classification problems. The logistic regression model uses the logistic function to squeeze the output of a linear equation between 0 and 1. The logistic function is defined as:

This model can help you predict the likelihood of an event happening or a choice being made. For example, you may want to know the likelihood of a visitor choosing an offer made on your website — or not. Logistic model in our interpretation is used to solve binary classification problem.

Description

Brick Location

Bricks → Machine Learning → Logisitc Regression

Brick Parameters

Simple mode

Regularization

Optional parameter for regularization where you can choose between L1 and L2 terms. L1 and L2 terms are regularization techniques to prevent model overfitting. The main idea is quite simple we just add another term to our cost function both regularization presented below:

Regularization is a technique used for tuning the function by adding penalty term in the error function, which helps overcome overfitting. The model supports the next types of regularization:

Lasso Regression (L1) - Least Absolute Shrinkage and Selection Operator, is a type of linear regression that uses shrinkage. Shrinkage is where data values are shrunk towards a central point like “mean”. The lasso procedure encourages simple, sparse models (i.e. models with fewer parameters). This particular type of regression is well-suited for models showing high levels of multicollinearity or when you want to automate certain parts of model selection, like variable selection/parameter elimination. The cost function for Lasso regression is:

Ridge Regression (L2) - (also known as Tikhonov regularization), ridge regression shrinks the coefficients and it helps to reduce the model complexity and multi-collinearity

Balancing

There are three options for balancing parameter: none, auto, weighting. This parameter helps to balance your classes so that they become equal or at least close to equal. In case of choosing none or auto there is no need to do anything else but when you choose weighting it engage you to also choose a column with weights.

Class/Probability of class

There is an option where you can choose output as a class predicted or probability of class.

Target variable

Parameter to choose target column from all columns so that the model can learn how to classify objects.

Filter column setting (columns)

If you have columns in your data that need to be ignored (but not removed from the data set) during the training process (and later during the predictions), you should specify them in this parameter. To select multiple columns, click the '+' button in the brick settings.

In addition, you can ignore all columns except the ones you specified by enabling the "Remove all except selected" option. This may be useful if you have a large number of columns while the model should be trained just on some of them.

Advanced mode

Looks the same except there are two more parameters to control:

Optimization mode

There is Recursive Features Elimination that we help to deal with useless features in your data

Train Explainer

It is parameter that will build model explainer for API usage, you can learn more deatails here.

Additional Features

What-if This option gives access to the information for the Model Deployment service, as well as a possibility to call API using custom values.

Model Performance

Gives you a possibility to check the model's performance (a.k.a. metrics) to then adjust your pipeline if needed.

There are some metrics presented below:

Feature Importance - shows how many influence each feature has for final prediction

Coefficients Summary - matrix of some coefficients for some features

Classification Report - stats for classification results

Class Error Report - shows histogram of errors

ROC AUC Curve - ROC metric is generally good for measuring model performance

Precision Recall Curve - graph for precision and recall metrics

Discrimination Report - ??

Model Scores Distribution - model score graph

Confusion Matrix - summary of prediction results

Save model

This option provides a mechanism to save your trained models to use them in other projects. For this, you will need to specify the model's name or you can create a new version of an already existing model (you will need to specify the new version's name).

Download

Use this feature, if you want to download model's asset to use it outside Datrics platform.

Meta-data

Option to access meta data for pipeline

Example of usage

Let's consider a simple titanic classification problem, it is a problem of binary classification, in the end we need to predict if passendger on titanic survived or not . We have the next variables:

passengerid (category) - ID of passenger

name (category) - Passenger's name

pclass (category) - Ticket class

sex (category) - Gender

age (numeric) - Age in years

sibsp (numeric) - Number of siblings / spouses aboard the Titanic

parch (category) - Number of parents / children aboard the Titanic

ticket (category) - Ticket number

fare (numeric) - Passenger fare

cabin (category) - Cabin number

embarked (category) - Port of Embarkation

survived (boolean) - True/False

Then we use Missing Values Treatment brick to fill some missing values and then filtering some rows to fit model, settings presented below:

Let’s move to results: