Logistic Regression

General information

The brick provides the possibility to create your own logistic regression model to solve classification problems. The logistic regression model uses the logistic function to squeeze the output of a linear equation between 0 and 1. The logistic function is defined as:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

where $z$ is the output of the linear equation. This model can help you predict the likelihood of an event happening or a choice being made. For example, you may want to know the likelihood of a visitor choosing an offer made on your website, or not. In our interpretation, the logistic model is used to solve binary classification problems.
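As a quick illustration of how the logistic function squeezes a linear equation's output into a probability (a minimal NumPy sketch, not part of the brick; the coefficient values are made up):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squeezes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Output of a linear equation z = b0 + b1*x1 + b2*x2
# (coefficients here are made-up values for illustration).
b0, b1, b2 = -1.0, 0.8, 0.3
x1, x2 = 2.0, 1.5

z = b0 + b1 * x1 + b2 * x2
p = sigmoid(z)           # probability of the positive class
label = int(p >= 0.5)    # binary prediction at the default 0.5 threshold
print(p, label)
```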

Description

Brick Location

Bricks → Machine Learning → Logistic Regression

Brick Parameters

Simple mode

  • Regularization
    • Optional parameter where you can choose between L1 and L2 regularization terms. L1 and L2 are regularization techniques used to prevent model overfitting. The main idea is quite simple: we just add a penalty term to the cost (error) function, which helps overcome overfitting. A scikit-learn analogy of these settings is sketched after this list. The model supports the following types of regularization:
    • Lasso Regression (L1) - Least Absolute Shrinkage and Selection Operator, a type of regression that uses shrinkage. Shrinkage is where data values are shrunk towards a central point, like the mean. The lasso procedure encourages simple, sparse models (i.e. models with fewer parameters). This type of regression is well-suited for models showing high levels of multicollinearity, or when you want to automate certain parts of model selection, like variable selection/parameter elimination. The cost function for Lasso regression adds the absolute values of the coefficients to the logistic loss:

      $$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[y_i \log h_\theta(x_i) + (1 - y_i)\log\big(1 - h_\theta(x_i)\big)\Big] + \lambda \sum_{j=1}^{n} \lvert\theta_j\rvert$$

      where $h_\theta(x) = \sigma(\theta^T x)$ is the model's predicted probability and $\lambda$ controls the regularization strength.
    • Ridge Regression (L2) - also known as Tikhonov regularization; ridge regression shrinks the coefficients, which helps to reduce model complexity and multicollinearity. Its cost function penalizes the squared coefficients instead:

      $$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[y_i \log h_\theta(x_i) + (1 - y_i)\log\big(1 - h_\theta(x_i)\big)\Big] + \lambda \sum_{j=1}^{n} \theta_j^2$$
  • Balancing
    • There are three options for the balancing parameter: none, auto, and weighting. This parameter helps to balance your classes so that they become equal, or at least close to equal. If you choose none or auto, there is no need to do anything else, but if you choose weighting, you also need to select a column with weights.
  • Class/Probability of class
    • An option to choose whether the model outputs the predicted class or the probability of the class.
  • Target variable
    • Parameter to choose the target column from all columns, so that the model can learn how to classify objects.
  • Filter column setting (columns)
    • If you have columns in your data that need to be ignored (but not removed from the data set) during the training process (and later during the predictions), you should specify them in this parameter. To select multiple columns, click the '+' button in the brick settings.
      In addition, you can ignore all columns except the ones you specified by enabling the "Remove all except selected" option. This may be useful if you have a large number of columns while the model should be trained just on some of them.
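The brick handles these settings internally. As a rough analogy for how the regularization, balancing, and output parameters behave (a sketch assuming scikit-learn as a reference, not the brick's actual implementation):

```python
# Rough scikit-learn analogy for the brick's parameters
# (illustration only, not the brick's internal implementation).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced binary classification data.
X, y = make_classification(n_samples=500, n_features=8,
                           weights=[0.8, 0.2], random_state=0)

model = LogisticRegression(
    penalty="l1",             # Regularization: "l1" (Lasso) or "l2" (Ridge)
    C=1.0,                    # inverse of the regularization strength lambda
    solver="liblinear",       # a solver that supports the L1 penalty
    class_weight="balanced",  # Balancing: weight classes by inverse frequency
)
model.fit(X, y)

print(model.predict(X[:5]))        # output as predicted class
print(model.predict_proba(X[:5]))  # output as probability of each class
```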

Advanced mode

Advanced mode looks the same as simple mode, except there are two more parameters to control:
  • Optimization mode
    • Enables Recursive Feature Elimination, which helps to deal with useless features in your data (see the sketch after this list).
  • Train Explainer
    • This parameter builds a model explainer for API usage; you can learn more details here.
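The brick runs feature elimination for you. As a minimal sketch of the underlying technique (assuming scikit-learn's RFE as a stand-in, not the brick's actual code):

```python
# Minimal sketch of Recursive Feature Elimination
# (illustration only, not the brick's internal implementation).
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=4, random_state=0)

# Recursively drop the weakest features until 4 remain.
selector = RFE(estimator=LogisticRegression(max_iter=1000),
               n_features_to_select=4)
selector.fit(X, y)

print(selector.support_)  # boolean mask of the selected features
print(selector.ranking_)  # rank 1 means the feature was kept
```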

Additional Features

  • What-if
    • This option gives access to the information for the Model Deployment service, as well as the possibility to call the API using custom values.
  • Model Performance
    • Gives you the possibility to check the model's performance (a.k.a. metrics) and then adjust your pipeline if needed. A scikit-learn sketch of a few of these reports is shown after this list.
      The following reports are presented:
    • Feature Importance - shows how much influence each feature has on the final prediction
    • Coefficients Summary - summary table of the fitted model coefficients for the features
    • Classification Report - per-class statistics for the classification results
    • Class Error Report - shows a histogram of errors
    • ROC AUC Curve - plots the true positive rate against the false positive rate; the area under the curve (AUC) summarizes how well the model separates the classes
    • Precision Recall Curve - plots precision against recall across classification thresholds
    • Discrimination Report - ??
    • Model Scores Distribution - distribution of the model's predicted scores
    • Confusion Matrix - summary of prediction results: predicted vs. actual classes
  • Save model
    • This option provides a mechanism to save your trained models so you can use them in other projects. For this, you will need to specify the model's name, or you can create a new version of an already existing model (in that case, you will need to specify the new version's name).
  • Download
    • Use this feature if you want to download the model's asset to use it outside the Datrics platform.
  • Meta-data
    • Option to access the pipeline's metadata.
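As an illustration of what a few of the performance reports above contain, here is a minimal sketch using scikit-learn equivalents (assumed stand-ins, not the brick's own implementation):

```python
# Sketch of a few of the reports above via scikit-learn equivalents
# (assumed stand-ins, not the brick's own implementation).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = model.predict(X_test)
scores = model.predict_proba(X_test)[:, 1]

print(classification_report(y_test, pred))  # cf. Classification Report
print(confusion_matrix(y_test, pred))       # cf. Confusion Matrix
print(roc_auc_score(y_test, scores))        # area under the ROC curve
```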

Example of usage

Let's consider the simple Titanic classification problem. It is a binary classification problem: in the end, we need to predict whether a passenger on the Titanic survived or not. We have the following variables:
  • passengerid (category) - ID of passenger
  • name (category) - Passenger's name
  • pclass (category) - Ticket class
  • sex (category) - Gender
  • age (numeric) - Age in years
  • sibsp (numeric) - Number of siblings / spouses aboard the Titanic
  • parch (category) - Number of parents / children aboard the Titanic
  • ticket (category) - Ticket number
  • fare (numeric) - Passenger fare
  • cabin (category) - Cabin number
  • embarked (category) - Port of Embarkation
  • survived (boolean) - True/False
Then we use the Missing Values Treatment brick to fill in some missing values, and then filter some rows before fitting the model. After training, the brick's results can be inspected through the Model Performance reports described above.
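For comparison, a rough scikit-learn equivalent of this pipeline might look like the sketch below; the file path "titanic.csv" and the exact preprocessing choices are assumptions for illustration:

```python
# Rough scikit-learn equivalent of the pipeline above (the file path and
# preprocessing choices are assumptions for illustration).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("titanic.csv")  # hypothetical path to the dataset

# Missing values treatment: fill numeric gaps, as the brick does upstream.
df["age"] = df["age"].fillna(df["age"].median())
df["fare"] = df["fare"].fillna(df["fare"].median())

# Ignore identifier-like columns (the filter column setting): the model
# should not learn from names or ticket numbers.
features = ["pclass", "sex", "age", "sibsp", "parch", "fare", "embarked"]
X = pd.get_dummies(df[features], drop_first=True)  # encode categoricals
X = X.fillna(0)                                    # guard any remaining gaps
y = df["survived"].astype(int)                     # target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on the held-out rows
```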