Binning

General information

The Binning brick provides a powerful interactive interface for quantizing numerical variables with respect to a binary target variable. The quality of the binning is assessed via the predictive power of the resulting categorical variable (see the Weight of Evidence and Information Value details below), so that the categorization can be evaluated and tuned.
The Binning dashboard supports four binning strategies:
  • Quantile-based auto-binning. The resulting categories are obtained by dividing the processed data into n bins of equal size (each bin contains approximately the same number of observations).
  • Range-based auto-binning. The resulting categories are obtained by dividing the processed data into n ranges of equal width.
  • IV-optimal auto-binning. The obtained bins provide the best Information Value for the resulting categorical variable. Optimal bins are found via tree-based binning, with the model parameters optimized with respect to the Information Value of the resulting categories. The basic idea of the tree-based approach is to quantize the analyzed variable by fitting a decision tree that infers the binary target variable from the numerical predictor; the final decision rules are interpreted as the binning procedure (see the sketch after this list).
  • Manual binning. The user can adjust the auto-binning results by manipulating the obtained bins manually - merging, splitting, renaming, etc. The quality of the bins can also be analyzed via Information Value assessment.
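To illustrate the tree-based idea behind IV-optimal auto-binning, here is a minimal sketch assuming scikit-learn and pandas (the brick's actual implementation also tunes the tree parameters against the Information Value, which is omitted here; the column names and tree settings are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def tree_based_edges(x: pd.Series, y: pd.Series, max_leaf_nodes: int = 4) -> np.ndarray:
    """Fit a shallow decision tree that predicts the binary target y from a
    single numeric variable x, then read the split points back as bin edges."""
    tree = DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes,
                                  min_samples_leaf=0.05)  # guard against tiny bins
    mask = x.notna()
    tree.fit(x[mask].to_numpy().reshape(-1, 1), y[mask])
    # Internal nodes carry the split thresholds; leaf nodes are marked with feature == -2.
    edges = tree.tree_.threshold[tree.tree_.feature != -2]
    return np.sort(edges)

# Illustrative usage: interpret the decision rules as a binning procedure.
# df = pd.read_csv("titanic.csv")
# edges = tree_based_edges(df["age"], df["survived"])
# df["age_bins"] = pd.cut(df["age"], [-np.inf, *edges, np.inf])
```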

Description

Brick Location

Bricks → Data Manipulation → Transform → Binning
Bricks → Analytics → Feature Engineering → Binning
Bricks → Use Cases → Credit Scoring → Feature Engineering → Binning

Brick Parameters

  • Target
    • A binary variable used as the target in a binary classification problem. The Information Value of the individual predictors is calculated with respect to the specified target. The target variable should be present in the input dataset and take two values - (0, 1).
      The target variable is optional and is used for binning quality assessment.
  • Default number of bins
    • The default number of bins used for auto-binning at the first run.
  • Columns for binning (configured via dashboard)
    • The user can configure the list of numerical variables to be categorized by selecting/unselecting the available columns on the Binning dashboard
  • Binning Rules (configured via dashboard)
    • Binning rules are created after the binning configuration and can be adjusted via the Binning dashboard
      Default Parameters
      At the first run, binning is performed with the initial settings: quantile-based auto-binning with n=5 is applied to all numerical variables (see the sketch below).
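As a point of reference, this first-run behavior roughly corresponds to pandas' quantile binning; a minimal sketch, assuming pandas and an illustrative titanic.csv input:

```python
import pandas as pd

df = pd.read_csv("titanic.csv")   # illustrative input
n_bins_default = 5                # the "Default number of bins" parameter

# Quantile-based auto-binning: n bins with roughly equal numbers of observations.
df["age_bins"] = pd.qcut(df["age"], q=n_bins_default, duplicates="drop")

# Range-based auto-binning would instead use n equal-width intervals:
# df["age_bins"] = pd.cut(df["age"], bins=n_bins_default)
```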

Brick Inputs/Outputs

  • Inputs
    • The brick takes a dataset that contains the binary target variable and independent predictors
  • Outputs
    • The brick produces the dataset with extra columns - the results of binning the numerical variables

Dashboard Description

notion image
  1. Target variable selection. This parameter is optional, but it should be used if the user wants to assess the quality of the categorized variables and of the separate bins with respect to the binary target variable.
  2. The default number of bins, used to initialize the binning.
  3. "Variable for the binning" checkbox. If selected, the corresponding variable will be processed and categorized, and an extra column named <variable>_bins will be added to the output dataset.
  4. Numerical variable info-box. This box contains information about a variable, including a quality assessment of its bins:
      • Variable name
      • "Variable for the binning" checkbox
      • Bins list
      • Categorical variable predictive power
  5. Numerical variable histogram with visualization of the bin boundaries.
  6. Auto-binning parameters. Three auto-binning methods are available; the Quantile- and Range-based methods are configured with the number of expected categories. To get the result of the auto-categorization, press the "Generate Bins" button. Note that auto-binning is available for the selected variables only.
    1. The "Remove outliers" checkbox defines the histogram view (whether the histogram includes the outliers or not). Also, if "Remove outliers" is turned off, auto-bins will be generated for all data, including outliers.
  7. Manual binning form. The table contains the results of both auto and manual binning:
      • Bin Name - the name of the category that will be used in the categorized variable
      • X0 - the lower boundary of the bin
      • X1 - the upper boundary of the bin
      • WoE - Weight of Evidence of the category
      • IV - Information Value of the category (see the sketch after this list)
      • Brackets (square brackets and parentheses) - define whether a margin is included in the corresponding bin or not. For instance, if the bin "child" of the variable "Age" is described by the range [0, 12), only persons younger than 12 years old can be assigned to the category "child".
      In manual mode, the user can merge and/or split the bins and rename the resulting categories. To apply these settings, press the "Apply" button. Note that manual binning is available for the selected variables only.
      The "Save" button allows saving the manual rules so that they will be applied after all columns are configured.
      The "Reset" button resets the variable binning to the default settings.

Example of usage

Let's consider a binary classification problem based on the Titanic dataset. The inverted target variable takes two values: survived (0) - the good or non-event case, and not-survived (1) - the bad or event case. The general information about the predictors is presented below:
  • passengerid (category) - ID of passenger
  • name (category) - Passenger's name
  • pclass (category) - Ticket class
  • sex (category) - Gender
  • age (numeric) - Age in years
  • sibsp (numeric) - Number of siblings / spouses aboard the Titanic
  • parch (category) - Number of parents / children aboard the Titanic
  • ticket (category) - Ticket number
  • fare (numeric) - Passenger fare
  • cabin (category) - Cabin number
  • embarked (category) - Port of Embarkation
Most of the predictors are categorical or take only a few unique values, but age and fare are numerical variables that might be useful to categorize, especially if we are going to create a scoring model that returns a Survival Score. In this case, we may put the Binning brick into the pipeline and get a categorical feature vector for further modeling.
Let's create a simple pipeline and configure the Binning brick:
  1. Create the pipeline with Data Import (titanic.csv), connect the Binning brick, and define the Target Variable (survived)
    1. notion image
  2. Run the pipeline and open the Binning dashboard (press "Binning Adjustment")
    1. notion image
  3. Now we can adjust the binning and rename the categories. For instance, suppose we would like to get the optimal categorization from the IV perspective for the age variable:
    1. Select and check the age column - the "Generate Bins" button becomes available
    2. Select IV optimal auto binning mode
    3. Press "Generate Bins"
      1. notion image
    4. You receive the warning "All manual changes for the current column will be replaced with auto-generated bins. Would you like to proceed?". Press "Yes"
      1. notion image
        As we can see, the total number of bins has changed from the default 5 to 4, and the Information Value of the variable age has increased from 0.02 to 0.065.
  4. We may edit the resulting bins with respect to the accepted categories and give more explicit names to the categories - child, young, adult, senior:
    1. notion image
    2. Edit and rename bins
    3. Press "Apply"
      1. notion image
  5. Return to the pipeline and open the output dataset previewer
    1. notion image
      As we can see, the dataset was extended with the age_bins column, whose categories are described by the rules above.
      Note that the dataset contains observations with values that were treated as outliers - for instance, an age of 80 years. Such values are assigned to a new category (max margin, Inf), or (-Inf, min margin) if the value is a low outlier.
  6. Include the outliers in the manually defined bins (see the sketch after this walkthrough):
    1. Open the Binning dashboard
    2. First of all, let's add the outliers to the histogram:
      1. Uncheck the "Remove outliers" flag
      2. Press "Apply"
        1. notion image
          We can see that the histogram has additional bins on the right
      3. Edit the minimal and maximal margins - set them to 0 and 150, respectively
      4. Press "Apply"
        1. notion image
      5. Open the Data Previewer - now you can see that the 80-year-old person is considered a senior
        1. notion image
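To make the last step concrete, here is a minimal sketch of the resulting binning rule, assuming pandas (the boundaries and labels mirror the walkthrough above; the brick's per-bin margin-inclusion flags may differ):

```python
import pandas as pd

df = pd.read_csv("titanic.csv")    # illustrative input
edges = [0, 12, 21, 60, 150]       # X0/X1 boundaries after setting the margins to 0 and 150
labels = ["child", "young", "adult", "senior"]

# With the maximal margin raised to 150, an 80-year-old falls into "senior"
# instead of an overflow category like "(max margin, Inf)".
df["age_bins"] = pd.cut(df["age"], bins=edges, labels=labels, right=False)

# Values outside [0, 150) come back as NaN here; the brick instead reports
# them as explicit "(max margin, Inf)" / "(-Inf, min margin)" categories.
```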

Final Pipeline

The final pipeline contains the Encoding brick for the binarization of the categorical variables and the Modelling-Scoring components
notion image

Pipeline Configuration

Import Minio
"Import Minio": { "brick": "Import Minio", "brick_description": "Imports a file from Minio to the platform", "brick_arguments": { "path": "samples/titanic.csv", "bucket": "storage.testing.datrics.ai", "pathOriginal": "samples/titanic.csv", "frozen_configuration": {} }}
Binning
"Binning": { "brick": "Binning", "brick_arguments": { "action": "calculate", "binning_rules": { "age": { "column": "age", "bins": [ { "include_margin": [ false, false ], "is_locked": true, "iv": 0.03983716989903174, "name": "child", "range": [ 0, 12 ], "woe": -0.65977814494985 }, { "include_margin": [ true, false ], "is_locked": true, "iv": 0.0064576678726842255, "name": "young", "range": [ 12, 21 ], "woe": 0.1872559583684909 }, { "include_margin": [ true, true ], "is_locked": true, "iv": 1.699951979011563E-7, "name": "adult", "range": [ 21, 60 ], "woe": -0.0004957362184482255 }, { "include_margin": [ false, true ], "is_locked": true, "iv": 0.027512355417335172, "name": "senior", "range": [ 60, 150 ], "woe": 1.048617884608208 } ], "iv": 0.0738, "checked": true, "autoConfig": { "algorithm": "ivbin" }, "remove_outliers": false, "algorithm": "manual" } }, "n_bins_default": 5, "columns_for_calculate": null, "configuration": {}, "target": "survived", "time_stamp": 1635775703504991500 }}
Encoding
"Encoding": { "brick": "Encoding", "brick_description": "Performs one-hot encoding or converts variable to the category data type", "brick_arguments": { "columns": [ { "column": "pclass", "encoding": "one-hot", "additional_categories": null }, { "column": "parch", "encoding": "one-hot", "additional_categories": null }, { "column": "embarked", "encoding": "one-hot", "additional_categories": null }, { "column": "age_bins", "encoding": "one-hot", "additional_categories": null }, { "column": "sex", "encoding": "one-hot", "additional_categories": null } ] }}
Filter Columns
"Filter Columns": { "brick": "Filter Columns", "brick_description": "Filters out the columns from a dataset", "brick_arguments": { "columns_list": [ "name", "age", "ticket", "cabin", "fare", "sibsp" ], "leave_flag": false }},
Variable Selection
"Variable Selection": { "brick": "Variable Selection", "brick_description": "Performs the selection of the independent variables that are characterized by sufficient predictive power with respect to the binary target variable. The predictive power of the variable is reflected via Information Value (IV).", "brick_arguments": { "target": "survived", "exclude_columns": null, "threshold": 0.02 }}, "Feature Selection": { "brick": "Feature Selection", "brick_description": "Feature Selection is the process of reducing the number of input variables", "brick_arguments": { "columns_to_exclude": [ "survived" ], "method": "Variance Inflation Factor", "threshold": 5, "target": "", "n_features": 0 }}
Split Data
"Split Data": { "brick": "Split Data", "brick_description": "Splits a dataset into two, allows for stratification", "brick_arguments": { "test_size": 0.1, "shuffle": false }}
Logistic Regression
"Logistic Regression": { "brick": "Logistic Regression", "brick_description": "Binary classification model, in its basic form uses a logistic function.", "brick_arguments": { "regularization": null, "probability": "class", "weight_column": "", "class_balancing": "none", "optimize": false, "optimize_mode": null, "is_train_explainer": false, "target_variable": "survived", "is_show_advanced_mode": true }}
ScoreCard
"ScoreCard": { "brick": "ScoreCard", "brick_description": "Performs the transformation of Logistic Regression Model coefficients to the scorecard.Transformation can be performed with/without taking into account features Weight of Evidence", "brick_arguments": { "scaling_type": "odds_scaling", "scaling_params": { "pdo": 20, "target_odds": 50, "target_score": 600 }, "woe_correction": true, "columns_list": [], "leave_flag": false, "target": "survived" }}
Predict
"Predict": { "brick": "Predict", "brick_description": "Generates predictions for the trained model using given dataset", "brick_arguments": {}}

Results

Let's consider the pipeline results. The "Model Performance" dashboard of the Logistic Regression model shows that the most influential features are "sex_male", "age_bins_child", "pclass_3.0" and "pclass_1.0"
notion image
If we look at the scorecard, we can see that the conditions "pclass_1.0" = True and "age_bins_child" = True significantly decrease the risk of "not-survived", while "sex_male" increases this risk.
notion image