Brick provides a powerful interactive interface for quantizing numerical variables with respect to a binary target variable. The quality of the binning is assessed by the predictive power of the resulting categorical variable (see the details below), which helps ensure that the categorization is as informative as possible.
The Binning dashboard supports four binning strategies:
- Quantile-based auto-binning. The resulting categories are obtained by dividing the processed data into n equal-sized bins (equal counts).
- Range-based auto-binning. The resulting categories are obtained by dividing the processed data into n equal-width ranges.
- IV-optimal auto-binning. The obtained bins maximize the Information Value of the resulting categorical variable. Optimal bins are produced via tree-based binning, with the model parameters tuned with respect to the Information Value of the resulting categories. The basic idea of the tree-based approach is to quantize the analyzed variable by fitting a decision tree that infers the binary target variable from the numerical predictor; the final decision rules are interpreted as the binning procedure.
- Manual binning. The user can adjust the auto-binning results by manipulating the obtained bins: merging, splitting, renaming, etc. The quality of the bins can also be analysed via the Information Value assessment.
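The three automatic strategies can be illustrated with a minimal pandas/scikit-learn sketch. This is not the brick's actual implementation; the column name, sample data, and tree hyperparameters are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy numeric predictor and binary target (illustrative only)
rng = np.random.default_rng(0)
x = pd.Series(rng.normal(40, 12, 500), name="age")
y = (x + rng.normal(0, 10, 500) > 45).astype(int)

# 1. Quantile-based: n bins with (roughly) equal counts
quantile_bins = pd.qcut(x, q=5, duplicates="drop")

# 2. Range-based: n equal-width ranges
range_bins = pd.cut(x, bins=5)

# 3. Tree-based: fit a shallow decision tree predicting the target
#    from x alone; its split thresholds become the bin boundaries
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=50)
tree.fit(x.to_frame(), y)
thresholds = sorted(t for t in tree.tree_.threshold if t != -2)  # -2 marks leaf nodes
edges = [-np.inf] + thresholds + [np.inf]
tree_bins = pd.cut(x, bins=edges)
```

In the IV-optimal mode, parameters such as the tree depth would additionally be tuned so that the resulting categories maximize the Information Value.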
Bricks → Data Manipulation → Transform → Binning
Bricks → Analytics → Feature Engineering → Binning
Bricks → Use Cases → Credit Scoring → Feature Engineering → Binning
A binary variable that is used as the target in a binary classification problem. The Information Value of each predictor is calculated with respect to the specified target. The target variable must be present in the input dataset and take exactly two values: 0 and 1.
- Columns to binning (configured via dashboard)
The user may configure the list of numerical variables to be categorized by selecting/unselecting the available columns on the Binning Dashboard.
- Binning Rules (configured via dashboard)
Binning rules are created after the binning is configured and can be adjusted via the Binning dashboard.
On the first run, binning is performed with the initial settings: quantile-based auto-binning with n=5 is applied to all numerical variables.
Brick takes a dataset that contains the binary target variable and the independent predictors.
Brick produces the dataset with extra columns: the results of the numerical variables' binning.
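The input/output contract can be sketched as follows: for each selected numeric column, the output dataset gains a `<variable>_bins` categorical column. The sample data, bin edges, and labels below are illustrative assumptions, not the brick's defaults:

```python
import pandas as pd

# Toy input dataset: binary target plus a numeric predictor
df = pd.DataFrame({
    "survived": [0, 1, 0, 1],
    "age": [4.0, 30.0, 52.0, 71.0],
})

# The brick would append a categorical "<variable>_bins" column;
# here the edges and labels are hand-picked for illustration
df["age_bins"] = pd.cut(df["age"], bins=[0, 18, 40, 60, 120],
                        labels=["child", "adult", "adult-senior", "senior"])
```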
- Target variable selection. This parameter is mandatory because the binning is performed considering the binary target variable.
- "Variable for the binning" checkbox. If selected, the corresponding variable will be processed and categorized, and an extra column named <variable>_bins will be added to the output dataset.
- Numerical variable info-box. This box contains information about a variable, including a quality assessment of its bins:
- Variable name
- "Variable for the binning" checkbox
- Bins list
- Categorical variable predictive power
- Numerical variable histogram with the bins boundaries visualization
- Auto-binning parameters. Three auto-binning methods are available; the Quantile- and Range-based methods are configured with the number of expected categories. To get the result of the auto-categorization, press the "Auto" button. Note that auto-binning is available for the selected variables only.
- Manual binning form. The table contains the result of both auto and manual binning:
- Bin Name - the name of the category that will be used in the categorised variable
- X0 - the low boundary of the bin
- X1 - the high boundary of the bin
- WoE - Weight of Evidence of the category
- IV - Information Value of the category
In the manual mode, the user can merge and/or split the bins and rename the resulting categories. To apply these settings, press the "Compute" button. Note that manual binning is available for the selected variables only.
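The WoE and IV columns of the form can be reproduced with a short sketch. It follows the target coding used by this brick (0 = good/non-event, 1 = bad/event) and the convention WoE = ln(%good / %bad); some references use the inverse ratio, so treat the sign convention as an assumption:

```python
import numpy as np
import pandas as pd

def woe_iv_table(bins: pd.Series, target: pd.Series) -> pd.DataFrame:
    """Per-category Weight of Evidence and Information Value contribution."""
    counts = pd.crosstab(bins, target)   # counts of 0s and 1s per bin
    good = counts[0] / counts[0].sum()   # share of non-events in each bin
    bad = counts[1] / counts[1].sum()    # share of events in each bin
    woe = np.log(good / bad)             # WoE = ln(%good / %bad)
    iv = (good - bad) * woe              # per-bin IV contribution
    return pd.DataFrame({"WoE": woe, "IV": iv})

# Toy example with hand-made categories
bins = pd.Series(["child", "child", "adult", "adult",
                  "senior", "senior", "adult", "child"])
target = pd.Series([0, 1, 0, 1, 0, 1, 1, 0])
table = woe_iv_table(bins, target)
total_iv = table["IV"].sum()  # the variable-level Information Value
```

Summing the per-bin IV contributions gives the total Information Value reported as the categorical variable's predictive power.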
The "Reset" button resets the variable's binning to the default settings.
Let's consider the Titanic survival binary classification problem. The inverse target variable takes two values: survived (0), the good or non-event case, and not-survived (1), the bad or event case. The general information about the predictors is given below:
- passengerid (category) - ID of passenger
- name (category) - Passenger's name
- pclass (category) - Ticket class
- sex (category) - Gender
- age (numeric) - Age in years
- sibsp (numeric) - Number of siblings / spouses aboard the Titanic
- parch (category) - Number of parents / children aboard the Titanic
- ticket (category) - Ticket number
- fare (numeric) - Passenger fare
- cabin (category) - Cabin number
- embarked (category) - Port of Embarkation
Most of the predictors are categorical or take only a few unique values, but age and fare are numerical variables that may be worth categorizing, especially if we are going to create a scoring model that returns a Survival Score. In this case, we can put the Binning brick into the pipeline and get a categorical feature vector for further modeling.
Let's create a simple pipeline and configure the Binning brick:
- Create the pipeline with Data Import (titanic.csv) and Typization brick
- Connect the Binning brick and open Binning dashboard (press Binning Adjustment)
- Define the Target Variable (survived)
- Run the pipeline. After the pipeline executes, the Binning dashboard is ready for configuration
- Now we can adjust the binning and rename the categories. For instance, suppose we want the IV-optimal categorization for the age variable:
- Select and check the age column; the "Auto" button becomes available
- Select IV optimal auto binning mode
- Press "Auto"
As we can see, optimal binning increased the final Information Value (from 0.02 to 0.07).
- We may rename the resulting bins to give the categories more descriptive names: child, young, adult, adult-senior, senior
- Rename columns
- Press "Calculate"
- Return to the pipeline and open the output dataset previewer
As we can see, the dataset has been extended with the age_bins column, whose categories are described by the rules above.
The final pipeline contains the Encoding brick for binarizing the categorical variables and the Modelling and Scoring components.