Variable Selection

General information

The brick selects the independent variables that have sufficient predictive power with respect to the binary target variable. The predictive power of a variable is measured by its Information Value (IV); please see Information Value for the details:
Information Value explanation
Information Value
Variable Predictiveness
Extremely weak predictive power
Weak predictive power
Moderate predictive power
Strong predictive power
Very strong predictive power
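
To make the measure concrete, here is a minimal Python sketch of how the Information Value of a single categorical predictor is commonly computed: the sum over categories of (share of non-events - share of events) * ln(share of non-events / share of events). It is an illustration under stated assumptions, not the brick's actual implementation: the function name information_value and the smoothing constant are hypothetical, and the brick may bin numeric variables or handle empty categories differently.

```python
import math
from collections import Counter


def information_value(values, target, smoothing=0.5):
    """Information Value of one categorical predictor against a 0/1 target.

    values: the category of each row.
    target: 0 (non-event) or 1 (event) for each row.
    smoothing: hypothetical additive constant (an assumption of this sketch)
        that avoids log(0) when a category has no events or no non-events.
    """
    events = Counter()
    non_events = Counter()
    for value, label in zip(values, target):
        if label == 1:
            events[value] += 1
        else:
            non_events[value] += 1

    categories = set(events) | set(non_events)
    # Smoothed totals, so that the shares below still sum to 1.
    total_events = sum(events.values()) + smoothing * len(categories)
    total_non_events = sum(non_events.values()) + smoothing * len(categories)

    iv = 0.0
    for category in categories:
        event_share = (events[category] + smoothing) / total_events
        non_event_share = (non_events[category] + smoothing) / total_non_events
        weight_of_evidence = math.log(non_event_share / event_share)
        iv += (non_event_share - event_share) * weight_of_evidence
    return iv


# Toy data invented for illustration only.
toy_values = ["m", "m", "m", "m", "f", "f", "f", "f"]
toy_target = [1, 1, 1, 0, 0, 0, 0, 1]
print(information_value(toy_values, toy_target))
```

Every term of the sum is non-negative, so a predictor whose categories carry no information about the target scores 0, and the score grows as the distributions of events and non-events diverge.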

Description

Brick Location

Bricks → Data Manipulation → Filter → Variable Selection
Bricks → Analytics → Features Engineering → Variable Selection
Bricks → Analytics → Data Insights → Variable Selection
Bricks → Use Cases → Credit Scoring → Features Engineering → Variable Selection

Brick Parameters

  • Target
    • The binary variable that is used as the target in a binary classification problem. The Information Value of each predictor is calculated with respect to the specified target. The target variable must be present in the input dataset and take exactly two values: 0 and 1.
  • Columns to exclude
    • List of columns to exclude from the analysis. These columns are passed to the output dataset regardless of their predictive power. Several columns can be chosen by clicking the '+' button in the brick settings.
  • Threshold
    • Minimum Information Value that a variable (predictor) must reach to be considered sufficiently predictive and therefore suitable for further predictive modeling.

Brick Inputs/Outputs

  • Inputs
    • The brick takes a dataset that contains the binary target variable and the independent predictors
  • Outputs
    • The brick produces a dataset that contains:
      • Columns that are considered informative predictors
      • The binary target variable
      • Columns that were excluded from the analysis
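
The parameters and outputs described above can be summarized in a short Python sketch. It assumes the Information Value of every predictor has already been computed; the function name select_variables, its argument names, and the "greater than or equal to" comparison against the threshold are assumptions made for illustration and are not taken from the brick itself.

```python
def select_variables(columns, iv_scores, target, exclude=(), threshold=0.05):
    """Return the names of the output columns, in their original order.

    columns: all column names of the input dataset.
    iv_scores: mapping of predictor name -> Information Value.
    target: name of the binary target column (always kept).
    exclude: columns skipped by the analysis and passed through as-is.
    threshold: minimum Information Value of an informative predictor.
    """
    if target not in columns:
        raise ValueError(f"target column {target!r} is missing from the dataset")

    excluded = set(exclude)
    kept = []
    for column in columns:
        if column == target or column in excluded:
            # The target and the excluded columns bypass the IV check.
            kept.append(column)
        elif iv_scores.get(column, 0.0) >= threshold:
            # Assumption: a score equal to the threshold is still accepted.
            kept.append(column)
    return kept


# Column names and scores below are invented for illustration only.
toy_columns = ["name", "feature_a", "feature_b", "target"]
toy_scores = {"name": 0.0, "feature_a": 0.4, "feature_b": 0.01}
print(select_variables(toy_columns, toy_scores, target="target",
                       exclude=["name"], threshold=0.05))
```

In this toy run the excluded column name and the target are kept unconditionally, feature_a passes the threshold, and feature_b is dropped.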

Example of usage

Let's consider the binary classification problem "Binary classification: Titanic". The inverse target variable takes two values: survived (0), the good or non-event case, and not-survived (1), the bad or event case. General information about the predictors is given below:
  • passengerid (category) - ID of passenger
  • name (category) - Passenger's name
  • pclass (category) - Ticket class
  • sex (category) - Gender
  • age (numeric) - Age in years
  • sibsp (numeric) - Number of siblings / spouses aboard the Titanic
  • parch (category) - Number of parents / children aboard the Titanic
  • ticket (category) - Ticket number
  • fare (numeric) - Passenger fare
  • cabin (category) - Cabin number
  • embarked (category) - Port of Embarkation
As we know, variables with good predictive power should have an Information Value higher than 0.5. The Information Value brick may help us make a preliminary assessment of the dataset (please see Information Value for the details):

notion image

Now we can see that the predictors sex, fare, pclass, and embarked can be considered informative with an IV threshold equal to 0.5.
Suppose we would like to keep only the informative variables but retain the passengers' names in the output dataset as identifiers. For this purpose, we should add the Variable Selection brick to the pipeline and configure it:
  • pass the input dataset to the Variable Selection brick
  • define the target (the survived column)
  • add the name column to the Columns to exclude list
  • set the Information Value Threshold to 0.05
  • run the pipeline

Assessment of the Information Values of the predictors

First of all, we may get a general assessment of the predictors in the context of their ability to predict the target variable. For this purpose we should:
  • pass the input dataset to the Information Value brick
  • define the target (the survived column)
  • choose the "Information Value only" mode
  • run the pipeline

notion image

To see the assessment result, open the Output data previewer in the right sidebar.

notion image

The result is depicted in the table:

notion image

As we can see, the resulting dataset contains all informative variables (sex, fare, pclass, and embarked), the saved column (name), and the target variable (survived). The remaining columns were excluded as non-informative.