Variable Selection

General information

The brick selects the independent variables that have sufficient predictive power with respect to the binary target variable. The predictive power of a variable is measured by its Information Value (IV); see Information Value for the details:

Information Value explanation

Information Value     Variable Predictiveness
< 0.02                Extremely weak predictive power
from 0.02 to 0.1      Weak predictive power
from 0.1 to 0.3       Moderate predictive power
from 0.3 to 0.5       Strong predictive power
> 0.5                 Very strong predictive power
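The IV formula itself is not spelled out on this page. As an illustration only, the sketch below uses the common weight-of-evidence formulation, IV = sum over categories of (share of goods - share of bads) * ln(share of goods / share of bads), for a single categorical predictor, and then maps the score to the bands from the table. The function names, the smoothing constant eps, and the handling of the shared band edges 0.1 and 0.3 are assumptions; the brick's own computation (for example, how it bins numeric variables) may differ.

```python
import math
from collections import Counter


def information_value(values, target, eps=0.5):
    """IV of one categorical predictor against a binary 0/1 target.

    0 is treated as the good (non-event) case and 1 as the bad (event) case.
    eps is an assumed smoothing constant added to every count so that an
    empty cell cannot cause log(0) or a division by zero.
    """
    goods = Counter(v for v, t in zip(values, target) if t == 0)
    bads = Counter(v for v, t in zip(values, target) if t == 1)
    categories = set(goods) | set(bads)
    total_good = sum(goods.values()) + eps * len(categories)
    total_bad = sum(bads.values()) + eps * len(categories)

    iv = 0.0
    for category in categories:
        share_good = (goods[category] + eps) / total_good
        share_bad = (bads[category] + eps) / total_bad
        iv += (share_good - share_bad) * math.log(share_good / share_bad)
    return iv


def predictiveness(iv):
    """Map an IV score to the bands of the table above.

    Assumption: a value that sits exactly on 0.1 or 0.3 goes to the
    upper band; 0.5 itself stays in "Strong" because the last band is "> 0.5".
    """
    if iv < 0.02:
        return "Extremely weak"
    if iv < 0.1:
        return "Weak"
    if iv < 0.3:
        return "Moderate"
    if iv <= 0.5:
        return "Strong"
    return "Very strong"
```

A predictor whose categories are distributed identically among goods and bads gets an IV of zero, while a predictor that separates the two classes almost perfectly gets a large IV.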

Description

Brick Location

Bricks → Data Manipulation → Filter → Variable Selection

Bricks → Analytics → Features Engineering → Variable Selection

Bricks → Analytics → Data Insights → Variable Selection

Bricks → Use Cases → Credit Scoring → Features Engineering → Variable Selection

Brick Parameters

Target

A binary variable that is used as the target in a binary classification problem. The Information Value of each predictor is calculated with respect to the specified target. The target variable must be present in the input dataset and take exactly two values: 0 and 1.
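The requirement on the target can be checked before the data reaches the brick. The sketch below is a hypothetical pre-check, not part of the brick: it assumes the dataset is available as a list of dictionaries (one per row), and the function name check_binary_target is invented for illustration.

```python
def check_binary_target(rows, target):
    """Raise ValueError unless `target` exists in every row and takes
    exactly the two values 0 and 1 across the dataset.

    `rows` is assumed to be a list of dicts, one dict per record.
    """
    if not rows or any(target not in row for row in rows):
        raise ValueError(f"target column {target!r} is missing from the dataset")
    observed = {row[target] for row in rows}
    if observed != {0, 1}:
        raise ValueError(
            f"target column {target!r} must take the values 0 and 1, "
            f"found {sorted(observed, key=str)}"
        )
```

Requiring both values to be present is a design choice of this sketch: with only one class in the data, the shares of goods and bads cannot both be computed.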

Columns to exclude

List of columns to exclude from the analysis. These columns are passed to the output dataset regardless of their predictive power. To exclude several columns, click the '+' button in the brick settings.

Threshold

Minimal Information Value score at which a variable (predictor) is considered to have sufficient predictive power and is therefore kept for further predictive modeling.

Brick Inputs/Outputs

Inputs

The brick takes a dataset that contains the binary target variable and the independent predictors.

Outputs

The brick produces a dataset that contains:

Columns that are considered informative predictors
The binary target variable
Columns that were excluded from the analysis
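The composition of the output can be sketched as a simple filter over column names. This is a minimal illustration under stated assumptions, not the brick's implementation: the IV scores are taken as precomputed, the function name select_variables is hypothetical, a score equal to the threshold is assumed to pass, and the output is assumed to keep the input column order.

```python
def select_variables(columns, iv_scores, target, exclude, threshold):
    """Return the names of the columns that go to the output dataset.

    columns   -- all column names of the input dataset, in order
    iv_scores -- assumed mapping from predictor name to its precomputed IV
    target    -- name of the binary target column (always kept)
    exclude   -- columns skipped by the analysis and always kept
    threshold -- minimal IV for a predictor to count as informative
    """
    if target not in columns:
        raise ValueError(f"target column {target!r} is not in the dataset")
    kept = []
    for column in columns:
        if column == target or column in exclude:
            # The target and the excluded columns bypass the IV check.
            kept.append(column)
        elif iv_scores.get(column, 0.0) >= threshold:
            kept.append(column)
    return kept


# Usage with made-up column names and made-up scores:
print(select_variables(
    columns=["id", "x1", "x2", "y"],
    iv_scores={"x1": 0.6, "x2": 0.01},
    target="y",
    exclude=["id"],
    threshold=0.5,
))
```

In this toy call, id is kept because it is excluded from the analysis, x1 is kept because its score reaches the threshold, x2 is dropped, and y is kept as the target.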

Example of usage

Let's consider a binary classification problem: Titanic. The inverse target variable takes two values: survived (0), the good or non-event case, and not-survived (1), the bad or event case. General information about the predictors is given below:

passengerid (category) - ID of passenger

name (category) - Passenger's name

pclass (category) - Ticket class

sex (category) - Gender

age (numeric) - Age in years

sibsp (numeric) - Number of siblings / spouses aboard the Titanic

parch (category) - Number of parents / children aboard the Titanic

ticket (category) - Ticket number

fare (numeric) - Passenger fare

cabin (category) - Cabin number

embarked (category) - Port of Embarkation

As we know, variables with good predictive power should have an Information Value higher than 0.5. The Information Value brick can help us make a preliminary assessment of the dataset (see Information Value for the details):

Now we can see that the predictors sex, fare, pclass, and embarked can be considered informative with an IV threshold equal to 0.5.

For instance, suppose we would like to keep only the informative variables but retain the passengers' names in the output dataset as identifiers. For this purpose, we add the Variable Selection brick to the pipeline and configure it:

pass the input dataset to the Variable Selection brick

define the target (survived column)

add the name column to the Columns to exclude list

set the Information Value Threshold to 0.05

run the pipeline

Assessing the Information Values of the predictors

First of all, we may get a general assessment of the predictors in the context of their ability to predict the target variable. For this purpose we should:

pass the input data set to the Information Value brick

define the target (survived column)

choose the "Information Value only" mode

run pipeline

In order to see the assessment result, you should open the Output data previewer on the right sidebar.

The result is depicted in the table:

As we can see, the resulting dataset contains all the informative variables (sex, fare, pclass, and embarked), the saved column (name), and the target variable (survived). The remaining columns were excluded as non-informative.