Binary classification: Titanic (new)


The presented pipeline solves a classical binary classification problem: predicting the survival of Titanic passengers. It demonstrates a data processing and machine learning scenario for assessing which passengers were more likely to survive the Titanic shipwreck.

Problem Statement

Based on the information about the Titanic's passengers, predict if a particular passenger survived the Titanic shipwreck or not.



Modeling scenario

The general schema of the Binary Classification: Titanic pipeline can be described as the following sequence of steps:
  1. Prepare the initial dataset with the target variable (survived) and potentially useful explanatory variables
  2. Extract features that may be relevant for model training
  3. Split the initial data into train/test sets
  4. Train the prediction models using the train set
  5. Evaluate the built models on the test set and select the best one for further use

Datrics Pipeline

Pipeline Schema

The full pipeline is presented in the following way:
notion image

Pipeline Scenario

Overall, the pipeline can be split into the following groups: dataset preprocessing, feature engineering, data splitting, model training and model testing. Let us consider every group in detail below.

Dataset preprocessing

First, we upload the data from Storage → Samples → titanic.csv and check the number of missing values in the dataset.
notion image
Here we propose to fill the empty values of the age and fare columns with the median computed on the input sample, and to delete the cabin column because of its very high share of missing values. We also apply autosuggestion to the embarked column, which fills its missing values with the S category.
After that, we filter out other non-informative columns with a very high ratio of unique values, such as passenger name and ticket number, since they identify particular passengers rather than describe their general characteristics.
notion image
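Outside of Datrics, the preprocessing steps above can be approximated with pandas. This is a minimal sketch on a tiny made-up sample, not the pipeline's own implementation: the rows are invented for illustration, and filling embarked with the most frequent category is an assumption about how the autosuggestion behaves.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample; the real pipeline loads titanic.csv from Storage.
df = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "ticket": ["t1", "t2", "t3", "t4", "t5"],
    "age": [22.0, np.nan, 38.0, 26.0, np.nan],
    "fare": [7.25, 71.28, np.nan, 8.05, 53.10],
    "cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "embarked": ["S", "C", None, "S", "Q"],
    "survived": [0, 1, 1, 0, 1],
})

# Share of missing values per column, highest first.
missing_ratio = df.isna().mean().sort_values(ascending=False)

# Median imputation for the numeric columns.
for col in ["age", "fare"]:
    df[col] = df[col].fillna(df[col].median())

# Assumption: fill embarked with its most frequent category ("S" in this sample).
df["embarked"] = df["embarked"].fillna(df["embarked"].mode().iloc[0])

# Drop the sparse cabin column and the identifier-like columns.
df = df.drop(columns=["cabin", "name", "ticket"])
```

Computing the medians and the most frequent category on the training portion only would avoid leaking test information; the sketch follows the text and computes them on the whole input sample.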

Feature engineering

Now we are ready to experiment with feature engineering by encoding the existing variables or deriving new features.
First, we transform the continuous variables (age and fare) into categorical ones by applying IV optimal automatic binning (IV stands for Information Value) with respect to the target variable (survived):
notion image
notion image
This gives us the new columns age_bins and fare_bins, so we can remove the original age and fare columns from the dataset:
notion image
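To make the binning criterion more concrete, the sketch below computes the Information Value of a binned feature against a binary target and then replaces the continuous column with its binned version. It is only an illustration: fixed quartile bins from `pd.qcut` stand in for the automatic search for optimal bin edges, the data is synthetic, and the helper `information_value` and its smoothing constant `eps` are hypothetical names introduced here.

```python
import numpy as np
import pandas as pd


def information_value(bins, target, eps=0.5):
    """Information Value of a binned feature against a binary target (1 = event).

    eps is a small smoothing count (an assumption of this sketch) that keeps
    the logarithm finite when a bin contains only one class.
    """
    counts = pd.crosstab(bins.astype(str), target).reindex(columns=[0, 1], fill_value=0)
    events = counts[1] + eps
    non_events = counts[0] + eps
    dist_events = events / events.sum()
    dist_non_events = non_events / non_events.sum()
    woe = np.log(dist_events / dist_non_events)  # weight of evidence per bin
    return float(((dist_events - dist_non_events) * woe).sum())


# Synthetic data: younger passengers are given a higher survival probability.
rng = np.random.default_rng(0)
age = rng.uniform(1, 80, size=500)
survived = rng.binomial(1, np.where(age < 15, 0.7, 0.35))
df = pd.DataFrame({"age": age, "survived": survived})

# Quartile bins as a simple stand-in for IV-optimal bin edges.
df["age_bins"] = pd.qcut(df["age"], q=4)
iv_age = information_value(df["age_bins"], df["survived"])

# The binned column replaces the continuous one.
df = df.drop(columns=["age"])
```

A higher Information Value means the bins separate survivors from non-survivors more strongly; an optimizer would compare candidate bin edges by this score instead of fixing them in advance.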
Now we encode some of the columns as follows:
notion image
As you can see, we transform the sex column into a binary category, apply a label encoder to age_bins and fare_bins, and create dummy variables for the embarked column.
Finally, we set the appropriate data types for the existing columns and pass the resulting dataset to the next step.
notion image
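The three encodings described above can be sketched with pandas and scikit-learn as follows. The sample rows and bin labels are made up, and mapping female to 1 is an arbitrary choice for this example rather than something the pipeline prescribes.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows with the columns produced by the previous steps.
df = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "age_bins": ["(0, 16]", "(16, 40]", "(40, 80]", "(16, 40]"],
    "fare_bins": ["(0, 10]", "(10, 50]", "(50, 600]", "(0, 10]"],
    "embarked": ["S", "C", "Q", "S"],
    "survived": [0, 1, 1, 0],
})

# Binary encoding: 1 for female, 0 for male (arbitrary choice for this sketch).
df["sex"] = (df["sex"] == "female").astype(int)

# Label encoding: one integer code per bin. LabelEncoder sorts labels
# lexicographically, which happens to match the bin order in this sample.
for col in ["age_bins", "fare_bins"]:
    df[col] = LabelEncoder().fit_transform(df[col])

# Dummy (one-hot) variables for the port of embarkation.
df = pd.get_dummies(df, columns=["embarked"], dtype=int)

# Make the data types explicit before handing the table to the next step.
df = df.astype("int64")
```

Label encoding keeps a single ordered column per binned feature, which suits tree-based models, while dummy variables avoid imposing an artificial order on the embarked categories.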

Split data

notion image
Split Data

Here we prepare the data samples for model training and validation.
For model testing, we hold out 20% of the input sample and, to keep the subgroups proportionally represented in both sets, additionally stratify the split by the survived and age_bins variables.
notion image
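A stratified 80/20 split on two variables can be sketched with scikit-learn by combining both columns into a single stratification key. The data below is synthetic and the `random_state` value is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded dataset.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "survived": rng.integers(0, 2, size=400),
    "age_bins": rng.integers(0, 3, size=400),
    "fare_bins": rng.integers(0, 3, size=400),
})

# train_test_split stratifies on one label array, so the two columns
# are merged into a combined key such as "1_2".
strata = df["survived"].astype(str) + "_" + df["age_bins"].astype(str)

train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=strata, random_state=42
)
```

Each combination of survived and age_bins then appears in the train and test sets in roughly the same proportion as in the full sample. Every combination needs at least two rows for the split to succeed.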

Model training

Now, using the train data, we build several prediction models to compare and select the best one for our case.
The first model we chose to experiment with is Random Forest Binary. Here we use all the features present in our dataset, specify the target variable (survived), and define the model hyperparameters as follows:
notion image
Alternatively, using the same train set we build another tree-based model - LGBM Binary with the following settings:
notion image
After performing these steps, we can evaluate the models' performance on the train data using Model info → Open view in each brick's menu and, for instance, compare the common prediction metrics of the two models.
Random Forest Binary performance (train)
notion image
LGBM Binary performance (train)
notion image
Additionally, on the same tab we can review other model characteristics, such as feature importance and the ROC curve.
notion image
notion image

Model testing

notion image

Next, we evaluate the trained models on the test set using the Predict brick and compare the models' performance on the Predict stats → Model Performance tab of each brick:
Random Forest Binary performance (test)
notion image
LGBM Binary performance (test)
notion image
From this we can see that the LGBM Binary model gives more accurate results on both the train and test sets, so it makes sense to use this model for further predictions.
Note that we can also extract the predictions from the output table of the Predict brick: all predicted values are stored in the new column predicted_survived:
notion image

Model Save and Deployment

Models Save/Download
AutoModel APIs