Binary classification: Titanic

Description

Pipeline for solving the classical binary classification problem: Titanic passenger survival prediction. This pipeline demonstrates how a data processing and machine learning scenario is applied to create a model that assesses which passengers were more likely to survive the Titanic shipwreck.

Problem Statement

Based on the information about the Titanic's passengers, predict whether a particular passenger survived the Titanic shipwreck.

Dataset

Titanic Passengers public dataset, which contains 891 records about the Titanic's passengers with the expected outcome for each passenger (survived or not).

Data Description

Variable      Definition                                        Key
PassengerId   ID of passenger
Survived      Survival                                          0 = No, 1 = Yes
Name          Passenger's name
Pclass        Ticket class                                      1 = 1st, 2 = 2nd, 3 = 3rd
Sex           Sex
Age           Age in years
SibSp         Number of siblings / spouses aboard the Titanic
Parch         Number of parents / children aboard the Titanic
Ticket        Ticket number
Fare          Passenger fare
Cabin         Cabin number
Embarked      Port of Embarkation                               C = Cherbourg, Q = Queenstown, S = Southampton

Target Variable

  • survived

Datrics Pipeline

Pipeline Schema

(Pipeline schema diagram)

Pipeline Description

  1. Data Load
    Use sample_titanic_train.csv from Datrics Storage→Samples (a minimal load sketch follows).
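
Outside Datrics the same step is a plain CSV read. A minimal pandas sketch, assuming a local copy of the sample file:

```python
import pandas as pd

# Hypothetical local copy of sample_titanic_train.csv from Datrics Storage
df = pd.read_csv("sample_titanic_train.csv")
print(df.shape)  # 891 records, per the dataset description above
```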
  2. Feature Engineering
    1. First of all, we need to describe each object in our dataset via a set of informative features - features with good predictive power. For this purpose, we analyse the input data to understand which features bring no useful information for further analysis and modeling, decide how to handle the incomplete-data problem, and represent some features in a more appropriate form (a consolidated pandas sketch of steps 2-5 follows this sub-list).
    2. Remove non-informative columns - columns with a very high share of missing values, filled with a constant value, or with a very high ratio of unique values (like a name or user ID).
      In our case we remove passengers' names, IDs, and tickets, because they identify particular passengers but say nothing general about them with respect to the target variable.
    3. Data Imputation - missing values treatment - replacing empty cells with the most appropriate value with respect to the data distribution.
      As we can see, the only attribute with missing values is Age. One of the most appropriate treatments for this kind of variable is to impute the mean calculated on the input sample.
    4. Feature Transformation - change feature values and/or derive new features.
      First of all, we transform the "Sex" variable into a binary representation, which is the most suitable form for most machine learning models. In addition, we introduce a new feature - "is_child" - which is used for correct stratified sampling, because the initial sample is biased towards adults.
    5. Feature Selection - select the features that are used for model training.
      Since we have introduced a new feature representing the passenger's sex, we remove the original one.
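
For reference, the four feature-engineering steps above can be reproduced with pandas. This is a minimal sketch, not the Datrics implementation: the column names follow the standard public Titanic dataset, and the helper name "is_male" and the under-18 threshold are assumptions for illustration.

```python
import pandas as pd

df = pd.read_csv("sample_titanic_train.csv")  # see the data-load sketch above

# 1. Remove non-informative identifier columns
df = df.drop(columns=["Name", "PassengerId", "Ticket"])

# 2. Data imputation: fill missing Age values with the sample mean
df["Age"] = df["Age"].fillna(df["Age"].mean())

# 3. Feature transformation: binary sex flag plus the "is_child" helper
#    (the name "is_male" and the age threshold of 18 are assumptions)
df["is_male"] = (df["Sex"] == "male").astype(int)
df["is_child"] = (df["Age"] < 18).astype(int)

# 4. Feature selection: drop the original "Sex" column, now redundant
df = df.drop(columns=["Sex"])
```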
  3. Prepare data sampling for model training and validation
    For model training we take 80% of the input sample while keeping the train/test subgroups homogeneous. That is why we use stratified sampling with respect to the target variable ("survived") and the input feature "is_child"; this step is sketched in code together with the next one.
  4. Get final feature set
    The column "is_child" is functionally dependent on the initial attribute "Age" and was created for stratified sampling only. That is why we delete it from the input feature set for both the train and test samples (see the sketch below).
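
Both the stratified 80/20 split and the removal of the helper column can be sketched with scikit-learn. Concatenating the two variables into one strata key is one possible way to stratify on both at once, not necessarily how the Datrics brick does it:

```python
from sklearn.model_selection import train_test_split

# df comes from the feature-engineering sketch above
strata = df["Survived"].astype(str) + "_" + df["is_child"].astype(str)
train, test = train_test_split(df, train_size=0.8, stratify=strata,
                               random_state=42)

# "is_child" was needed only for stratification; drop it from both samples
train = train.drop(columns=["is_child"])
test = test.drop(columns=["is_child"])
```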
  5. Train model for binary classification
    As the core of the proposed solution we use Gradient Boosting Binary Classification Trees, which can be replaced with any other binary classifier that supports the proposed form of input features (a code sketch follows the parameter list). Model parameters:
      • Learning Rate - 0.1
      • Number of Iterations - 100
      • Number of Leaves - 31
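
The listed parameters coincide with LightGBM's defaults, so LGBMClassifier is a plausible open-source stand-in (an assumption: the Datrics brick is not necessarily LightGBM). Keeping only numeric features is a simplification; the remaining string columns (Cabin, Embarked) would otherwise need encoding.

```python
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score

# train / test come from the split sketch above; keep numeric columns only
X_train = train.select_dtypes("number").drop(columns=["Survived"])
y_train = train["Survived"]
X_test = test.select_dtypes("number").drop(columns=["Survived"])
y_test = test["Survived"]

model = LGBMClassifier(learning_rate=0.1, n_estimators=100, num_leaves=31)
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))  # ~0.83 per the results below
```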
  6. Apply to new data
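
Applying the model to unseen data means repeating the same preprocessing on the new sample. A minimal sketch, assuming a hypothetical file sample_titanic_test.csv with the same schema (minus the target):

```python
import pandas as pd

new_df = pd.read_csv("sample_titanic_test.csv")  # hypothetical file name
new_df = new_df.drop(columns=["Name", "PassengerId", "Ticket"])
new_df["Age"] = new_df["Age"].fillna(new_df["Age"].mean())
new_df["is_male"] = (new_df["Sex"] == "male").astype(int)
new_df = new_df.drop(columns=["Sex"])

# model comes from the training sketch above
predictions = model.predict(new_df.select_dtypes("number"))
```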

Pipeline Results

Model Performance

The model provides 83% accuracy on the test set. To see the final results, press the "Model Performance" button in the Predict brick menu.

Feature Importance

The predictive importance of the input features for assessing the target variable can be accessed via the Train Brick→Model Performance dashboard. As expected, in our case the most important features are "is_male", "fare", and "age".
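
Outside the dashboard, the same ranking can be read from the fitted model; with the LightGBM stand-in from the training sketch above:

```python
import pandas as pd

# model and X_train come from the training sketch above
importances = pd.Series(model.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))
```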

Prediction Results

Model prediction results can be reviewed and analysed in the Predict→Output dashboard.

Model Save and Deployment

  • 🥒 Models Save/Download
  • 💻 AutoModel APIs