🛥️

# Binary classification: Titanic

## Description

A pipeline for solving a classical binary classification problem: predicting the survival of Titanic passengers. It demonstrates how data processing and machine learning steps are applied to create a model that assesses which passengers were more likely to survive the Titanic shipwreck.

### Problem Statement

Based on the information about the Titanic's passengers, predict if a particular passenger survived the Titanic shipwreck or not.

### Dataset

The public Titanic Passengers dataset contains 891 records about the Titanic's passengers, each with the expected outcome for that passenger (survived or not).

Data Description

| Definition | Key |
| --- | --- |
| ID of passenger | |
| Survival | 0 = No, 1 = Yes |
| Passenger's name | |
| Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| Sex | |
| Age in years | |
| Number of siblings / spouses aboard the Titanic | |
| Number of parents / children aboard the Titanic | |
| Ticket number | |
| Passenger fare | |
| Cabin number | |
| Port of embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

Target variable: survived

## Datrics Pipeline

### Pipeline Description

1. Use sample_titanic_train.csv from Datrics Storage→Samples.
2. Feature engineering. First, we need to describe each object in the dataset with a set of informative features, that is, features with good predictive ability. To do this, we analyze the input data to identify features that carry no useful information for modeling, decide how to handle incomplete data, and represent some features in a more suitable form:
   1. Remove non-informative columns: columns with a very high share of missing values, columns filled with a single constant value, or columns with a very high ratio of unique values (such as name or user ID).
      - In our case we remove the passengers' names, IDs, and ticket numbers, because they identify particular passengers but say nothing general about the target variable.
   2. Data imputation (missing-value treatment): replace empty cells with the most appropriate value with respect to the data distribution.
      - In this sample, Age is the only attribute with missing values. A suitable treatment for this kind of variable is to fill the gaps with the mean calculated on the input sample.
   3. Feature transformation: change feature values and/or derive new features.
      - We transform the "Sex" variable into a binary feature, because this form suits most machine learning models. In addition, we introduce a new feature, "is_child", which is used for correct stratified sampling, because the initial sample is biased toward adults.
   4. Feature selection: select the features that are used for model training.
      - Because we introduced a new feature that represents the passenger's sex, we remove the original one.
3. Prepare data samples for model training and validation.
   - For model training we take 80% of the input sample while keeping the train and test subgroups homogeneous. That is why we use stratified sampling with respect to the target variable ("survived") and the input feature "is_child".
4. Get the final feature set.
   - The "is_child" column is functionally dependent on the original "Age" attribute and was created for stratified sampling only. That is why we delete it from the input feature set of both the train and test samples.
5. Train the model for binary classification.
   - As the core of the proposed solution we use Gradient Boosting Binary Classification Trees, which can be replaced with any other binary classifier that supports this form of input features. Model parameters:
     - Learning Rate - 0.1
     - Number of Iterations - 100
     - Number of leaves - 31
6. Apply to new data.
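
The preparation steps above can be sketched in plain Python. This is an illustrative, dependency-free sketch and not how the Datrics bricks are implemented. The column names (`passengerid`, `name`, `ticket`, `sex`, `age`, `fare`), the generated toy rows, the `is_male` feature name, the helper function names, and the child threshold of 18 years are all assumptions made for the example.

```python
import random
from collections import defaultdict
from statistics import mean

CHILD_AGE = 18  # assumed threshold; the source does not state one
ID_COLUMNS = ("passengerid", "name", "ticket")  # assumed column names


def engineer(rows):
    """Drop identifier columns, impute age with the mean, encode sex, add is_child."""
    mean_age = mean(r["age"] for r in rows if r["age"] is not None)
    result = []
    for r in rows:
        row = {k: v for k, v in r.items() if k not in ID_COLUMNS}
        if row["age"] is None:
            row["age"] = mean_age  # mean imputation on the input sample
        # Replace the text column "sex" with a binary feature.
        row["is_male"] = 1 if row.pop("sex") == "male" else 0
        row["is_child"] = 1 if row["age"] < CHILD_AGE else 0
        result.append(row)
    return result


def stratified_split(rows, keys, train_share=0.8, seed=0):
    """Split each (survived, is_child) group separately so both parts look alike."""
    groups = defaultdict(list)
    for r in rows:
        groups[tuple(r[k] for k in keys)].append(r)
    rng = random.Random(seed)
    train, test = [], []
    for members in groups.values():
        rng.shuffle(members)
        cut = round(len(members) * train_share)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test


def drop_column(rows, name):
    """Return copies of the rows without the given column."""
    return [{k: v for k, v in r.items() if k != name} for r in rows]


# Hypothetical toy rows standing in for the 891-record sample.
raw = [
    {"passengerid": i, "name": f"Passenger {i}", "ticket": f"T{i}",
     "sex": "male" if i % 2 else "female",
     "age": None if i % 10 == 0 else float(5 + (i * 7) % 60),
     "fare": 10.0 + i,
     "survived": 1 if i % 3 == 0 else 0}
    for i in range(1, 101)
]

features = engineer(raw)
train, test = stratified_split(features, keys=("survived", "is_child"))
# is_child was only needed for stratification, so remove it before training.
train = drop_column(train, "is_child")
test = drop_column(test, "is_child")
```

Splitting each stratum separately keeps the share of survivors and children roughly equal in the train and test parts, which is the homogeneity the pipeline description asks for.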

## Pipeline results

### Model performance

The model reaches 83% accuracy on the test set. To see the final results, press the "Model performance" button in the Predict brick menu.
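
To make the accuracy figure concrete, the following minimal sketch trains a tiny gradient-boosted classifier and scores it on held-out rows. It is a simplification under stated assumptions and not the Datrics implementation: each boosting step fits a one-split tree (a stump) instead of a tree with up to 31 leaves, the feature layout `[is_male, age, fare]` is assumed, and the rows and labels are invented toy values. Only the learning rate (0.1) and the number of iterations (100) are taken from the pipeline description.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, residuals):
    """Return (feature, threshold, left_value, right_value) minimising squared error."""
    best = None
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[j] > threshold]
            if not left or not right:
                continue  # a split must send rows to both sides
            lv, rv = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, j, threshold, lv, rv)
    return best[1:]


def train(X, y, learning_rate=0.1, n_iterations=100):
    """Boost stumps on the log-loss gradient, which is y - p for each row."""
    p0 = sum(y) / len(y)
    base = math.log(p0 / (1 - p0))  # initial log-odds of the positive class
    scores = [base] * len(y)
    stumps = []
    for _ in range(n_iterations):
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        j, t, lv, rv = fit_stump(X, residuals)
        lv, rv = learning_rate * lv, learning_rate * rv  # shrink each step
        stumps.append((j, t, lv, rv))
        scores = [s + (lv if row[j] <= t else rv) for row, s in zip(X, scores)]
    return base, stumps


def predict(model, row):
    base, stumps = model
    score = base + sum(lv if row[j] <= t else rv for j, t, lv, rv in stumps)
    return 1 if sigmoid(score) >= 0.5 else 0


def accuracy(y_true, y_pred):
    """Share of rows whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy rows: [is_male, age, fare]
X_train = [[1, 30.0, 8.0], [0, 28.0, 70.0], [1, 45.0, 13.0], [0, 6.0, 30.0],
           [1, 22.0, 7.0], [0, 35.0, 50.0], [1, 60.0, 25.0], [0, 19.0, 12.0]]
y_train = [0, 1, 0, 1, 0, 1, 0, 1]
X_test = [[1, 40.0, 9.0], [0, 25.0, 60.0], [1, 4.0, 20.0], [0, 50.0, 15.0]]
y_test = [0, 1, 1, 1]

model = train(X_train, y_train)
y_pred = [predict(model, row) for row in X_test]
print(accuracy(y_test, y_pred))  # 0.75 on this toy data
```

Accuracy is simply the fraction of test rows classified correctly, so 83% on the real test set means roughly five out of every six held-out passengers receive the right label.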

### Feature Importance

The predictive importance of the input features for the target variable is available in the Train brick→Model Performance dashboard. As expected, the most important features in our case are "is_male", "fare", and "age".

### Prediction Results

Model prediction results can be reviewed and analyzed in the Predict→Output dashboard.
