Split Data

Description

Splitting data into training and test sets is essential for verifying that the developed ML model works correctly by testing it on previously unseen data. This makes it possible to check whether the model is overfitting or underfitting, so you can further improve its architecture or data preprocessing.

Use

First, select the test split ratio to determine what fraction of the dataset will be withheld for the model's validation (e.g., 0.1 stands for a 90% training and 10% test split).
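As a concrete sketch of how that ratio maps to a split, here is an illustration using scikit-learn's train_test_split (the library and the toy data are assumptions for illustration; the tool itself may implement splitting differently):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset: 100 rows of features and labels (illustrative only).
X = np.arange(100).reshape(100, 1)
y = np.arange(100)

# test_size=0.1 withholds 10% of the rows for testing,
# leaving the remaining 90% for training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

print(len(X_train), len(X_test))  # 90 10
```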
You can shuffle the dataset before splitting to get different results on each run and to decrease the chance that the train and test sets contain non-overlapping classes (e.g., transactional data might initially be sorted by store ID, so some stores would never be used for training while others would never be evaluated). When working with time series, be mindful that some algorithms expect the data to remain sorted, so shuffling might hurt model performance.
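The sketch below contrasts a shuffled split with an order-preserving one for time series, again assuming scikit-learn purely for illustration:

```python
from sklearn.model_selection import train_test_split

data = list(range(10))

# Shuffled split (the default): rows are randomized before splitting.
# Fixing random_state makes the shuffle reproducible across runs.
train, test = train_test_split(data, test_size=0.2, shuffle=True,
                               random_state=42)

# Time-series-friendly split: shuffle=False keeps chronological order,
# so the test set is always the most recent 20% of the data.
train_ts, test_ts = train_test_split(data, test_size=0.2, shuffle=False)
print(test_ts)  # [8, 9]
```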
In addition, you can perform stratified sampling on specified columns (you can add more columns by clicking the '+' button).
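A minimal sketch of what stratification achieves, assuming scikit-learn and an imbalanced toy label column (both are illustrative assumptions, not this tool's internals):

```python
from sklearn.model_selection import train_test_split

# Toy data with a 70/30 class imbalance in the label column.
X = list(range(10))
y = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

# stratify=y preserves the 70/30 class ratio in both the train and
# test sets, analogous to selecting a stratification column in the UI.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
print(sorted(y_test))  # [0, 0, 1] -- the 70/30 mix is preserved
```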

Recommendations

  • Make sure that the test set is representative of the dataset as a whole. It's advisable to shuffle the dataset when possible.
  • Make sure that the test set is large enough to yield statistically meaningful results (e.g., a dataset of 500 observations requires at least 20% to be withheld for validation, while 1 million records might work well with under 5%).
  • Never train on test data; doing so can cause the model to overfit. The test set is used only for model evaluation and for comparing models, to see how well a model performs on unseen data and to estimate how it will behave in real-life scenarios.

Example:

Input variables:
  digits = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
  test_size = 0.4

Output variables:
  train_set = [0, 1, 2, 3, 4, 5]
  test_set = [6, 7, 8, 9]
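
This sequential (unshuffled) split can be reproduced as follows, again using scikit-learn only as an illustrative stand-in:

```python
from sklearn.model_selection import train_test_split

digits = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# shuffle=False keeps the original order, so test_size=0.4 simply
# takes the last 40% of the items as the test set.
train_set, test_set = train_test_split(digits, test_size=0.4,
                                       shuffle=False)

print(train_set)  # [0, 1, 2, 3, 4, 5]
print(test_set)   # [6, 7, 8, 9]
```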