Isolation Forest Model

General information

Isolation Forest is an unsupervised ensemble learning algorithm for anomaly detection. It relies on the principle that anomalies are easier to isolate than normal points, so they end up in leaves close to the root of a tree.
Each tree isolates observations by randomly selecting a feature and then randomly selecting a split value between the minimum and maximum values of that feature.
The path length from the root node to the terminating node, averaged over a forest of such random trees, is a measure of normality and serves as the decision function.
Random partitioning produces noticeably shorter paths for anomalies. Hence, when a forest of random trees collectively produces shorter path lengths for particular samples, they are very likely to be anomalies.
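The relationship between path length and anomalies can be illustrated with a minimal sketch in plain Python. It is not the brick's implementation: it follows a single point through random splits on the full dataset instead of building subsampled trees, and the helper names (`isolation_path_length`, `average_path_length`) and the synthetic data are assumptions made for illustration.

```python
import random

def isolation_path_length(point, data, rng, max_depth=50):
    """Count the random splits applied until `point` is alone in its partition."""
    depth = 0
    while len(data) > 1 and depth < max_depth:
        feature = rng.randrange(len(point))           # pick a random feature
        values = [row[feature] for row in data]
        low, high = min(values), max(values)
        if low == high:
            break  # simplification: stop if the chosen feature is constant
        split = rng.uniform(low, high)                # pick a random split value
        # Keep only the rows that fall on the same side of the split as `point`.
        if point[feature] < split:
            data = [row for row in data if row[feature] < split]
        else:
            data = [row for row in data if row[feature] >= split]
        depth += 1
    return depth

def average_path_length(point, data, n_trees=100, seed=0):
    """Average the path length of `point` over `n_trees` random partitionings."""
    rng = random.Random(seed)
    total = sum(isolation_path_length(point, data, rng) for _ in range(n_trees))
    return total / n_trees

# Synthetic data: a dense cluster around the origin plus one distant point.
data_rng = random.Random(42)
cluster = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
inlier, outlier = (0.0, 0.0), (8.0, 8.0)
data = cluster + [inlier, outlier]

print("outlier:", average_path_length(outlier, data))
print("inlier: ", average_path_length(inlier, data))
```

The distant point is separated after only a few splits, while the point in the middle of the cluster needs many more, which is the behaviour the averaged path length is meant to capture.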

Description

Brick Locations

Bricks → Machine Learning → Isolation Forest

Brick Parameters

  • Number of estimators
    • The number of base estimators in the ensemble (100 by default).
  • Contamination
    • The amount of contamination of the dataset, i.e. the expected proportion of outliers in the dataset (0.1 by default).
  • Columns
    • Columns from the dataset that the algorithm ignores. They are still present in the resulting dataset. Multiple columns can be selected by clicking the + button.
      To ignore a large number of columns, select the columns to keep instead and enable the ‘Remove all except selected’ flag.
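The column selection described above can be sketched as follows. This is one reading of the ‘Columns’ parameter and the ‘Remove all except selected’ flag, not the brick's actual code; the function name `feature_columns` and its arguments are hypothetical.

```python
def feature_columns(all_columns, selected, remove_all_except_selected=False):
    """Return the columns the algorithm would use as features.

    By default the selected columns are ignored. With the flag enabled,
    only the selected columns are used and all others are ignored.
    """
    selected = set(selected)
    if remove_all_except_selected:
        return [c for c in all_columns if c in selected]
    return [c for c in all_columns if c not in selected]

columns = ["Unnamed: 0", "0", "1"]

# Ignore the index column explicitly.
print(feature_columns(columns, ["Unnamed: 0"]))

# Same result, expressed by selecting the columns to keep.
print(feature_columns(columns, ["0", "1"], remove_all_except_selected=True))
```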

Brick Inputs/Outputs

  • Inputs
    • The brick takes a dataset
  • Outputs
    • The input dataset with an extra column called ‘predicted_cluster’, where ‘-1’ marks an outlier and ‘1’ marks a normal point
    • A model that can be used in other bricks as an input
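For readers who want to reproduce similar outputs outside the platform, here is a rough sketch that assumes scikit-learn's `IsolationForest` behaves comparably to the brick (the documentation above does not state which implementation the brick uses). The function `run_isolation_forest` and the synthetic data are illustrative assumptions; only the ‘predicted_cluster’ column name and the default values come from the description above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

def run_isolation_forest(df, ignored_columns=(), n_estimators=100,
                         contamination=0.1, random_state=0):
    """Fit an Isolation Forest and return (labelled dataset, fitted model)."""
    # Ignored columns are excluded from fitting but kept in the result.
    features = df.drop(columns=list(ignored_columns))
    model = IsolationForest(
        n_estimators=n_estimators,
        contamination=contamination,
        random_state=random_state,
    )
    labels = model.fit_predict(features)  # -1 for outliers, 1 for normal points
    result = df.copy()
    result["predicted_cluster"] = labels
    return result, model

# Synthetic stand-in data with an index-like column and two features.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Unnamed: 0": range(200),
    "0": rng.normal(size=200),
    "1": rng.normal(size=200),
})

result, model = run_isolation_forest(df, ignored_columns=["Unnamed: 0"])
print(result["predicted_cluster"].value_counts())

# The fitted model can be reused to label other data with the same features.
new_labels = model.predict(df[["0", "1"]].head(5))
```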

Example of usage

Let’s detect anomalies in the ‘segmentation_moons.csv’ dataset using the Isolation Forest algorithm. The dataset consists of 3 columns: ‘Unnamed: 0’, ‘0’ and ‘1’.
[Image: preview of the ‘segmentation_moons.csv’ dataset]
We can connect this dataset directly to the Isolation Forest Brick and keep the default parameter values: 100 for ‘Number of estimators’ and 0.1 for ‘Contamination’. We should also select the ‘Unnamed: 0’ column to be ignored, as it only holds the record index and doesn’t represent a feature of the sample.
[Images: Isolation Forest Brick connection and settings]
The output dataset looks like this:
[Image: output dataset]
We can visualize the model predictions using the Charts Brick with the following settings:
[Images: Charts Brick settings]
After running the pipeline, we get a scatterplot with the predicted anomalies.
[Image: scatterplot with the predicted anomalies]
Increasing the ‘Contamination’ rate to 0.3 causes more points to be flagged as anomalies.
[Image: scatterplot with ‘Contamination’ set to 0.3]
On the other hand, if ‘Contamination’ is a small number such as 0.05, fewer points are flagged as anomalies.
[Image: scatterplot with ‘Contamination’ set to 0.05]
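The effect of the ‘Contamination’ value can be approximated with a simplified, rank-based sketch: the given fraction of points with the highest anomaly scores is labelled ‘-1’. The scores below are made up, and the function `label_by_contamination` is hypothetical; an actual implementation may derive its threshold differently (for example, from a percentile of the scores).

```python
def label_by_contamination(scores, contamination):
    """Label the `contamination` fraction of highest-scoring points as -1."""
    n_outliers = round(len(scores) * contamination)
    # Indices sorted from the most to the least anomalous score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    labels = [1] * len(scores)
    for i in ranked[:n_outliers]:
        labels[i] = -1
    return labels

# Hypothetical anomaly scores for 100 points (higher means more anomalous).
scores = [i / 100 for i in range(100)]

for contamination in (0.05, 0.1, 0.3):
    flagged = label_by_contamination(scores, contamination).count(-1)
    print(contamination, flagged)
```

With 100 points, the three settings flag 5, 10 and 30 points respectively, which mirrors the change visible in the scatterplots.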