Outliers Treatment

General information

An outlier is a value that lies far from the other values in a dataset. Outliers can obscure general patterns in the data and distort results, so it is often recommended to exclude them before statistical analysis. There are several ways to find outliers; the main ones are described below.
The interquartile range (IQR) is often used to detect outliers. The method takes the 1st and 3rd quartiles (Q1 and Q3), computes the interquartile range, and derives lower and upper bounds from it. The formulas, with the conventional multiplier of 1.5, are:
  IQR = Q3 - Q1
  lower bound = Q1 - 1.5 × IQR
  upper bound = Q3 + 1.5 × IQR
A value below the lower bound or above the upper bound is considered an outlier.
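The interquartile range rule can be sketched in a few lines of Python. This is a minimal illustration, not the brick's implementation: the function names iqr_bounds and label_outliers are hypothetical, the multiplier k = 1.5 is the conventional choice and an assumption here, and the sample values are made up. The -1/1 labels follow the convention used by the brick's "indicate outliers" output.

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) bounds; k=1.5 is the conventional multiplier (assumption)."""
    # method="inclusive" interpolates linearly between the sample's own min and max
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def label_outliers(values, k=1.5):
    """Label each value: -1 for an outlier, 1 for a normal value."""
    lower, upper = iqr_bounds(values, k)
    return [-1 if (v < lower or v > upper) else 1 for v in values]

# Made-up sample: nine values close together and one far away
values = [7, 8, 9, 10, 11, 12, 13, 14, 15, 80]
print(iqr_bounds(values))
print(label_outliers(values))
```

Different quantile conventions give slightly different Q1 and Q3 on small samples, so the bounds may differ a little from those produced by another tool.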
Another approach is to use an anomaly detection algorithm such as Isolation Forest. The idea is to isolate points by recursive random partitioning: outliers are few and lie far from the rest, so they tend to be separated after fewer splits. The algorithm does not need to model the normal values or compare all points with one another, which makes it fast and often accurate. The number of splits required to isolate a sample equals the path length from the root node to the terminating node. This path length, averaged over a forest of such random trees, is a measure of normality and serves as the decision function.
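The path-length idea can be illustrated with a deliberately simplified sketch. It is not a full Isolation Forest: it works on a single numeric feature, uses no subsampling, and does not normalize the path length into a score. The names path_length and average_path_length and the sample data are hypothetical.

```python
import random

def path_length(point, data, rng, depth=0, max_depth=10):
    """Count the random splits needed to isolate `point` within `data` (one feature)."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    lo, hi = min(data), max(data)
    if lo == hi:
        return depth  # identical values cannot be split any further
    split = rng.uniform(lo, hi)
    # Follow only the side of the split that contains the point
    if point < split:
        side = [x for x in data if x < split]
    else:
        side = [x for x in data if x >= split]
    return path_length(point, side, rng, depth + 1, max_depth)

def average_path_length(point, data, n_trees=200, seed=0):
    """Average the path length over many random partitionings (the 'forest')."""
    rng = random.Random(seed)
    return sum(path_length(point, data, rng) for _ in range(n_trees)) / n_trees

# Made-up sample: nine values close together and one far away
data = [10, 11, 12, 13, 14, 15, 16, 17, 18, 100]
print(average_path_length(100, data))  # far-away value: isolated after few splits
print(average_path_length(14, data))   # value inside the cluster: needs more splits
```

A short average path marks a point as easy to isolate, which is what the algorithm treats as anomalous.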
There is also an SVM-based algorithm for anomaly detection, One-Class SVM, which can be used to detect outliers. A standard SVM separates two classes with the hyperplane that has the largest possible margin. One-Class SVM uses a hypersphere instead of a hyperplane: it tries to enclose the normal data in the smallest possible hypersphere, and points that fall outside it are treated as outliers.
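The hypersphere idea can be shown without an SVM solver. The sketch below is a simplified stand-in and not a One-Class SVM: it uses no kernel and no optimization, fixes the center at the mean of the data, and chooses the radius so that a given fraction of points falls outside. The function name hypersphere_labels, the fraction parameter, and the sample points are all hypothetical.

```python
import math

def hypersphere_labels(points, fraction=0.1):
    """Label points outside a mean-centered sphere as -1 and the rest as 1.

    Simplified stand-in for the hypersphere idea, not an SVM.
    `fraction` is the share of points left outside the sphere.
    """
    dims = len(points[0])
    center = [sum(p[d] for p in points) / len(points) for d in range(dims)]
    distances = [math.dist(p, center) for p in points]
    n_outside = max(1, round(fraction * len(points)))
    # The radius is the distance of the farthest point that stays inside
    radius = sorted(distances)[len(points) - n_outside - 1]
    return [1 if d <= radius else -1 for d in distances]

# Made-up sample: nine points in a small cluster and one far away
points = [(1, 1), (1, 2), (2, 1), (2, 2), (1.5, 1.5),
          (1, 1.5), (2, 1.5), (1.5, 1), (1.5, 2), (9, 9)]
print(hypersphere_labels(points, fraction=0.1))
```

A real One-Class SVM learns the boundary in a kernel-induced feature space, so it can follow shapes that a single sphere around the mean cannot.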
Boxplots and other visualization techniques are also used to detect outliers.

Description

Brick Locations

Bricks → Analytics → Outliers Treatment

Brick Parameters

  • Outcome
    • There are three ways to treat outliers:
      • remove outliers
      • select outliers
      • indicate outliers
Advanced parameters:
  • Outliers treatment strategy
    • You can choose:
      • Interquartile range
      • Isolation Forest
      • One-Class SVM
  • Percent (used by Isolation Forest and One-Class SVM)
    • Can be any decimal number in the range from 0.01 to 0.5.
  • Columns
    • If your data has columns that should be ignored and not shown in the dashboard or in the output data, specify them in this parameter. To select multiple columns, click the '+' button in the brick settings.
      You can also ignore all columns except the ones you specified by enabling the "Remove all except selected" option. This is useful when you have a large number of columns but need to analyze only a few of them.
Default Parameters
In simple mode, ‘interquartile range’ is used as the strategy and ‘remove outliers’ as the way to treat the outliers.

Brick Inputs/Outputs

  • Inputs
    • The brick takes a dataset.
  • Outputs
    • Depending on the chosen treatment, the brick produces the dataset without outliers, a dataset that contains only the outliers, or the dataset with a new column that holds -1 for outliers and 1 for normal values.
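The three outputs can be sketched as follows, assuming each row already has a label of -1 (outlier) or 1 (normal) from one of the detection strategies. This is an illustration rather than the brick's code: the function treat_outliers, the outcome strings, and the sample rows are hypothetical, while the predicted_outliers column name is the one that appears in the brick's output.

```python
def treat_outliers(rows, labels, outcome):
    """Apply one of the three treatments to labeled rows.

    rows: list of dicts; labels: -1 for outlier, 1 for normal;
    outcome: "remove", "select" or "indicate" (hypothetical names).
    """
    if outcome == "remove":
        # Keep only the normal rows
        return [row for row, label in zip(rows, labels) if label == 1]
    if outcome == "select":
        # Keep only the outliers
        return [row for row, label in zip(rows, labels) if label == -1]
    if outcome == "indicate":
        # Keep every row and add the label as a new column
        return [{**row, "predicted_outliers": label}
                for row, label in zip(rows, labels)]
    raise ValueError(f"unknown outcome: {outcome!r}")

# Made-up rows and labels, for illustration only
rows = [{"Name": "A", "Fare": 7.25},
        {"Name": "B", "Fare": 8.05},
        {"Name": "C", "Fare": 512.33}]
labels = [1, 1, -1]

print(treat_outliers(rows, labels, "remove"))
print(treat_outliers(rows, labels, "select"))
print(treat_outliers(rows, labels, "indicate"))
```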

Example of usage

For demonstration, consider the 🛳️Titanic dataset, a subset of which is shown below:
[Image: subset of the Titanic dataset]

Simple mode with select outliers

If we choose ‘select outliers’, only the outliers are returned in the dataset.
[Image: dataset containing only the outliers]

Advanced mode with indicate outliers

We turn on the advanced settings, choose the Isolation Forest algorithm, and leave the default percent.
As a result, we get a dataset with a predicted_outliers column, where outliers are marked as -1.
[Image: dataset with the predicted_outliers column]