An outlier is a value that lies far from the other values. Outliers affect general data patterns and distort real results. Before statistical analysis, it is recommended to exclude them from the dataset. There are several ways how to find outliers and they will be described further.
Interquartile range is often used to detect outliers. The method takes 1st and 3rd quartiles (Q1 and Q3) to calculate interquartile range and then find lower and upper bounds. The formulas are:
So if the value is below lower bound or above upper bound, it is considered as an outlier.
Other way is using anomaly detection algorithm such as isolation forest. The logic behind this is to find isolated points by recursive partitioning. In this case we do not need to find the normal values and compare all the points, which making this algorithm fast and accurate. The number of splittings required to isolate a sample is equivalent to the path length from the root node to the terminating node. This path length, averaged over a forest of such random trees, is a measure of normality and our decision function.
There is also an SVM-based algorithm for anomaly detection, which also can be used to detect outliers. SVM is used to separate two classes with one hyperplane with the largest possible margin. One-class SVM uses hypersphere instead of hyperplane and tries to separate outliers with the smallest possible hypersphere.
Also boxplots and other visualization techniques are used in order to detect outliers.
Bricks → Analytics → Outliers Treatment
Explain what parameters of the brick needs to or can be filled.
- remove outliers
- select outliers
- indicate outliers
There are three ways to treat outliers:
- Outliers treatment strategy
- Interquartile range
- Isolation forest
- One-Class SVM
You can choose:
- Percent (using in Isolation Forest and One-Class SVM)
Can be any decimal number in range from 0.01 to 0.5.
If you have columns in your data that need to be ignored and not be shown in the dashboard or in the output data, you should specify them in this parameter. To select multiple columns, click the '+' button in the brick settings.
In addition, you can ignore all columns except the ones you specified by enabling the "Remove all except selected" option. This may be useful if you have a large number of columns while needing just several of them to be analyzed.
In a simple mode ‘interquartile range’ is used as a strategy and ‘remove outliers’ as a way to treat the outliers.
Brick takes the dataset
Depending on chosen treatment, brick produces the dataset without outliers, the dataset that contains all the outliers or the dataset with a new column which returns -1 for outliers and 1 for normal values.
For demonstration, let us consider the Titanic dataset, which subset is presented below:
If we choose ‘select outliers’, only outliers will be returned in a dataset
We turn on advanced settings to choose Isolation Forest algorithm and leave default percent.
As a result we get a dataset with a predicted_outliers column, where outliers marked as -1.