Exploratory Data Analysis

General Information

This brick provides a tool for non-visual exploratory data analysis (EDA). EDA is widely used as one of the first steps of an analysis. It helps in discovering patterns, finding anomalies, understanding the nature of the data and formulating hypotheses. The main metrics that a data analyst looks at are the mean, median, quartiles, maximum and minimum, kurtosis, skewness, dispersion and correlation. The mean is calculated as the sum of the values divided by their count.
The median is the middle element of the sorted data sample. If the difference between the mean and the median is significant, it may indicate that the data sample contains outliers.
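The effect of an outlier on the mean and the median can be illustrated with a small sketch. The sample values and the helper name `mean_median_gap` are hypothetical and are not part of the brick.

```python
from statistics import mean, median

# Hypothetical sample: five similar values, then the same sample plus one extreme value.
clean = [10, 12, 11, 13, 12]
with_outlier = clean + [95]

def mean_median_gap(values):
    """Return the absolute difference between the mean and the median."""
    return abs(mean(values) - median(values))

clean_gap = mean_median_gap(clean)            # mean and median stay close together
outlier_gap = mean_median_gap(with_outlier)   # the mean is pulled towards 95, the median is not

print(clean_gap, outlier_gap)
```

The median barely moves when the extreme value is added, while the mean shifts noticeably, which is why a large gap between them is a hint to look for outliers.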
Percentiles divide the sorted data sample into equal parts and show how far these parts are from each other. Quartiles, which divide the data sample into 4 parts, are used most often. The second quartile is the same as the median. If the difference between quartiles is large, the values are widely spread, which may point to outliers in the sample.
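As a minimal sketch, quartiles can be computed with Python's standard library. The interpolation rule used by the brick is not documented here, so `method="inclusive"` is an assumption, and the data is hypothetical.

```python
from statistics import median, quantiles

# Hypothetical sample of nine values.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 splits the sample into four parts and returns the three cut points.
# method="inclusive" is an assumption; other interpolation rules give
# slightly different quartiles on small samples.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

# The second quartile coincides with the median.
second_quartile_is_median = (q2 == median(data))

print(q1, q2, q3, second_quartile_is_median)
```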
Kurtosis and skewness describe the shape of the distribution. Kurtosis describes the tailedness of the variable, while skewness shows whether the distribution is symmetrical or not. A negative skew means that the variable is asymmetrical with a left-sided tail; a positive skew means a right-sided tail.
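The sign convention for skewness can be checked with a short sketch. It assumes the population definitions (mean of the cubed and fourth-power z-scores); the brick may apply a bias correction or a different kurtosis convention, and the sample data is hypothetical.

```python
from statistics import fmean, pstdev

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    m, s = fmean(values), pstdev(values)
    return fmean([((x - m) / s) ** 3 for x in values])

def excess_kurtosis(values):
    """Population excess kurtosis: the mean of the z-scores to the 4th power, minus 3."""
    m, s = fmean(values), pstdev(values)
    return fmean([((x - m) / s) ** 4 for x in values]) - 3.0

# Hypothetical samples.
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 2, 2, 3, 3, 3, 4, 20]      # one large value creates a right-sided tail
left_tailed = [-x for x in right_tailed]      # mirrored sample has a left-sided tail

print(skewness(symmetric), skewness(right_tailed), skewness(left_tailed))
print(excess_kurtosis(symmetric))
```

Mirroring the sample flips the sign of the skewness while leaving the kurtosis unchanged.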
Several metrics can be used to analyze dispersion. Variance shows how far values spread out from the mean and is calculated as the average of the squared differences between the values and the mean. Standard deviation is the square root of the variance.
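The calculation can be sketched as follows. Whether the brick divides by n (population variance) or by n - 1 (sample variance) is not stated here, so both are shown; the data is hypothetical.

```python
from statistics import fmean, pstdev, pvariance, variance

# Hypothetical sample with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Manual calculation: average of the squared differences from the mean.
m = fmean(data)
manual_variance = sum((x - m) ** 2 for x in data) / len(data)   # 32 / 8

population_variance = pvariance(data)   # divides by n
sample_variance = variance(data)        # divides by n - 1
population_std = pstdev(data)           # square root of the population variance

print(manual_variance, population_variance, sample_variance, population_std)
```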
Correlation is an important part of EDA because it helps in analyzing the relationship between variables. The closer a correlation coefficient is to 0, the weaker the relationship between the two variables. If the correlation coefficient is greater than 0, there is a positive relationship. If it is lower than 0, there is a negative relationship.
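A minimal sketch of a correlation coefficient, assuming the Pearson definition (the type of correlation used by the brick is not stated here). The function name `pearson` and the sample data are hypothetical.

```python
from math import sqrt
from statistics import fmean

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical variables.
x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]     # grows together with x
falling = [10, 8, 6, 4, 2]    # shrinks as x grows
unrelated = [3, 1, 4, 1, 3]   # no linear trend with x

print(pearson(x, rising), pearson(x, falling), pearson(x, unrelated))
```

The three results are close to 1, -1 and 0 respectively, matching the positive, negative and absent relationships described above.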


Brick Location

Bricks Analytics → Exploratory Data Analysis

Brick Parameters

  • Columns
    • If you have columns in your data that need to be ignored and not be shown in the dashboard or in the output data, you should specify them in this parameter. To select multiple columns, click the '+' button in the brick settings.
      In addition, you can ignore all columns except the ones you specified by enabling the "Remove all except selected" option. This may be useful if you have a large number of columns while needing just several of them to be analyzed.
  • Percentile
    • You can specify which percentiles you want to include. By default, the quartiles (0.25, 0.5, 0.75) are set.
  • Moment
    • The order of the moment to calculate. The 1st moment corresponds to the mean. The 2nd moment corresponds to the variance (set as default). The 3rd moment corresponds to skewness. The 4th moment corresponds to kurtosis.
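
The relation between the moment order and the familiar statistics can be sketched as below. Whether the brick reports raw, central or standardized moments is not stated here, so the sketch uses the usual textbook definitions as an assumption; the helper names and the data are hypothetical.

```python
from statistics import fmean, pvariance

def raw_moment(values, order):
    """Mean of the values raised to the given power."""
    return fmean([x ** order for x in values])

def central_moment(values, order):
    """Mean of the deviations from the mean raised to the given power."""
    m = fmean(values)
    return fmean([(x - m) ** order for x in values])

# Hypothetical sample with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

first_raw = raw_moment(data, 1)          # equals the mean
second_central = central_moment(data, 2) # equals the population variance

# Skewness and kurtosis are the 3rd and 4th central moments
# standardized by powers of the 2nd central moment.
skew = central_moment(data, 3) / second_central ** 1.5
kurt = central_moment(data, 4) / second_central ** 2

print(first_raw, second_central, skew, kurt)
```

Note that under these definitions the 1st central moment is always 0, so the correspondence with the mean holds for the raw moment only.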

Brick Inputs/Outputs

  • Inputs
    • The brick takes a dataset.
  • Outputs
    • The brick generates a new dataset that contains grouped information on central tendencies, percentiles, dispersions, distributions and outliers for each column that was not filtered out.

Example of Usage

Choose the Exploratory Data Analysis brick and select which percentiles and moment you want to be shown. Then run the pipeline to activate the Dashboard button.
In Dashboard mode there are three main tabs: “Statistics” shows summary statistics, “Correlation” shows the correlation matrix and “Details” displays data for each column with a histogram. Before using this brick, make sure all the columns have the right type.
In the Statistics tab you can see “Central Tendency” with the calculated mean, median and mode for each column. A large difference between the mean and the median may indicate that there are outliers.
“Percentiles” displays the estimated percentiles that were chosen in the first step.
“Dispersion” includes the standard deviation (STD), variance, standard error of the mean, interquartile range (IQR) and range.
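The less common dispersion metrics can be sketched as follows. The formulas are the standard ones and are an assumption about what the brick computes (in particular the sample standard deviation in the standard error and the "inclusive" quartile rule); the data is hypothetical.

```python
from math import sqrt
from statistics import quantiles, stdev

# Hypothetical sample.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Range: distance between the largest and the smallest value.
value_range = max(data) - min(data)

# Interquartile range: distance between the third and the first quartile.
q1, _, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Standard error of the mean: sample standard deviation divided by sqrt(n).
sem = stdev(data) / sqrt(len(data))

print(value_range, iqr, sem)
```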
In the “Distribution” tab there are the skew, kurtosis, the moment (of the order chosen at the beginning) and the Shapiro test.
The last one is “Outliers”, which displays data that lies outside of the quartiles.
The same information is available in Details for each particular column.
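For the “Outliers” section, one common quartile-based rule is Tukey's fences, sketched below. The exact rule applied by the brick is not documented here, so the 1.5 × IQR threshold, the function name `iqr_outliers` and the data are all assumptions.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the first or third quartile."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < low or x > high]

# Hypothetical column with one extreme value.
column = [10, 12, 11, 13, 12, 95]
outliers = iqr_outliers(column)

print(outliers)
```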