Let’s take a close look at Skewness and Kurtosis, two Data Science measures that help you analyze disparity in a dataset.
When doing Data Science, it’s important to know how to analyze the disparity of a dataset.
Disparity is when the data in a dataset is unbalanced.
For example, if you measure the age of people in a children’s playground where there are 20 children and 6 adults, the ages of these people will be disparate.
Indeed, the majority of the people are children, say between 4 and 8 years old. As for the adults, there are two babysitters between 20 and 24 years old, three parents between 28 and 35 years old and finally a 60-year-old grandmother.
And because a picture is worth a thousand words, here is the dataset graph:
Here we can see that the dataset is disparate. Most of the data are grouped on the left, while single individuals are on the right. We call these solitary data “outliers”, data “out of the norm”.
But then what is the purpose of Skewness and Kurtosis?
Above we’ve analyzed the disparity with the naked eye.
On this graph it is quite obvious because I wanted to take an exaggeratedly disparate example. In this case the eye does the job to determine the disparity.
But there are mathematical measures to do this analysis: much more precise measures that give us direct information about our dataset.
The first one is the Skewness.
Skewness measures the asymmetry of our dataset.
A dataset is symmetrical when the data are equally distributed on both sides of the mean.
A perfectly symmetrical dataset has a Skewness of 0.
But this measure also tells us about the type of skewness.
If the Skewness is greater than 0, then the dataset is skewed to the right. That is, the majority of the data is on the left and the outliers are on the right.
If the Skewness is less than 0, then the dataset is skewed to the left. That is, the majority of the data is on the right and the outliers are on the left.
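To make the sign of the Skewness concrete, here is a minimal sketch with pandas. The three small series are hypothetical toy values chosen only for illustration; they are not the playground dataset.

```python
import pandas as pd

# Toy values (hypothetical): most points are small, one large outlier on the right.
right_skewed = pd.Series([1, 1, 1, 2, 2, 3, 10])
# Mirroring the values moves the outlier to the left.
left_skewed = -right_skewed
# Values spread evenly around their mean.
symmetric = pd.Series([1, 2, 3, 4, 5])

print(right_skewed.skew())  # positive: skewed to the right
print(left_skewed.skew())   # negative: skewed to the left
print(symmetric.skew())     # 0 for perfectly symmetric data
```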
In our example of the children’s park, the skewness is 2.47.
Indeed we can see on the graph that the dataset is skewed to the right:
Since adults are much older than children, they unbalance the dataset on the right.
Let’s clean up the dataset by removing all the adults.
We obtain this graph:
Here, there is an imbalance. But it is much more difficult to see with the naked eye.
When we calculate the Skewness we get -0.006.
The imbalance is very slight, on the left this time.
It is the precision of the Skewness that allows us to determine this imbalance.
In addition to the direction of the asymmetry, the Skewness gives the strength of the imbalance. Here -0.006 indicates a very slight imbalance. As a common rule of thumb, an absolute value above 1 indicates a strongly skewed dataset.
Here is the code to calculate the Skewness on a pandas DataFrame in Python:
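A minimal sketch, assuming a DataFrame with a single hypothetical `age` column. The ages are made up to follow the playground description (20 children, 6 adults), so the results will not match the 2.47 and -0.006 quoted above; the cutoff of 18 used to remove the adults is also an assumption.

```python
import pandas as pd

# Made-up ages that follow the playground description: 20 children and 6 adults.
children = [4] * 2 + [5] * 5 + [6] * 6 + [7] * 5 + [8] * 2
adults = [20, 24, 28, 30, 35, 60]
df = pd.DataFrame({"age": children + adults})

# Skewness of one column.
skew_all = df["age"].skew()

# Same measure after removing the adults (18 is an assumed cutoff).
skew_children = df.loc[df["age"] < 18, "age"].skew()

print(skew_all, skew_children)

# DataFrame.skew() returns one value per numeric column.
print(df.skew())
```

With these made-up ages the full dataset comes out skewed to the right, while the children-only subset is symmetric around 6.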
The Kurtosis also measures the disparity of a dataset, but with another approach.
The Kurtosis measures the flatness of our curve.
A dataset is flat when the data are evenly spread out. It is humped when the data are grouped in one area.
Not to be confused with symmetry!
Let’s take the context of a symmetrical dataset like the one in our example:
Here, the Kurtosis tells us how far the data spread from the mean.
The more data lie far from the mean (6 here), the flatter the graph will be.
In our example, the graph is humped and symmetrical. This indicates that the data cluster around the mean.
For a symmetric dataset:
If the Kurtosis is greater than 0, then the dataset is leptokurtic, i.e., the majority of the data is at the mean, and the hump is accentuated.
If the Kurtosis is less than 0, then the dataset is platykurtic, i.e., the data tends to move away from the mean, and the hump is flattened.
When the Kurtosis is equal to 0, then the dataset is mesokurtic, i.e., the dataset is humped following a normal distribution.
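The three cases above can be illustrated with a minimal sketch on synthetic samples. The distributions, the sample size and the seed are assumptions chosen for illustration; pandas reports the Kurtosis so that a normal distribution is close to 0.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
n = 100_000

# Laplace: sharp peak at the mean, expected to be leptokurtic (Kurtosis > 0).
lepto = pd.Series(rng.laplace(size=n)).kurt()
# Uniform: no hump at all, expected to be platykurtic (Kurtosis < 0).
platy = pd.Series(rng.uniform(size=n)).kurt()
# Normal: the reference shape, expected to be mesokurtic (Kurtosis near 0).
meso = pd.Series(rng.normal(size=n)).kurt()

print(lepto, platy, meso)
```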
In our example, the Kurtosis is -0.03.
The Kurtosis also tells us how pronounced the hump is. Here -0.03 indicates a hump that is very slightly flatter than that of a normal distribution.
Here is the code to calculate the Kurtosis on a pandas DataFrame in Python:
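A minimal sketch, assuming a DataFrame with a single hypothetical `age` column. The ages are made up to resemble the children-only dataset, so the result will not match the -0.03 quoted above.

```python
import pandas as pd

# Made-up ages for the 20 children, symmetric around a mean of 6.
children = [4] * 2 + [5] * 5 + [6] * 6 + [7] * 5 + [8] * 2
df = pd.DataFrame({"age": children})

# Kurtosis of one column (0 corresponds to a normal distribution).
kurt_age = df["age"].kurt()
print(kurt_age)

# DataFrame.kurt() returns one value per numeric column.
print(df.kurt())
```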
Disparate dataset = bad dataset?
In this article, you’ve learned how to calculate the Skewness and Kurtosis.
With these, you will be able to analyze more precisely the disparity of your dataset.
But keep in mind that a disparate dataset is not necessarily a bad dataset!
First and foremost, Skewness and Kurtosis are good indicators for understanding your data. How you interpret them will depend on the context of your project.
See you soon in the next article 😉