THIS easy technique will BOOST your Machine Learning algorithm

Many beginners forget this technique when they create their Machine Learning algorithm. Yet it is crucial!

This technique is called normalization.

Normalizing means putting the values of your DataFrame on the same scale, for example between 0 and 1.

It’s a simple technique, but it can noticeably improve the performance of your algorithm!

Get started

For this tutorial, we’ll take the happiness.csv dataset that you can download here on GitHub.

First, we import the pandas library and load our dataset:

import pandas as pd

df = pd.read_csv('happiness.csv', usecols=['Gender','Mean','N='])

As you can see, for this tutorial, we only use the ‘Gender’, ‘Mean’, ‘N=’ columns.

These columns represent respectively the gender of the interviewees, their average happiness rate and finally the number of interviewees.

The dataset groups these people by country but… for this tutorial, we don’t need this category 😉
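To see what usecols does without downloading anything, here is a minimal sketch on an in-memory CSV. The rows and the 'Country' column name are invented for illustration and are not taken from happiness.csv.

```python
import io
import pandas as pd

# Invented CSV content; 'Country' is an assumed name for the grouping column
csv_text = (
    "Country,Gender,Mean,N=\n"
    "A,Male,5.9,100\n"
    "A,Female,6.1,120\n"
)

# usecols keeps only the listed columns and ignores the rest
df_demo = pd.read_csv(io.StringIO(csv_text), usecols=['Gender', 'Mean', 'N='])

print(df_demo.columns.tolist())
print(df_demo.shape)
```

The 'Country' column never reaches the DataFrame, which is exactly what we want here.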

We can display the first lines of our dataset:

df.head()

And we can get down to business!

Normalizing numbers

The first normalization is applied to quantitative data, in other words: numbers.

To do so, we apply a very simple method: divide all the numbers in a column by the column’s max value.

This gives us values between 0 and 1.

Indeed, the max value is divided by itself, so it becomes 1. Every other value is smaller than the max, so it ends up between 0 and 1.

Keep in mind that this only works when the values are positive. For other cases, other methods exist, such as min-max scaling.
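As an illustration of one of those other methods, here is a minimal sketch of min-max scaling, which subtracts the minimum before dividing by the range. The values are made up, and this is not the method used in the rest of this tutorial.

```python
import pandas as pd

# Made-up values that include negatives (not taken from happiness.csv)
s = pd.Series([-4.0, 0.0, 2.0, 6.0])

# Dividing by the max leaves negative values outside the 0-1 range
by_max = s / s.max()

# Min-max scaling: shift by the min, then divide by the range (max - min)
min_max = (s - s.min()) / (s.max() - s.min())

print(by_max.tolist())
print(min_max.tolist())  # [0.0, 0.4, 0.6, 1.0]
```

Note that min-max scaling assumes the column is not constant, otherwise the range is 0 and the division fails.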

To apply the normalization to a whole column, we don’t need a loop: dividing a pandas column by a number divides every value in it.

So we compute the maximum of the column, then divide the column by it:

mean_max = df['Mean'].max()

df['Mean'] = df['Mean'] / mean_max

Afterwards, we apply exactly the same method to the ‘N=’ column:

n_max = df['N='].max()

df['N='] = df['N='] / n_max
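If you want to check the divide-by-max idea without the real file, here is a small sketch on a hand-made table. The three rows are invented and only mimic the tutorial’s column names.

```python
import pandas as pd

# Hypothetical rows that mimic the tutorial's columns (invented values)
toy = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Both'],
    'Mean': [5.0, 7.5, 10.0],
    'N=': [200, 300, 500],
})

# Divide each numeric column by its own maximum
for col in ['Mean', 'N=']:
    toy[col] = toy[col] / toy[col].max()

print(toy['Mean'].tolist())  # [0.5, 0.75, 1.0]
print(toy['N='].tolist())    # [0.4, 0.6, 1.0]
```

Each column now tops out at 1, and the row that held the maximum is easy to spot.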

And that’s it! We have normalized the quantitative values of our dataset.

To check our result, we can display the first lines of the dataset:

df.head()

We are heading in the right direction! But there is still one column that is problematic for us: ‘Gender’.

Normalizing the categories

To normalize the column ‘Gender’, we must first understand it.

We can analyze it with the unique() function, which returns the different values that exist in a column:

df.Gender.unique()

Output: ['Male', 'Female', 'Both']

We see that there are only 3 possible values: Male, Female, or Both (the two groups combined).

Since there are only a few distinct values, we can deduce that this is a categorical variable.
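A few related calls help confirm that a column is categorical. This is a small sketch on an invented Series, not on the real ‘Gender’ column.

```python
import pandas as pd

# Invented values standing in for the 'Gender' column
gender = pd.Series(['Male', 'Female', 'Both', 'Male', 'Both'], name='Gender')

print(gender.unique())        # distinct values, in order of first appearance
print(gender.nunique())       # number of distinct values
print(gender.value_counts())  # number of rows for each value
```

A small number of distinct values compared to the number of rows is a good hint that the column holds categories.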

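And that’s just perfect for us!

In fact, with pandas, categories are very easy to turn into numbers. The technique is called one-hot encoding: each category gets its own column of 0s and 1s.

For this we use the get_dummies() function, with dtype=int so that the new columns contain 0 and 1 (recent pandas versions return True/False by default):

df = pd.get_dummies(df, dtype=int)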

We can display the result to see exactly what it looks like:

df.head()

The category is nicely encoded! The get_dummies() function creates 3 columns, one for each value of ‘Gender’.

If the row corresponds to a man, a 1 is put in the ‘Male’ column and a 0 in the two others. The same goes for women, and for men & women combined.
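Here is a minimal sketch of that encoding on a hand-made table. The rows are invented, and the columns argument is used here only to make explicit which column gets encoded.

```python
import pandas as pd

# Invented rows; only 'Gender' is categorical
toy = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Both', 'Male'],
    'Mean': [0.5, 0.75, 1.0, 0.6],
})

# One 0/1 column per distinct value of 'Gender'
encoded = pd.get_dummies(toy, columns=['Gender'], dtype=int)

print(encoded.columns.tolist())
# ['Mean', 'Gender_Both', 'Gender_Female', 'Gender_Male']
print(encoded['Gender_Male'].tolist())  # [1, 0, 0, 1]
```

Every row has exactly one 1 across the three new columns, which is where the name one-hot comes from.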

Conclusion

That’s it! We have normalized our dataset in a very short time.

We end up with more columns than at the beginning. One could think that this will make learning harder for our Machine Learning algorithm.

Quite the opposite! Since all the values are now between 0 and 1, no column dominates the others just because it is measured in bigger numbers, which usually makes training easier for algorithms that rely on distances or gradients.
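To make this concrete, here is a small sketch with invented numbers. It compares the Euclidean distance between rows before and after dividing by the max. The distance() helper is hypothetical and written just for this example.

```python
import pandas as pd

# Two invented features on very different scales
pts = pd.DataFrame({
    'Mean': [5.0, 9.0, 5.2],
    'N=': [1000, 1010, 2000],
})

def distance(df, i, j):
    """Euclidean distance between rows i and j of df."""
    diff = df.iloc[i] - df.iloc[j]
    return float((diff ** 2).sum() ** 0.5)

# Same divide-by-max normalization as in this tutorial
scaled = pts / pts.max()

# Raw data: the distance is driven almost entirely by 'N='
print(distance(pts, 0, 1), distance(pts, 0, 2))

# Scaled data: both columns contribute on a comparable scale
print(distance(scaled, 0, 1), distance(scaled, 0, 2))
```

On the raw data, the large gap in ‘Mean’ between rows 0 and 1 barely counts next to the gap in ‘N=’ between rows 0 and 2. After scaling, the two distances are of the same order.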

We have just seen how to normalize our data by hand but, in truth, there are libraries that can do this work for you. Find out more in this article!

Tom Keldenich

Data Engineer & passionate about Artificial Intelligence !

Founder of the website Inside Machine Learning
