Data Augmentation – The Simple Method You Need for CSV Data

Is it possible to perform Data Augmentation on CSV data without degrading your dataset? Let’s answer that in this article.

Data Augmentation is the process of generating additional training data.

It consists of applying transformations to existing data to obtain new data.

This method is commonly used in Machine Learning to increase the size and diversity of the training data set.

Ultimately, this can help improve the model's generalization and thus its performance.

Which kinds of data does it apply to?

Although Data Augmentation can be applied to any type of data, it is generally easiest to perform on images, audio and text data.

These types of data can be transformed in various ways without changing the information they contain.

For example, a word in a sentence can be changed to a synonym.

The sentence will keep its meaning, but the data will be different.
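The synonym idea can be sketched in a few lines of Python. This is only a minimal illustration: the `SYNONYMS` table and the `augment_sentence` function are hypothetical names written for this example, and a real project would rely on a proper lexical resource.

```python
import random

# Hypothetical, hand-written synonym table (an assumption for illustration).
SYNONYMS = {
    "quick": ["fast", "rapid"],
    "happy": ["glad", "joyful"],
}

def augment_sentence(sentence, rng):
    """Return a copy of `sentence` where known words are swapped for a synonym."""
    words = sentence.split()
    new_words = [rng.choice(SYNONYMS[w]) if w in SYNONYMS else w for w in words]
    return " ".join(new_words)

rng = random.Random(0)  # fixed seed so the run is reproducible
original = "the quick dog looks happy"
augmented = augment_sentence(original, rng)
print(augmented)
```

The new sentence has the same length and the same meaning, but it is a different training example.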

For an image, we can apply a rotation: the picture is different, but it still shows the same thing.

This allows us to create new data while ensuring that it remains plausible, meaning that it could belong to the real world.
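To make the image case concrete, here is a small sketch of a rotation, assuming the image is a tiny grayscale grid stored as a list of rows. The `rotate_90_clockwise` function is a name chosen for this example; image libraries provide their own rotation functions.

```python
def rotate_90_clockwise(image):
    """Rotate a 2D image (list of rows) by 90 degrees clockwise."""
    # Reversing the row order and then transposing gives a clockwise rotation.
    return [list(row) for row in zip(*image[::-1])]

# Hypothetical 2x3 image: each number is a pixel intensity.
image = [
    [1, 2, 3],
    [4, 5, 6],
]
rotated = rotate_90_clockwise(image)
for row in rotated:
    print(row)
```

The rotated image contains exactly the same pixels, only arranged differently, which is why the new sample stays realistic.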

But for CSV data, it’s more difficult.

The case of CSV data for Data Augmentation

CSV data, or tabular data such as what you find in an Excel file, can also be augmented.

You can easily create a new row and put random data in a column.

The problem?

You risk creating data that does not exist in the real world.

For example, if we have temperature data for different cities, we could augment it by assigning random values to new rows.

With this approach, we could end up with data like: 60°C in Paris in January.

In this case, we don't augment our data, we degrade it.

Why?

Because such a temperature in Paris in January is impossible. It does not belong to the real world.
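The problem can be reproduced with a short sketch. Everything here is an assumption made for illustration: the toy dataset, the `naive_augment` function, the wide range of -50 to 60°C, and the 25°C threshold used to flag implausible January values.

```python
import csv
import io
import random

# Hypothetical toy dataset (values invented for illustration).
raw = """city,month,temperature_c
Paris,January,5
Paris,January,7
Madrid,January,10
"""
rows = list(csv.DictReader(io.StringIO(raw)))

rng = random.Random(42)  # fixed seed so the run is reproducible

def naive_augment(rows, n_new, rng):
    """Create n_new rows by copying a real row and replacing the
    temperature with a value drawn blindly from a wide range."""
    new_rows = []
    for _ in range(n_new):
        base = dict(rng.choice(rows))
        base["temperature_c"] = str(rng.randint(-50, 60))
        new_rows.append(base)
    return new_rows

synthetic = naive_augment(rows, 100, rng)

# Flag rows above 25°C in January (threshold assumed for this example).
implausible = [r for r in synthetic if int(r["temperature_c"]) > 25]
print(len(implausible), "of", len(synthetic), "synthetic rows look implausible")
```

A model trained on these rows would learn from temperatures that never occur, which is exactly the degradation described above.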

We could instead define a range of plausible values for the temperature in Paris in January and draw randomly from that range.

But the temperature can change a lot over a month, so this range is wide. This is not an ideal approach.

This method introduces too much uncertainty, which is why I think we should use another one.
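Here is a rough sketch of the range-based approach, to show where the uncertainty comes from. The `observed` temperatures are invented for illustration, and `mean_abs_step` is a helper written for this example only.

```python
import random

# Hypothetical daily temperatures (°C) for Paris in January, invented for illustration.
observed = [2, 3, 3, 4, 6, 7, 9, 11, 12, 12]

low, high = min(observed), max(observed)
rng = random.Random(0)  # fixed seed so the run is reproducible

# Draw synthetic temperatures uniformly inside the observed range.
synthetic = [round(rng.uniform(low, high), 1) for _ in range(10)]

def mean_abs_step(values):
    """Average absolute change between consecutive values."""
    return sum(abs(b - a) for a, b in zip(values, values[1:])) / (len(values) - 1)

print("range:", low, "to", high)
print("observed day-to-day change:", mean_abs_step(observed))
print("synthetic day-to-day change:", mean_abs_step(synthetic))
```

Every synthetic value is possible on its own, but each one is drawn independently of the others. The generated series will typically jump around far more than the real one, so the new rows carry little reliable information.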

The solution

Instead of Data Augmentation, a similar technique exists to improve the performance of Machine Learning models on tabular data: Featurization.

The Featurization technique consists of using the features (the columns of your dataset) to create new information.

This way, you can represent them in a format that is more meaningful and better suited to your model.

For example, you can combine information from two features.

By carefully selecting the relevant features, it is possible to improve the performance of your model.
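As a minimal sketch of combining two features, the example below derives two new columns from a minimum and a maximum temperature. The dataset and the column names (`temp_min_c`, `temp_max_c`, `temp_range_c`, `temp_mean_c`) are hypothetical and chosen only for this illustration.

```python
import csv
import io

# Hypothetical dataset with two raw features (values invented for illustration).
raw = """city,temp_min_c,temp_max_c
Paris,2,8
Madrid,1,12
Oslo,-6,-1
"""
rows = list(csv.DictReader(io.StringIO(raw)))

for row in rows:
    t_min = float(row["temp_min_c"])
    t_max = float(row["temp_max_c"])
    # New features derived from the two existing columns.
    row["temp_range_c"] = t_max - t_min
    row["temp_mean_c"] = (t_max + t_min) / 2

for row in rows:
    print(row["city"], row["temp_range_c"], row["temp_mean_c"])
```

No row is invented here. The dataset keeps the same size, but each row now carries information that is only implicit in the raw columns.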

If you want to learn Featurization, we cover the technique in detail in this article!

Conclusion – Data Augmentation on CSV data

To sum up, while Data Augmentation is a useful technique for increasing the size and diversity of a training dataset, it is difficult to apply to CSV files or other tabular data without creating unrealistic values.

Featurization, on the other hand, is a similar technique that can be used to improve the performance of Machine Learning models.

It involves extracting and representing features in a more relevant format.

See you soon on Inside Machine Learning 😉

Tom Keldenich

Data Engineer & passionate about Artificial Intelligence!

Founder of the website Inside Machine Learning