Data Augmentation – The Simple Method You Need for CSV Data

Is it possible to perform Data Augmentation on CSV data without degrading your dataset? Let’s answer that in this article.

Data Augmentation is the process of generating additional training data.

It consists of applying transformations to existing data to obtain new data.

This method is commonly used in Machine Learning to increase the size and diversity of the training data set.

Ultimately, this can help the model generalize better and thus improve its performance on data it has never seen.

What kind of data does it apply to?

Although Data Augmentation can be applied to any type of data, it is generally easiest to perform on images, audio and text data.

These types of data can be transformed in various ways without changing the information they contain.

For example, a word in a sentence can be changed to a synonym.

The sentence will keep its meaning, but the data will be different.
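
To make this concrete, here is a minimal sketch of synonym replacement. The `SYNONYMS` table and the `augment_sentence` function are hypothetical names written for this illustration; a real project would more likely rely on a proper thesaurus or a dedicated library.

```python
import random

# Hypothetical, hand-written synonym table (an assumption for illustration only).
SYNONYMS = {
    "big": ["large", "huge"],
    "happy": ["glad", "joyful"],
}

def augment_sentence(sentence, rng):
    """Return a copy of the sentence where known words are swapped for a synonym."""
    words = sentence.split()
    new_words = [rng.choice(SYNONYMS[w]) if w in SYNONYMS else w for w in words]
    return " ".join(new_words)

rng = random.Random(0)  # fixed seed so the example is reproducible
original = "the big dog looks happy"
augmented = augment_sentence(original, rng)

print(original)
print(augmented)
```

The augmented sentence has the same length and the same meaning as the original, but it is a different data point for the model.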

For an image, we can apply a rotation.

This allows us to create new data while ensuring that it remains plausible, meaning that it could belong to the real world.
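
The rotation idea can be sketched in a few lines. This is only an illustration on a tiny grid of pixel values stored as nested lists (an assumption made to keep the example dependency-free); in practice you would use an image library.

```python
def rotate_90_clockwise(image):
    """Rotate a 2D grid of pixels by 90 degrees clockwise."""
    # Reverse the row order, then transpose rows and columns.
    return [list(row) for row in zip(*image[::-1])]

# A tiny 2x3 "image" with made-up pixel values.
image = [
    [1, 2, 3],
    [4, 5, 6],
]

rotated = rotate_90_clockwise(image)
print(rotated)
```

Here `rotated` is `[[4, 1], [5, 2], [6, 3]]`: the pixel values are exactly the same, only their arrangement changes, so the new image is still a valid one.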

But for CSV data, it’s more difficult.

The case of CSV data for Data Augmentation

CSV data, or tabular data (the kind you would find in an Excel file), can also be augmented.

You can easily create a new row and fill its columns with random values.

The problem?

You risk creating data that does not exist in the real world.

For example, if we have temperature data for different cities, we could augment it by assigning random values to new rows.

With this approach, we could end up with data like 60°C in Paris in January.

In this case, we don't augment our data, we degrade it.

Why?

Because such a temperature in Paris in January is impossible. It does not belong to the real world.
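
Here is a small sketch of that problem. The existing rows, the sampling range and the plausibility bounds are all made up for the example; in particular, `PLAUSIBLE_MIN` and `PLAUSIBLE_MAX` are rough assumptions, not climatological data.

```python
import random

rng = random.Random(42)  # fixed seed so the example is reproducible

# Existing rows: (city, month, temperature in °C). Values are made up for illustration.
rows = [("Paris", "January", 4.0), ("Paris", "January", 6.5)]

# Naive augmentation: new rows with a temperature drawn from an arbitrary wide range.
naive_rows = [
    ("Paris", "January", round(rng.uniform(-50, 60), 1))
    for _ in range(1000)
]

# Assumed plausibility bounds for Paris in January (illustrative only).
PLAUSIBLE_MIN, PLAUSIBLE_MAX = -15.0, 20.0

# Keep the generated rows whose temperature falls outside the assumed bounds.
implausible = [
    row for row in naive_rows
    if not PLAUSIBLE_MIN <= row[2] <= PLAUSIBLE_MAX
]

print(f"{len(implausible)} of {len(naive_rows)} generated rows are implausible")
```

Every implausible row that ends up in the training set teaches the model something false about the real world.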

We could instead define a range of plausible values for the temperature in Paris in January and pick randomly within that range.

But the temperature can change a lot over a month, so a single range remains imprecise. This is not an ideal approach.

This method introduces too much uncertainty, which is why I think we should use another one.
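
One way to see this uncertainty is to compare a real series with values drawn from its range. The sketch below is only an illustration: the `observed` temperatures are hypothetical, and `max_jump` is a helper written for this example.

```python
import random

rng = random.Random(7)  # fixed seed so the example is reproducible

# Hypothetical daily temperatures (°C) for a few consecutive January days in Paris.
observed = [3.0, 3.5, 4.0, 5.0, 5.5, 6.0, 7.0]

low, high = min(observed), max(observed)

# Range-based augmentation: draw each new value uniformly between the observed extremes.
synthetic = [round(rng.uniform(low, high), 1) for _ in observed]

def max_jump(values):
    """Largest absolute change between two consecutive values."""
    return max(abs(b - a) for a, b in zip(values, values[1:]))

print("observed :", observed, "max jump:", max_jump(observed))
print("synthetic:", synthetic, "max jump:", max_jump(synthetic))
```

Each synthetic value is plausible on its own, but the values are drawn independently, so nothing guarantees that they follow the day-to-day pattern of the real measurements.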

The solution

Instead of Data Augmentation, a similar technique exists to improve the performance of Machine Learning models on tabular data: Featurization.

Rather than adding new rows, Featurization consists of using the features (the columns of your dataset) to create new information.

This way, you can represent them in a format that is more meaningful and better suited to your model.

For example, you can combine information from two features into a new one.

By carefully selecting the relevant features, it is possible to improve the performance of your model.
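
As a minimal sketch of combining two features, the example below derives two new columns from a small CSV. The file content, the column names (`temp_min`, `temp_max`) and the derived features (`temp_range`, `temp_mean`) are all assumptions chosen for illustration.

```python
import csv
import io

# Hypothetical CSV content; column names and values are made up for illustration.
raw_csv = """city,temp_min,temp_max
Paris,2.0,7.5
Madrid,1.0,11.0
"""

rows = list(csv.DictReader(io.StringIO(raw_csv)))

for row in rows:
    t_min = float(row["temp_min"])
    t_max = float(row["temp_max"])
    # New features derived from two existing columns.
    row["temp_range"] = t_max - t_min
    row["temp_mean"] = (t_max + t_min) / 2

for row in rows:
    print(row["city"], row["temp_range"], row["temp_mean"])
```

Notice that the number of rows does not change: no new data point is invented, so there is no risk of creating values that do not exist in the real world. Only the description of each existing row becomes richer.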

If you want to learn Featurization, we cover the technique in detail in this article!

Conclusion – Data Augmentation on CSV data

To sum up, while Data Augmentation is a useful technique for increasing the size and diversity of a training dataset, it is hard to apply to CSV files or other tabular data without creating rows that do not belong to the real world.

Featurization, on the other hand, is a similar technique that can be used to improve the performance of Machine Learning models.

It involves extracting and representing features in a more relevant format.

See you soon on Inside Machine Learning 😉

Tom Keldenich

Artificial Intelligence engineer and data enthusiast!

Founder of the website Inside Machine Learning
