Is it possible to perform Data Augmentation on CSV data without degrading your dataset? Let’s answer that in this article.
Data Augmentation is the process of generating additional training data.
It consists of applying transformations to existing data to obtain new data.
This method is commonly used in Machine Learning to increase the size and diversity of the training data set.
Ultimately, this can help improve the model's generalization and thus its performance.
What kind of data is concerned?
Although Data Augmentation can be applied to any type of data, it is generally easiest to perform on images, audio and text data.
These types of data can be transformed in various ways without changing the information they contain.
For example, a word in a sentence can be changed to a synonym.
The sentence will keep its meaning, but the data will be different.
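To make the synonym idea concrete, here is a minimal sketch in Python. The `SYNONYMS` table and the `replace_synonyms` function are hypothetical names invented for this illustration; a real project would more likely rely on a proper thesaurus or an NLP library.

```python
# Hypothetical, hand-written synonym table (an assumption for illustration).
SYNONYMS = {"quick": "fast", "happy": "glad"}


def replace_synonyms(sentence: str) -> str:
    """Return a copy of the sentence with known words swapped for a synonym."""
    words = sentence.split()
    # Words without an entry in the table are kept unchanged.
    return " ".join(SYNONYMS.get(word, word) for word in words)


original = "the quick dog looks happy"
augmented = replace_synonyms(original)
print(augmented)  # the fast dog looks glad
```

The augmented sentence is a new training example, yet a reader would give it the same label as the original.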
For an image, we can apply a rotation.
This allows us to create new data while ensuring that it remains plausible, meaning that it could belong to the real world.
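As a rough sketch of the rotation idea, assuming the image is stored as a NumPy array, `np.rot90` is enough for a quarter turn. The tiny 2x3 "image" below is made up for the example.

```python
import numpy as np

# A tiny 2x3 grayscale "image"; the pixel values are invented for illustration.
image = np.array([[1, 2, 3],
                  [4, 5, 6]])

# Rotate by 90 degrees counter-clockwise: the pixels move, the content stays.
rotated = np.rot90(image)

print(rotated.shape)  # (3, 2)
```

The rotated array contains exactly the same pixel values as the original, only arranged differently, which is why the new sample stays realistic.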
But for CSV data, it’s more difficult.
The case of CSV data for Data Augmentation
CSV data, or tabular data such as you would find in an Excel file, can technically be augmented too.
You can easily create a new row and put random data in a column.
The problem?
You risk creating data that does not exist in the real world.
For example, if we have temperature data for different cities, we could augment this data by assigning random values to new rows.
By the way, if your goal is to master Deep Learning, I've prepared the Action plan to Master Neural networks for you.
7 days of free advice from an Artificial Intelligence engineer to learn how to master neural networks from scratch:
- Plan your training
- Structure your projects
- Develop your Artificial Intelligence algorithms
I have based this program on scientific facts, on approaches proven by researchers, but also on my own techniques, which I have devised as I have gained experience in the field of Deep Learning.
To access it, click here:
Now we can get back to what I was talking about earlier.
With this approach, we could end up with data like 60°C in Paris in January.
In this case, we don’t augment our data, we degrade it.
Why?
Because this kind of temperature in Paris in January is impossible. It does not belong to the real world.
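Here is a small sketch of that naive approach, using pandas and NumPy. The dataset, the column names, the sampling range and the `PLAUSIBLE_MAX_C` threshold are all assumptions chosen for illustration, not real measurements.

```python
import numpy as np
import pandas as pd

# Toy dataset; the values are invented for illustration.
df = pd.DataFrame({
    "city": ["Paris", "Paris", "Paris"],
    "month": ["January", "January", "January"],
    "temperature_c": [3.0, 5.5, 7.0],
})

rng = np.random.default_rng(seed=0)

# Naive augmentation: new rows with temperatures drawn uniformly from a wide,
# arbitrary range (-20 to 60 degrees C), ignoring the city and the month.
n_new = 100
fake_rows = pd.DataFrame({
    "city": "Paris",
    "month": "January",
    "temperature_c": rng.uniform(-20.0, 60.0, size=n_new),
})
augmented = pd.concat([df, fake_rows], ignore_index=True)

# Assumed upper bound for a plausible January temperature (illustrative only).
PLAUSIBLE_MAX_C = 20.0
n_implausible = int((fake_rows["temperature_c"] > PLAUSIBLE_MAX_C).sum())
print(f"{n_implausible} of {n_new} generated rows exceed {PLAUSIBLE_MAX_C} C")
```

The dataset is bigger, but many of the new rows describe a January that never happens, so a model trained on them learns from noise.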
We could instead define a range of plausible values for the temperature in Paris in January and sample randomly within it.
But the temperature can change a lot over a month, so this is not an ideal approach.
This method introduces too much uncertainty, which is why I think we should use another method.
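One way to see where the uncertainty comes from is to try the range-based approach on a small example. In the sketch below, the bounds `LOW_C` and `HIGH_C` are assumed values picked for illustration, not climate data.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Assumed plausible range for Paris in January, in degrees C (illustrative only).
LOW_C, HIGH_C = -5.0, 15.0

# Draw 31 "daily" temperatures independently within the range.
synthetic = rng.uniform(LOW_C, HIGH_C, size=31)

# Each value is plausible on its own, but consecutive days are unrelated,
# so the jump from one day to the next can be unrealistically large.
largest_jump = np.abs(np.diff(synthetic)).max()
print(f"largest day-to-day jump: {largest_jump:.1f} C")
```

Every generated value falls inside the range, yet nothing ties it to what actually happened on a given day, so the new rows add volume without adding reliable information.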
The solution
Instead of Data Augmentation, a similar technique exists to improve the performance of Machine Learning models on tabular data: Featurization.
The Featurization technique consists of using the features (the columns of your dataset) to create new information.
This way, you can represent them in a format that is more meaningful and better suited to your model.
For example, you can combine information from two features into a new one.
By carefully selecting the relevant features, it is possible to improve the performance of your model.
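As a minimal sketch of combining two features, assuming a pandas DataFrame with a daily minimum and maximum temperature (the column names and values are invented for this example):

```python
import pandas as pd

# Toy dataset; column names and values are invented for illustration.
df = pd.DataFrame({
    "city": ["Paris", "Paris", "Lyon"],
    "temp_min_c": [1.0, 3.0, 0.0],
    "temp_max_c": [6.0, 9.0, 8.0],
})

# New features built by combining two existing columns.
df["temp_range_c"] = df["temp_max_c"] - df["temp_min_c"]
df["temp_mean_c"] = (df["temp_max_c"] + df["temp_min_c"]) / 2

print(df)
```

No row is invented here: the dataset keeps the same three real observations, but each one now carries two extra columns that a model may find easier to use.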
If you want to learn Featurization, we cover the technique in detail in this article!
Conclusion – Data Augmentation on CSV data
To sum up, while Data Augmentation is a useful technique for increasing the size and diversity of a training dataset, it is poorly suited to CSV files and other tabular data, where randomly generated values can easily fall outside the real world.
Featurization, on the other hand, is a similar technique that can be used to improve the performance of Machine Learning models.
It involves extracting and representing features in a more relevant format.
See you soon on Inside Machine Learning 😉