In this article, we look at loss functions. Many of them exist in Machine Learning, and choosing the right one matters.
Loss functions are used throughout Machine Learning, but you will meet them most often when training Neural Networks.
So when you hear loss function, think model training, and Deep Learning in particular.
It’s one of the essential components that enables you to train your model.
But first of all, why do we need a loss function?
The loss function is used to calculate the error between the predictions of your model and the real values.
The lower the loss, the better the model performs!
During training, this allows the model to know whether it is moving in the right direction or not.
The goal is to get the loss as close to zero as possible.
However, there are several ways to calculate the error between predictions and real values.
This will depend mainly on the task you have to solve (but not only).
First, let’s dive into a short reminder about classes and labels which will help us to understand the different loss functions.
Class vs label
Often confused, class and label are two important concepts in Machine Learning.
The label is the data to predict in a dataset.
The class represents one of the possible choices during the prediction.
For example, if you need to predict the species of an animal (dog or cat) then :
- the species is the label
- dog and cat are the two classes
If you’re new to Deep Learning, you will most often encounter problems with one label and two classes.
This is binary classification.
Then, as you get more experienced, you will encounter multi-class classification problems.
Depending on the type of problem you face, you will have to choose the right loss function!
If you want to know more about the differences between class and label, I invite you to read our in-depth article on the subject!
Now, let’s move on to loss functions ☄️
Binary Classification
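Binary classification is the task of assigning each sample to one of the two classes of a label.
In this case, the model predicts the probability that the sample belongs to the positive class (the one encoded as 1); the probability of the other class is simply 1 minus that value.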
For this type of task, your last layer must have a sigmoid activation function (see our article on activation functions).
Binary Cross Entropy :
The Cross Entropy loss function is probably the one you will need most often.
It is used in Binary classification AND in multi-class classification!
Binary Cross Entropy is a special case of Cross Entropy.
It is designed to be used only on binary classification tasks.
Here your two classes must be represented by 0s and 1s.
To use it with Keras :
model.compile(loss='binary_crossentropy', optimizer=...)
With TensorFlow :
loss = tf.keras.losses.BinaryCrossentropy()
loss(y_true, y_pred)
and with PyTorch :
loss = nn.BCELoss()
loss(y_pred, y_true)
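And for the mathematicians, here is the Binary Cross Entropy formula, written with NumPy:
def binary_cross_entropy_loss(y_pred, y_true):
    # y_true holds 0/1 labels, y_pred the predicted probability of class 1
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))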
Let’s take our previous example: we have the species label and the two classes cat and dog.
Each image is labelled as follows: 0 for cats and 1 for dogs.
If one of our images represents a dog and our model predicts that it is a dog with 85% confidence (a predicted probability of 0.85), the loss calculation is as follows, using the natural logarithm as Keras, TensorFlow and PyTorch do:
Binary Cross Entropy = -(1 x ln(0.85) + 0 x ln(0.15))
= 0.16
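To check this kind of calculation by hand, here is a minimal sketch in pure Python (standard library only). It assumes dogs are encoded as 1 and cats as 0, and it uses the natural logarithm, which gives about 0.16 for a prediction of 0.85 (a base-10 logarithm would give about 0.07). The `eps` clipping value is an assumption added here to avoid `log(0)`; it is not part of the formula itself.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross entropy over a batch of 0/1 labels."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        # Clip the probability so that log(0) can never occur.
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)

# One dog image (label 1) predicted as "dog" with probability 0.85.
confident_right = binary_cross_entropy([1], [0.85])
# The same image predicted as "dog" with probability 0.15 only.
confident_wrong = binary_cross_entropy([1], [0.15])

print(round(confident_right, 4))  # 0.1625
print(round(confident_wrong, 4))  # 1.8971
```

The wrong prediction is punished much more heavily than the right one is rewarded, which is exactly the behaviour you want from a classification loss.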
Hinge Loss
Hinge Loss is less well known, but it can be a good alternative to Binary Cross Entropy.
But how do you choose between the two functions?
With Hinge Loss, your two classes must be represented by -1s and 1s.
Hinge Loss is a margin-based loss: it only penalises examples that are misclassified or classified correctly with too little margin, and it ignores the ones that are already classified with confidence. It is a good choice when you care about the decision itself rather than about a probability.
It is notably used in Support Vector Machines (SVM).
Binary Cross entropy, on the other hand, is a good choice when working with neural networks and when the objective is to predict probabilities.
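Be careful: if you use Hinge Loss, the output of your last layer must be able to take both negative and positive values. A tanh activation function, which gives a value between -1 and 1, is a common choice; a sigmoid, limited to values between 0 and 1, is not suitable.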
To use Hinge Loss with Keras and TensorFlow:
loss = tf.keras.losses.Hinge()
loss(y_true, y_pred)
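With PyTorch, there is no built-in loss for this binary formula (nn.HingeEmbeddingLoss computes a different, distance-based loss), so you can write it directly:
loss = torch.clamp(1 - y_true * y_pred, min=0).mean()
And here is the mathematical formula, written with NumPy:
def hinge_loss(y_pred, y_true):
    # y_true holds -1/1 labels, y_pred the raw model output
    return np.mean(np.maximum(0, 1 - y_true * y_pred))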
Going back to our example: you have the species label and the two classes cat and dog.
Each image is represented as follows: -1 for cats and 1 for dogs.
If one of your images represents a dog and your model outputs 0.85 for it (the correct side, but still inside the margin), the loss calculation is as follows:
Hinge Loss = np.maximum(0, 1 - 1 * 0.85)
= np.maximum(0, 0.15)
= 0.15
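To get a feel for how the margin works, here is a small pure-Python sketch of the same calculation. It assumes labels of -1 (cat) and 1 (dog) and a model that returns a single score per image; the example scores other than 0.85 are made up for illustration.

```python
def hinge_loss(y_true, scores):
    """Mean hinge loss; labels are -1 or 1, scores are raw model outputs."""
    losses = [max(0.0, 1.0 - t * s) for t, s in zip(y_true, scores)]
    return sum(losses) / len(losses)

# A dog image (label 1) scored 0.85: correct, but inside the margin.
print(round(hinge_loss([1], [0.85]), 2))   # 0.15
# The same image scored -0.85: the model says "cat", the loss is large.
print(round(hinge_loss([1], [-0.85]), 2))  # 1.85
# A score beyond the margin (greater than 1) costs nothing at all.
print(hinge_loss([1], [1.5]))              # 0.0
```

Unlike Binary Cross Entropy, the loss drops to exactly zero once an example is classified with enough margin, so the model stops spending effort on it.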
Multi-class classification
Multi-class classification is the classification of data into a label containing more than two classes.
Again, the model will predict the probability of belonging to each class, which is why the last layer uses a softmax activation function.
Cross Entropy
Cross Entropy is one of the most popular loss functions.
Again, it is used in Binary classification AND in multi-class classification!
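With the sparse version used below, each of your classes must be represented by a single integer: 0, 1, 2, etc. If your labels are one-hot vectors instead, use 'categorical_crossentropy' (Keras) or tf.keras.losses.CategoricalCrossentropy() (TensorFlow).
To use it with Keras:
model.compile(loss='sparse_categorical_crossentropy', optimizer=...)
With TensorFlow:
loss = tf.keras.losses.SparseCategoricalCrossentropy()
loss(y_true, y_pred)
And with PyTorch:
# nn.CrossEntropyLoss expects raw scores (logits), not softmax probabilities
loss = nn.CrossEntropyLoss()
loss(y_pred, y_true)
The mathematical formula for Cross Entropy is similar to that of Binary Cross Entropy:
def cross_entropy_loss(y_pred, y_true):
    # y_true is a one-hot vector, y_pred the predicted probabilities
    return -np.sum(y_true * np.log(y_pred))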
Let’s imagine this time that we have to predict 3 species: dog, cat and duck.
For each image we have: [1, 0, 0] for cats, [0, 1, 0] for dogs and [0, 0, 1] for ducks.
If one of your images represents a cat and your model predicts that it is a duck with 72% confidence (leaving only 16% for cat and 12% for dog), the loss calculation is as follows, using the natural logarithm:
Categorical Cross Entropy = -(1 x ln(0.16) + 0 x ln(0.12) + 0 x ln(0.72))
= 1.83
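The cat/dog/duck calculation can be reproduced with a short pure-Python sketch. It uses the natural logarithm, which gives about 1.83 (a base-10 logarithm would give about 0.8), and it shows that a one-hot target and an integer class index lead to the same loss. The two function names are chosen for this illustration and only mirror the Keras naming.

```python
import math

def categorical_cross_entropy(one_hot, probs):
    """Cross entropy for one sample whose target is a one-hot vector."""
    # Terms where the target is 0 contribute nothing, so they are skipped.
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t > 0)

def sparse_categorical_cross_entropy(class_index, probs):
    """Same loss when the target is a class index instead of a one-hot vector."""
    return -math.log(probs[class_index])

# Predicted probabilities for cat, dog and duck; the image is a cat.
probs = [0.16, 0.12, 0.72]
loss_one_hot = categorical_cross_entropy([1, 0, 0], probs)
loss_sparse = sparse_categorical_cross_entropy(0, probs)

print(round(loss_one_hot, 2))  # 1.83
print(round(loss_sparse, 2))   # 1.83
```

Only the probability given to the true class matters: the 72% assigned to duck does not appear in the result directly, it only matters because it left so little for cat.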
Kullback-Leibler Divergence
The Kullback-Leibler Divergence, also called KL Divergence, is used for classification but also for generation tasks.
So how do you choose between KL Divergence and Categorical Cross Entropy?
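The choice will depend on you.
KL Divergence is equal to Cross Entropy minus the entropy of the true distribution. That entropy does not depend on the model, so both losses push the model in the same direction; with one-hot labels the entropy is zero and the two values are identical. The loss values only differ when your targets are soft probability distributions.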
I advise you to start with Categorical Cross Entropy. Then, when optimising your model, try the KL Divergence.
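To use it with Keras and TensorFlow:
loss = tf.keras.losses.KLDivergence()
loss(y_true, y_pred)
To use it with PyTorch (the prediction must be passed as log-probabilities):
loss = nn.KLDivLoss(reduction='batchmean')
loss(torch.log(y_pred), y_true)
And finally, here is the mathematical formula:
def kullback_leibler_divergence(y_pred, y_true):
    # assumes every entry of y_true and y_pred is strictly positive
    return np.sum(y_true * np.log(y_true / y_pred))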
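To make the link between the two losses concrete, here is a minimal pure-Python sketch. It relies on the standard identity KL = Cross Entropy - entropy of the target; the `soft_target` values are invented for the illustration, and zero entries in the target are skipped (the usual convention that 0 x log 0 = 0).

```python
import math

def entropy(p):
    """Entropy of a distribution, skipping zero entries."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross entropy between target p and prediction q."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL divergence from prediction q to target p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

prediction = [0.16, 0.12, 0.72]
one_hot_target = [1.0, 0.0, 0.0]
soft_target = [0.7, 0.2, 0.1]  # hypothetical soft labels

for name, target in [("one-hot", one_hot_target), ("soft", soft_target)]:
    kl = kl_divergence(target, prediction)
    ce = cross_entropy(target, prediction)
    h = entropy(target)
    # In both cases KL matches CE - H up to rounding.
    print(f"{name}: KL = {kl:.4f}, CE = {ce:.4f}, CE - H = {ce - h:.4f}")
```

With the one-hot target the entropy is zero, so KL Divergence and Cross Entropy print the same value; with the soft target they differ by a constant that the model cannot influence.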
Regression Tasks
Regression is the prediction of a value, not a probability.
In this case, there is no class, only a label.
Mean Squared Error
The Mean Squared Error (MSE) is one of the most commonly used loss functions in regression problems.
It calculates the mean squared difference between the predicted and real values.
To use it with Keras and TensorFlow :
loss = tf.keras.losses.MeanSquaredError()
loss(y_true, y_pred)
With PyTorch :
loss = nn.MSELoss()
loss(y_pred, y_true)
And here is the mathematical formula:
def mean_squared_error(y_pred, y_true):
    return np.mean((y_pred - y_true) ** 2)
Let’s take an example: you have to predict the price of Google stock.
If last Friday the stock was at $100 and your model predicted that the stock would be at $92, the loss calculation is as follows:
Mean Squared Error = (100 - 92)**2
= 64
You can see that the loss is far from 0.
This is because here the loss is no longer calculated on probabilities but on real values.
To evaluate the quality of this result, it is important to know the context of the problem.
For example, here the result of the loss calculation is 64.
This is a large number.
One might think that the model is weak.
But if the stock price is $100, even a reasonable model can be expected to miss by $5 to $10.
If the difference between the prediction and the actual value is between 5 and 10, the loss will then be between 25 and 100.
This means that a loss of 64 is within the range you would expect from a reasonable model.
To explore this further, you can calculate the loss on the prediction of an investment:
- real value = 1.2 billion dollars
- prediction = 1.19 billion dollars
The difference is 10 million dollars, so the squared error is 10^14: the value of the loss is enormous. But, by contextualising the result, you’ll realize that the prediction is very accurate, since it is off by less than 1%.
In a Regression task, before evaluating the loss, establish the context of the problem.
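One simple way to establish that context is to look at the error relative to the real value. The sketch below compares the stock example and the investment example; `relative_error` is a hypothetical helper introduced here for illustration, not a loss function from Keras or PyTorch.

```python
def squared_error(y_true, y_pred):
    """Squared error for a single prediction (the MSE of a batch of one)."""
    return (y_true - y_pred) ** 2

def relative_error(y_true, y_pred):
    """Hypothetical helper: absolute error as a fraction of the real value."""
    return abs(y_true - y_pred) / abs(y_true)

examples = {
    "stock": (100.0, 92.0),
    "investment": (1.2e9, 1.19e9),
}

for name, (real, predicted) in examples.items():
    se = squared_error(real, predicted)
    rel = relative_error(real, predicted)
    print(f"{name}: squared error = {se:.2e}, relative error = {rel:.2%}")
# stock: squared error = 6.40e+01, relative error = 8.00%
# investment: squared error = 1.00e+14, relative error = 0.83%
```

The investment prediction has a loss that is twelve orders of magnitude larger, yet it is the more accurate of the two once you compare it to the scale of the value being predicted.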
Mean Absolute Error
Another very commonly used function for regression tasks is the Mean Absolute Error (MAE).
It calculates the mean absolute difference between the predicted and real values.
To use it with Keras and TensorFlow :
loss = tf.keras.losses.MeanAbsoluteError()
loss(y_true, y_pred)
With PyTorch :
loss = nn.L1Loss()
loss(y_pred, y_true)
And here is the mathematical formula:
def mean_absolute_error(y_pred, y_true):
    return np.mean(np.abs(y_pred - y_true))
Let’s go back to our example: you have to predict the price of Google stock.
If last Friday the stock was at $100 and your model predicted that the stock would be at $92, the loss calculation is as follows:
Mean Absolute Error = abs(100 - 92)
= 8
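On a single prediction, MSE and MAE only differ by a square. The difference becomes clearer on a batch, so here is a small pure-Python sketch comparing the two on made-up stock prices (all numbers below are invented for the illustration), once with small errors everywhere and once with a single large miss.

```python
def mean_squared_error(y_true, y_pred):
    """Mean of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_absolute_error(y_true, y_pred):
    """Mean of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [100.0, 102.0, 98.0, 101.0]
all_close = [99.0, 103.0, 97.0, 100.0]  # every prediction is off by 1
one_miss = [99.0, 103.0, 97.0, 81.0]    # the last prediction is off by 20

print(mean_squared_error(y_true, all_close))   # 1.0
print(mean_absolute_error(y_true, all_close))  # 1.0
print(mean_squared_error(y_true, one_miss))    # 100.75
print(mean_absolute_error(y_true, one_miss))   # 5.75
```

A single bad prediction multiplies the MSE by about 100 but the MAE by less than 6: squaring makes the MSE far more sensitive to large errors.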
Huber Loss
Huber Loss is a lesser known, yet very effective function.
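It is particularly useful when your dataset contains a lot of outliers (data points that are far from the rest of the data): it behaves like the MSE for small errors and like the MAE for errors larger than a threshold, delta, so a few extreme values do not dominate the loss.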
Here is how to use it with Keras and TensorFlow:
loss = tf.keras.losses.Huber()
loss(y_true, y_pred)
With PyTorch :
loss = nn.HuberLoss()
loss(y_pred, y_true)
And the mathematical formula (more complex than the previous ones):
def huber_loss(y_true, y_pred, delta=1.0):
    residual = y_true - y_pred
    condition = np.abs(residual) <= delta
    quadratic_loss = 0.5 * residual**2
    linear_loss = delta * (np.abs(residual) - 0.5 * delta)
    return np.mean(np.where(condition, quadratic_loss, linear_loss))
Let’s go through our example again: you have to predict the price of Google stock.
If last Friday the stock was at $100 and your model predicted that the stock would be at $92, the loss calculation is as follows:
Huber Loss = 1.0 * (abs(100 - 92) - 0.5 * 1.0)    (linear branch, since abs(100 - 92) > delta = 1.0)
= 7.5
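To see both branches of the Huber Loss at work, here is a minimal pure-Python sketch for a single error value. It assumes the same default of delta = 1.0 as above; the other residuals and the delta of 10.0 are chosen only for illustration.

```python
def huber(residual, delta=1.0):
    """Huber loss for a single residual (real value minus prediction)."""
    error = abs(residual)
    if error <= delta:
        # Small error: quadratic branch, like half the squared error.
        return 0.5 * error ** 2
    # Large error: linear branch, like the absolute error shifted down.
    return delta * (error - 0.5 * delta)

print(huber(0.5))               # 0.125 (quadratic branch)
print(huber(100 - 92))          # 7.5   (linear branch)
print(huber(100 - 92, 10.0))    # 32.0  (quadratic again with a larger delta)
```

The value of delta decides what counts as a "large" error, so it is worth choosing it with the scale of your target in mind.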
Wrap-up
During training, the neural network’s goal is to bring the loss as close to zero as possible.
It is crucial to choose a good loss function, otherwise your model may never do what you wanted.
Be careful with regression tasks: whether a loss value is good or bad depends on your context (see the example in Mean Squared Error).
Below, I provide you with my template on which loss function to use depending on the type of problem and the activation function you’ve chosen in the last layer of your model.
Problem type | Activation function of the last layer | Loss function |
Binary Classification | sigmoid | Binary Cross Entropy |
Binary Classification | tanh | Hinge Loss |
Multi-class Classification | softmax | Categorical Cross Entropy or KL Divergence |
Regression | none (linear output) | MAE, MSE or Huber Loss |
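The pairing in this table follows from the range of values each activation can produce. The pure-Python sketch below (standard library only, with made-up input scores) shows the three ranges: sigmoid gives a probability between 0 and 1, tanh gives a value between -1 and 1, and softmax gives one probability per class that sum to 1.

```python
import math

def sigmoid(z):
    """Squashes a score into (0, 1): a probability for Binary Cross Entropy."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Turns a list of scores into probabilities that sum to 1."""
    highest = max(scores)  # subtracted for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

print(sigmoid(0.0))    # 0.5
print(math.tanh(0.0))  # 0.0

probs = softmax([2.0, 1.0, 0.1])
print(abs(sum(probs) - 1.0) < 1e-9)  # True
```

A regression model needs none of these: its output must be free to take any value, so the last layer is left without an activation.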
And if your loss can’t get close to zero, ask yourself these four questions:
- What is my problem?
- Have I chosen the right loss function?
- If I am doing regression, what is the context of my problem?
- Is my model optimised?
If you have reached question 4, it means that you want to optimise your model!
In that case you can learn the best optimization techniques with our articles dedicated to the subject:
See you soon on Inside Machine Learning 😉