Activation function, how does it work? – A simple explanation

In this article we will see in detail what an activation function is and how it is used in a Deep Learning model!

Do you remember? In this article (for French speakers) we saw that every neuron of a Deep Learning model applies a transformation to the data it receives as input. This transformation consists of applying the weights (for French speakers) and then the activation function.

The user chooses the activation function that will be applied in the neurons of a layer. For example, this code creates a Dense layer that uses the sigmoid function as its activation function:

model.add(layers.Dense(1, activation='sigmoid'))

What is an activation function?

Change the perspective

The activation function is primarily used to modify the data in a non-linear way. This non-linearity makes it possible to change the spatial representation of the data.

Simply put, the activation function lets us change the way we look at a piece of data.

For example, suppose our data says: each week, 50% of the customers of a store buy chocolate bars. The activation function would let us turn this data into "50% of the customers like chocolate" or "50% of the customers plan to buy chocolate each week".

The change in representation can, in our example, allow us to adjust the marketing strategy, or the inventory holdings, of our store.

Since a model is composed of multiple layers, and thus multiple activation functions, successive and complex changes of representation take place. This allows us to have a new point of view on our data that humans would be unable to have in a short time.

Further clarifications

We must not confuse the activation function with the loss function (for French speakers). The loss function is applied to the model as a whole, so there is only one; it is used to measure the performance of the model.

The activation function, on the contrary, is specific to each layer; it transforms the data.

The particularity of the activation function is that it is non-linear. This non-linearity is what makes it possible to change the representation of the data and to look at it from a new angle. Such a change of representation would not be possible with linear transformations alone, because a stack of linear transformations collapses into a single linear transformation.
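To make the point about non-linearity concrete, here is a minimal plain-Python sketch (not Keras code). The one-input "layers" and their coefficients are made up for illustration: two stacked linear layers behave exactly like one linear layer, while inserting a ReLU between them bends the result.

```python
def linear(x, a, b):
    """A toy linear layer with one input and one output: a * x + b."""
    return a * x + b


def relu(x):
    """ReLU activation: max(x, 0)."""
    return max(x, 0.0)


def two_linear_layers(x):
    # Two stacked layers WITHOUT an activation function (hypothetical weights).
    return linear(linear(x, 2.0, 1.0), -3.0, 0.5)


def single_equivalent_layer(x):
    # -3 * (2x + 1) + 0.5 simplifies to -6x - 2.5: still a single straight line.
    return linear(x, -6.0, -2.5)


def two_layers_with_relu(x):
    # Same two layers, with a ReLU applied between them.
    return linear(relu(linear(x, 2.0, 1.0)), -3.0, 0.5)


for x in (-2.0, 0.0, 1.5):
    print(x, two_linear_layers(x), single_equivalent_layer(x), two_layers_with_relu(x))
```

The first two columns always match, which is why a model without activation functions gains nothing from extra layers. The third column no longer follows a straight line.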

Each neuron of a layer applies the activation function of the layer to its data. The result is different for each neuron because each one has its own weights.

Keep in mind: the loss function is applied to the whole model, so it contains the activation functions. When we calculate the gradient (the derivative of the loss function), we therefore also calculate the derivatives of the activation functions.
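As an illustration of how the derivative of the activation function ends up inside the gradient, here is a small plain-Python sketch. It assumes a single sigmoid neuron without bias and a squared-error loss; both choices are only for the example and are not taken from the article.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, x, y):
    """Squared error of one sigmoid neuron: (sigmoid(w * x) - y) ** 2."""
    return (sigmoid(w * x) - y) ** 2


def loss_gradient(w, x, y):
    """dLoss/dw obtained with the chain rule."""
    a = sigmoid(w * x)
    d_loss_d_a = 2.0 * (a - y)   # derivative of the loss w.r.t. the neuron output
    d_a_d_z = a * (1.0 - a)      # derivative of the sigmoid activation
    d_z_d_w = x                  # derivative of the weighted input w.r.t. the weight
    return d_loss_d_a * d_a_d_z * d_z_d_w


def numerical_gradient(w, x, y, eps=1e-6):
    """Central finite difference, used only to check the chain-rule result."""
    return (loss(w + eps, x, y) - loss(w - eps, x, y)) / (2.0 * eps)


print(loss_gradient(0.5, 2.0, 1.0), numerical_gradient(0.5, 2.0, 1.0))
```

The middle factor, d_a_d_z, is the derivative of the activation function: a different activation changes the gradient and therefore the way the weights are updated.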

The different activation functions

In Python, an activation function can be added as a layer of its own with Activation(activations.activation_function). Here is an example with the relu function:

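from tensorflow.keras import Sequential
from tensorflow.keras import layers
from tensorflow.keras import activations

# A Dense layer without activation, followed by ReLU as a separate layer
model = Sequential()
model.add(layers.Dense(64))
model.add(layers.Activation(activations.relu))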

The activation function can also be passed directly when creating the layer:

model.add(layers.Dense(64, activation='relu'))

Depending on the problem to be solved (classification, regression, …) we use different activation functions.

To choose the right activation function, we have to consider both the direct transformation it applies to the data and its derivative, which is used to adjust the weights during backpropagation.

ReLU

The Rectified Linear Unit (ReLU) function is the simplest and most widely used activation function.

It returns x if x is greater than 0, and 0 otherwise. In other words, it is the maximum of x and 0:

ReLU_function(x) = max(x, 0)

ReLU function – Rectified Linear Unit

This function acts as a filter on our data: it lets the positive values (x > 0) pass through to the following layers of the neural network and sets the others to 0. It is used almost everywhere in the intermediate layers, but not in the final layer.

tf.keras.activations.relu(x, alpha=0.0, max_value=None, threshold=0) 
  • x: input data, a tensor.
  • alpha: a real number that governs the slope for values below the threshold.
  • max_value: a real number that defines the saturation threshold (the largest value that the function will return).
  • threshold: a real number that gives the threshold value of the activation function, below which the values are damped or set to zero.
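The effect of these parameters is easier to see on scalar values. The sketch below is a plain-Python reading of the parameter descriptions above, applied to one number at a time; it is not the TensorFlow implementation, which works on tensors.

```python
def relu(x, alpha=0.0, max_value=None, threshold=0.0):
    """Scalar sketch of ReLU with its optional parameters."""
    if max_value is not None and x >= max_value:
        return max_value              # saturation: never return more than max_value
    if x >= threshold:
        return x                      # values at or above the threshold pass unchanged
    return alpha * (x - threshold)    # below the threshold: damped (or zero if alpha == 0)


print(relu(3.0))                    # plain ReLU on a positive value
print(relu(-2.0, alpha=0.1))        # small negative slope below the threshold
print(relu(10.0, max_value=6.0))    # capped at max_value
print(relu(0.5, threshold=1.0))     # below the threshold, so set to zero
```

With the default values (alpha=0, no max_value, threshold=0) this reduces to max(x, 0).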

Sigmoid

The Sigmoid function returns a value between 0 and 1, which can be read as a probability. It is therefore widely used for binary classification, when a model must choose between only two labels.

Thus, for the classification of movie reviews, the closer the value returned by Sigmoid is to 1, the more the model considers that the review is positive.

On the contrary, the closer it is to 0, the more it is considered as negative.

Sigmoid_function(x) = 1 / (1 + exp(-x))

Sigmoid function

The Sigmoid function is very simple to apply in Python because it takes no parameter other than the input variable:

tf.keras.activations.sigmoid(x)
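For the movie review example, the formula can be sketched in plain Python as follows. The predict_label helper and its 0.5 decision threshold are assumptions made for illustration, not part of Keras.

```python
import math


def sigmoid(x):
    """Sigmoid written so that large negative inputs do not overflow math.exp."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def predict_label(score, threshold=0.5):
    """Hypothetical helper: turn the raw output of the last neuron into a label."""
    return "positive" if sigmoid(score) >= threshold else "negative"


print(sigmoid(0.0))          # exactly halfway between the two labels
print(predict_label(2.0))
print(predict_label(-2.0))
```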

Softmax

The Softmax function transforms a vector of real numbers into a probability vector: each output lies between 0 and 1 and the outputs sum to 1.

It is often used in the final layer of a classification model, especially for multiclass problems.

In the Softmax function, each vector is processed independently. The axis argument defines the input axis on which the function is applied.

Softmax_function(x) = exp(x) / tf.reduce_sum(exp(x))

That is, for each component x_i of the vector x:

Softmax_function(x)_i = exp(x_i) / sum_j(exp(x_j))

Softmax function
tf.keras.activations.softmax(x, axis=-1)
  • axis: Integer, axis along which the softmax normalization is applied.
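The Softmax formula can be sketched in plain Python for a single vector. Subtracting the maximum before taking the exponential is a common numerical-stability trick added here; it does not change the result.

```python
import math


def softmax(values):
    """Softmax of a list of real numbers (one vector, i.e. one axis)."""
    m = max(values)                              # shift by the max to avoid overflow
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = softmax([1.0, 2.0, 3.0])
print(probabilities)
print(sum(probabilities))
```

The largest input receives the largest probability, and the order of the inputs is preserved in the outputs.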

Softplus

The Softplus function is a 'smooth' approximation of the ReLU function. This 'smooth' (or soft) aspect means that the function is differentiable everywhere, whereas ReLU has a kink at 0.

In fact, this function is interesting mainly for its derivative. When we differentiate Softplus, we obtain the logistic function f(x) = 1/(1+exp(-x)), that is, the sigmoid. Recall that the derivative is used during backpropagation to update the weights.

It was used to constrain the result of a layer to always be positive, but has largely been replaced by ReLU, which is piecewise linear and therefore much faster to compute.

Softplus_function(x) = log(exp(x) + 1)

Softplus function
tf.keras.activations.softplus(x)
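The claim that the derivative of Softplus is the logistic function can be checked numerically. This is a minimal plain-Python sketch; the finite-difference helper is only there for the check.

```python
import math


def softplus(x):
    return math.log(math.exp(x) + 1.0)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def numerical_derivative(f, x, eps=1e-6):
    """Central finite difference approximation of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


for x in (-2.0, 0.0, 3.0):
    print(x, numerical_derivative(softplus, x), sigmoid(x))
```

The two columns agree to several decimal places. For large positive x, softplus(x) is also very close to x, which is the sense in which it approximates ReLU.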

Softsign

The Softsign function is useful for normalizing our data because it returns a result between -1 and 1 while preserving the sign of the data (positive or negative). In other words, the data is centered on zero and bounded by -1 and 1.

It is in fact a smooth version of the sign function (hence Softsign), and it is therefore differentiable, as backpropagation requires.

Softsign_function(x) = x / (abs(x) + 1)

Softsign function
tf.keras.activations.softsign(x) 
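A scalar sketch of the Softsign formula in plain Python shows the two properties mentioned above: the sign is kept and the result stays strictly between -1 and 1.

```python
def softsign(x):
    """Softsign: x / (|x| + 1)."""
    return x / (abs(x) + 1.0)


for x in (-1000.0, -3.0, 0.0, 3.0, 1000.0):
    print(x, softsign(x))
```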

tanh

The tanh function is simply the hyperbolic tangent function.

It is in fact a rescaled and shifted version of the sigmoid function, since tanh(x) = 2 * sigmoid(2x) - 1:

  • sigmoid gives a result between 0 and 1
  • tanh gives a result between -1 and 1

The advantage of tanh is that negative inputs are mapped to clearly negative outputs, whereas with sigmoid negative inputs are mapped to values close to zero.

Like Sigmoid, this function is used in binary classification. For example, for our classification of movie reviews, the closer the value returned by tanh is to 1, the more the model considers the review to be positive; the closer it is to -1, the more the review is considered negative.

Because its output is centered on zero, tanh often works better than the sigmoid function in intermediate layers.

tanh_function(x) = sinh(x)/cosh(x)

tanh_function(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))

tanh function
tf.keras.activations.tanh(x)
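The relation between tanh and sigmoid can be verified with a few lines of plain Python. The identity tanh(x) = 2 * sigmoid(2x) - 1 is a standard result; the helper name tanh_from_sigmoid is chosen only for this sketch.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh_from_sigmoid(x):
    """tanh rebuilt from sigmoid: stretch the output to (-1, 1) and rescale the input."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


for x in (-3.0, -0.5, 0.0, 2.0):
    print(x, tanh_from_sigmoid(x), math.tanh(x))
```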

ELU

The Exponential Linear Unit (ELU) function is an improvement on ReLU because it produces smooth, non-zero values when x < 0.

When x < 0, ELU takes negative values instead of 0 (which is not the case with ReLU). This brings the mean of the activations closer to zero.

A mean closer to zero allows faster learning because it brings the calculated gradient closer to the natural gradient (a concept that deserves a whole article).

Moreover, the more x decreases, the more ELU saturates towards a negative value (-alpha). This saturation means that ELU has a small derivative in that region, which reduces the variation of the result and thus the information that is propagated to the next layer.

ELU_function(x) =

  • if x > 0: x
  • if x <= 0: alpha * (exp(x) - 1)

with :

  • alpha > 0
ELU Function – Exponential Linear Unit
tf.keras.activations.elu(x, alpha=1.0)
  • alpha: a scalar that controls the slope of ELU when x < 0, as well as the value (-alpha) towards which it saturates. The larger alpha is, the steeper the curve. This scalar must be greater than 0 (alpha > 0).
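The ELU definition above translates directly into a scalar plain-Python sketch, which also shows the saturation towards -alpha for very negative inputs.

```python
import math


def elu(x, alpha=1.0):
    """ELU: x for positive inputs, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


for x in (2.0, 0.0, -1.0, -10.0):
    print(x, elu(x), elu(x, alpha=2.0))
```

Positive inputs are unchanged whatever alpha is; only the negative side of the curve depends on it.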

SELU

The Scaled Exponential Linear Unit (SELU) is a scaled variant of ELU.

The principle is the same as with ELU. We only multiply the result of ELU by a scalar. It could be written like this: SELU_function(x) = scale * ELU_function(x).

More precisely:

SELU_function(x) =

  • if x > 0: scale * x
  • if x <= 0: scale * alpha * (exp(x) - 1)

with the constants:

  • alpha = 1.67326324
  • scale = 1.05070098

alpha and scale are predefined, so we can't change them, but the important thing to understand here is that scale is greater than 1. This makes the slope of SELU for x > 0 greater than 1, which avoids some problems when calculating the gradient.

SELU function – Scaled Exponential Linear Unit
tf.keras.activations.selu(x)

This function has one particularity: when using it, we must initialize the weights with 'lecun_normal', as follows:

model.add(tf.keras.layers.Dense(64, kernel_initializer='lecun_normal', activation='selu'))
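The SELU formula and its two constants can be sketched in plain Python on scalar values. The constants are the ones given above; the function itself is an illustration, not the TensorFlow implementation.

```python
import math

ALPHA = 1.67326324
SCALE = 1.05070098


def selu(x):
    """SELU: scale * ELU(x) with the predefined alpha and scale."""
    return SCALE * x if x > 0 else SCALE * ALPHA * (math.exp(x) - 1.0)


print(selu(1.0))                # equals SCALE: the slope for x > 0 is greater than 1
print(selu(2.0) - selu(1.0))    # same slope further along the positive side
print(selu(-10.0))              # approaches -SCALE * ALPHA (about -1.758)
```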

Customized activation functions

In research or experimental work, the predefined activation functions may not be enough for you.

In particular, you may want to create new activation functions if the ones you are using do not produce the expected result or if you want to trigger specific transformations on your data.

To create an activation function, you have to remember two things:

  • an activation function must be non-linear, i.e. not of the form f(x) = ax+b, whose graph is a straight line (the sigmoid and exponential functions are non-linear)
  • an activation function takes a tensor as input, so if we want to use the exponential function in Python, we must not use math.exp(x) (where x is a real number) but tensorflow.math.exp(x) (where x is a tensor)

First example, with TensorFlow and the exponential function:

We create the function:

import tensorflow as tf

def ma_fonction(x, beta=1.0):
    # Custom activation built only from tensor operations (tf.math.exp, not math.exp)
    return x * (beta * x) / tf.math.exp(-x)

Then we add it after the desired layer:

from tensorflow.keras import layers

model.add(layers.Dense(16))
model.add(layers.Activation(ma_fonction))

Second example, with Keras, customizing the sigmoid function:

We create the function:

from keras import backend as K

def ma_fonction(x, beta=1.0):
    return x * K.sigmoid(beta * x)

Then we add it after the desired layer:

from keras import layers

model.add(layers.Dense(16))
model.add(layers.Activation(ma_fonction))

Keep in mind that if you save a model with a custom activation function, you will have to import the activation function you created in any other program that reloads the model (for example by passing it through the custom_objects argument of load_model), otherwise the model will not work!

Which function for which case?

As we have seen previously, several activation functions can be used in a model according to the user’s choice.

Nevertheless, the last activation function is essential because it is the one that produces the result.

Thus, it is necessary to choose the right activation function according to the type of problem we are dealing with.

We will not take a function that returns a probability if our problem is to predict the future value of a share on the stock market.

This is why we provide the table below, which shows which activation function to use in the last layer of your model depending on the type of problem.

Type of problem                                       | Last layer activation function
Binary classification                                 | sigmoid
Multiclass classification, single label               | softmax
Multiclass classification, multilabel                 | sigmoid
Regression to arbitrary values                        | none (indeed!)
Regression to values between 0 and 1 (probabilities)  | sigmoid

Which activation for which problem?

Of course, this table is not an absolute rule, but it is a guide for most cases.

More complex activation functions are available. They are called "advanced activation layers". The best known are PReLU and LeakyReLU, which are available as layers: tf.keras.layers.PReLU and tf.keras.layers.LeakyReLU.

If you want to know more about multiclass, multilabel classification but especially know the difference between a label and a class…

… feel free to follow this article ! 😉

Tom Keldenich

Data Engineer & passionate about Artificial Intelligence !

Founder of the website Inside Machine Learning
