In **this article** we will see **in detail** what an **activation function** is and how it is used in a **Deep Learning model!**

**Do you remember?** In this article (for French speakers) we saw that **all the neurons** of a Deep Learning model **apply a transformation** to the data they receive as input… Well, this transformation **consists of** applying the weights (for French speakers) and then **the activation function**.

**The user** chooses the **activation function** that will be **applied** in the neurons of a layer. For example, with **this code** we create a *Dense* **layer** with the sigmoid function as **activation function**:

`model.add(layers.Dense(1, activation='sigmoid'))`

## What is an activation function?

### Change the perspective

The activation function is **primarily used to modify** the data **in a non-linear way**. This non-linearity lets us **modify the spatial representation of the data**.

Simply put, the **activation function** allows us to **change the way we see the data**.

For example, if our data is: **each week 50% of the customers of a store buy chocolate bars**, the activation function could change this data into **50% of the customers like chocolate** or **50% of the customers plan to buy chocolate each week**.

The **change in representation** can, in our example, allow us to **adjust the marketing strategy** or **the inventory holdings** of our store.

Since a model is composed of **multiple layers**, and thus **multiple activation functions**, **successive and complex changes of representation** take place. This gives us **a new point of view on our data** that humans would be unable to obtain in a short time.

### Further clarifications

**We must not confuse** the activation function with the loss function (for French speakers). The loss function is applied to the whole model and is therefore **unique**; it is used to **measure the performance** of the model.

On the contrary, the **activation function** is specific to **each layer**; it is used to **transform** the data.

The **particularity** of this activation function is that it is **non-linear**. This non-linearity makes it possible to **change the representation of the data**, to have a **new approach** to this data. Such a change of representation would not be possible with a linear transformation.

**Each neuron** of a layer will **apply the activation function** of the layer to the data. This **transformation** will be **different for each neuron** because each one has **different weights**.

**Keep in mind**: the loss function is applied to the **whole model**, and it therefore contains the activation functions. So when we **calculate the gradient** (the derivative of the loss function) we also **calculate the derivatives of the activation functions**.

## The different activation functions

In Python, **activation functions** are used with *Activation(activations.activation_function)*. An **example** with the *relu* function:

```
from tensorflow.keras import layers
from tensorflow.keras import activations
model.add(layers.Dense(64))
model.add(layers.Activation(activations.relu))
```

The **activation function** can also be passed **directly** when creating **the neural layer**:

`model.add(layers.Dense(64, activation='relu'))`

Depending on **the problem to be solved** (classification, regression, …) we use **different activation functions**.

To choose the **right activation function**, we have to consider both the direct transformation it applies to the data and its derivative, which will be used to adjust the weights during backpropagation.

### ReLU

The **Rectified Linear Unit** (**ReLU**) function is **the simplest** and **most used** activation function.

It gives **x** if **x is greater than 0**, **0 otherwise**. In other words, it is the **maximum between x and 0**:

ReLU_function(x) = max(x, 0)

This function allows us to apply a **filter** to our data. **It lets positive values** (x > 0) pass through to the **following layers** of **the neural network**. It is **used almost everywhere**, but **not in the final layer**: it is used in the **intermediate layers**.

`tf.keras.activations.relu(x, alpha=0.0, max_value=None, threshold=0) `

- x: input data (a tensor)
- alpha: a real number that governs the slope for values below the threshold
- max_value: a real number that sets the saturation threshold (the largest value the function will return)
- threshold: a real number giving the threshold value of the activation function, below which values will be damped or set to zero
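
For example, here is a quick check (with sample values chosen for illustration) of how ReLU filters out negative values:

```
import tensorflow as tf

# Sample values chosen for illustration: negatives are set to 0, positives pass
x = tf.constant([-3.0, -1.0, 0.0, 2.0, 5.0])
print(tf.keras.activations.relu(x).numpy())  # [0. 0. 0. 2. 5.]
```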

### Sigmoid

The **Sigmoid** function gives a **value between 0 and 1**, a probability. It is therefore widely used for **binary classification**, when a **model** must **choose between only two labels**.

Thus, for the classification of movie reviews, **the closer the value** returned by Sigmoid **is to 1**, the more the **model considers** that the **review is positive**.

On the contrary, **the closer it is to 0**, the more the review is **considered negative**.

Sigmoid_function(x) = 1 / (1 + exp(-x))

The **Sigmoid function** is very **simple** to apply in Python because there is **no parameter** other than the **input variable**:

`tf.keras.activations.sigmoid(x)`
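
As a small sketch (sample values chosen for illustration), we can check that the outputs stay between 0 and 1, with 0 mapped to 0.5:

```
import tensorflow as tf

# Sigmoid squashes any real number into the interval (0, 1)
x = tf.constant([-2.0, 0.0, 2.0])
print(tf.keras.activations.sigmoid(x).numpy())  # ~[0.119 0.5 0.881]
```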

### Softmax

The **Softmax** function **transforms a real vector** into a **probability vector**.

It is **often used in the final layer** of a **classification** model, especially for **multiclass problems**.

In the Softmax function, **each vector** is **processed independently**. The axis argument defines the **input axis** on which the **function is applied**.

Softmax_function(x_i) = exp(x_i) / sum_j(exp(x_j))

`tf.keras.activations.softmax(x, axis=-1)`

- axis: Integer, axis along which the softmax normalization is applied.
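
A minimal sketch (class scores chosen for illustration) showing that Softmax turns a vector of scores into probabilities that sum to 1:

```
import tensorflow as tf

# One sample with three class scores; Softmax turns them into probabilities
logits = tf.constant([[1.0, 2.0, 3.0]])
probs = tf.keras.activations.softmax(logits)
print(probs.numpy())                 # ~[[0.09 0.245 0.665]]
print(tf.reduce_sum(probs).numpy())  # 1.0
```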

### Softplus

The **Softplus** function is a **‘smooth’ approximation** of the **ReLU** function. This ‘**smooth**’ (or soft) aspect implies that the function is **differentiable**.

In fact, this function is interesting for its **derivative**. When we **differentiate Softplus**, we obtain the **logistic function** f(x) = 1/(1+exp(-x)). Recall that the **derivative** is used during **backpropagation to update the weights**.

It was used to **constrain the result** of a layer to always be **positive**, but it has largely been replaced by **ReLU**, which is **piecewise linear** and therefore **much faster to compute**.

Softplus_function(x) = log(exp(x) + 1)

`tf.keras.activations.softplus(x)`
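
We can verify this property numerically (a small sketch, with an input value chosen for illustration): the gradient computed by TensorFlow matches the logistic function:

```
import tensorflow as tf

x = tf.constant([0.5])
with tf.GradientTape() as tape:
    tape.watch(x)
    y = tf.keras.activations.softplus(x)
# The derivative of Softplus is the logistic (sigmoid) function
print(tape.gradient(y, x).numpy())              # ~[0.622]
print(tf.keras.activations.sigmoid(x).numpy())  # same value
```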

### Softsign

The **Softsign** function is useful to **normalize our data** because it produces a result **between -1 and 1** and **preserves the sign of the data (positive or negative)**. In other words, the data is **centered on zero** and **bounded by -1 and 1**.

In fact, it is a **smooth** version of the sign function (**Softsign**) and therefore **differentiable** (as backpropagation requires).

Softsign_function(x) = x / (abs(x) + 1)

`tf.keras.activations.softsign(x) `
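
A quick check (sample values chosen for illustration): even very large inputs stay bounded between -1 and 1 and keep their sign:

```
import tensorflow as tf

# Softsign keeps the sign and bounds the result between -1 and 1
x = tf.constant([-100.0, -1.0, 0.0, 1.0, 100.0])
print(tf.keras.activations.softsign(x).numpy())  # ~[-0.99 -0.5 0. 0.5 0.99]
```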

### tanh

The **tanh** function is simply the **hyperbolic tangent** function.

It is in fact a mathematically **shifted version** of the sigmoid function:

- **sigmoid** gives a result between **0 and 1**
- **tanh** gives a result between **-1 and 1**

The **advantage of tanh** is that **negative inputs** are clearly **mapped** to **negative** outputs whereas, with **sigmoid**, **negative inputs** can be **confused** with **near-zero** values.

This function is, **like Sigmoid**, used for **binary classification**. For example, for our classification of movie reviews, **the closer** the value returned by tanh **is to 1**, the more the **model considers** that the **review is positive**; **the closer it is to -1**, the more it is **considered negative**.

**Tanh works better** than the sigmoid function **in most cases**.

tanh_function(x) = sinh(x)/cosh(x)

tanh_function(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))

`tf.keras.activations.tanh(x)`
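
As a small sketch (sample values chosen for illustration), we can check numerically that tanh is indeed a shifted and rescaled sigmoid:

```
import tensorflow as tf

x = tf.constant([-1.0, 0.0, 1.0])
print(tf.keras.activations.tanh(x).numpy())  # ~[-0.762 0. 0.762]
# tanh is a shifted and rescaled sigmoid: tanh(x) = 2 * sigmoid(2x) - 1
print((2.0 * tf.keras.activations.sigmoid(2.0 * x) - 1.0).numpy())  # same values
```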

### ELU

The **Exponential Linear Unit** (**ELU**) function is an **improvement of ReLU** because it produces **smooth values** when x < 0.

When x < 0, ELU produces **negative values different from 0** (which is not the case for ReLU). This brings **the mean of the function closer to zero**.

A mean **closer to zero** allows **faster learning** because it brings the calculated gradient closer to the **natural gradient** (a concept that deserves a whole article).

Moreover, the **more x decreases**, the more ELU saturates to a **negative value**. This saturation implies that ELU has a small derivative, which **decreases the variation of the result** and thus the information that is **propagated to the next layer**.

ELU_function(x) =

- if x > 0: x
- if x < 0: alpha * (exp(x) - 1)

with:

- alpha > 0

`tf.keras.activations.elu(x, alpha=1.0)`

**alpha**: a **scalar**, a **variable**, which **controls the slope of ELU** when **x < 0**. The larger **alpha** is, the **steeper the curve**. This scalar must be **greater than 0** (**alpha > 0**).
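
A quick check of the negative saturation (sample values chosen for illustration): with alpha = 1.0, large negative inputs approach -1 instead of being cut to 0:

```
import tensorflow as tf

# Negative inputs saturate smoothly towards -alpha instead of being cut to 0
x = tf.constant([-10.0, -1.0, 0.0, 1.0])
print(tf.keras.activations.elu(x, alpha=1.0).numpy())  # ~[-1. -0.632 0. 1.]
```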

### SELU

The **Scaled Exponential Linear Unit** (**SELU**) is an optimization of **ELU**.

The **principle** is the same as with **ELU**. We only multiply the **result of ELU** by a **scalar**. It could be written like this: SELU_function(x) = scale * ELU_function(x).

More **precisely**:

SELU_function(x) =

`if x > 0: return scale * x`

`if x < 0: return scale * alpha * (exp(x) - 1)`

with, as **constants**:

- alpha = 1.67326324
- scale = 1.05070098

**alpha** and **scale** are **predefined**, so we can’t change them, but the **important thing to understand** here is that **scale is greater than 1**. This makes the **slope of SELU** for x > 0 greater than 1 and avoids **some problems** (such as vanishing gradients) when calculating the gradient.

`tf.keras.activations.selu(x)`

This **function** has one **particularity**: when using it, we must **initialize the weights** with ‘lecun_normal’ as follows:

`model.add(tf.keras.layers.Dense(64, kernel_initializer='lecun_normal', activation='selu'))`
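
As a small sketch (sample values chosen for illustration), we can check that SELU is indeed ELU multiplied by the scale constant:

```
import tensorflow as tf

x = tf.constant([-2.0, 0.0, 2.0])
print(tf.keras.activations.selu(x).numpy())  # ~[-1.52 0. 2.101]
# Same result as scaling ELU with the SELU constants
print((1.05070098 * tf.keras.activations.elu(x, alpha=1.67326324)).numpy())
```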

### Customized activation functions

In **research** or **experimental work**, it is possible that the **predefined activation functions** are not enough for you.

In particular, you may want to **create new activation functions** if the ones you are using do not produce the **expected result** or if you want to **trigger specific transformations** on your data.

To **create** an activation function you have to **remember two things**:

- an activation function must be **non-linear**, i.e. not of the form f(x) = ax+b, which is a straight line (the **sigmoid** and **exponential** functions are **non-linear**)
- an activation function takes a **tensor as input**, so if we want to use the exponential function in Python, we must not use math.exp(x) (because there x is a real number) but tensorflow.math.exp(x) (where x is a tensor)

First example with **Tensorflow** and the **exponential** function:

We **create the function**

```
import tensorflow as tf

# A custom non-linear activation built from tensor operations
def ma_fonction(x, beta=1.0):
    return x * (beta * x) / tf.math.exp(-x)
```

Then, we **add it to the desired layer**:

```
from tensorflow.keras.layers import Activation
model.add(layers.Dense(16))
model.add(Activation(ma_fonction))
```

Second example with **Keras**, **customizing** the **sigmoid function** (this is known as the *swish* function):

We **create the function**

```
from keras import backend as K

def ma_fonction(x, beta=1.0):
    return x * K.sigmoid(beta * x)
```

Then, we **add it to the desired layer**:

```
from tensorflow.keras.layers import Activation
model.add(layers.Dense(16))
model.add(Activation(ma_fonction))
```

**Keep in mind** that if you **save a model** with a **custom activation function**, to reuse it in another program you will have to **import the activation function** you created for the model to work!
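
A minimal sketch of what this looks like in practice (the file name my_model.h5 is assumed for illustration), using the custom_objects argument of load_model:

```
from tensorflow import keras

# Save a model that uses our custom activation (file name assumed)
model.save('my_model.h5')

# When reloading, the custom activation must be passed back explicitly
reloaded = keras.models.load_model(
    'my_model.h5',
    custom_objects={'ma_fonction': ma_fonction}
)
```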

## Which function for which case?

As we have seen previously, **several activation functions** can be used in a model **according to the user’s choice**.

Nevertheless, **the last activation function is essential** because it is the one that **produces the result**.

Thus, it will be necessary to choose **the right activation function** according to **the type of problem** we are dealing with.

We would not choose a **function returning a probability** if our problem is to predict **the future value of a share on the stock market**.

This is why we have **provided you with this table**, so you know **which activation function to use in the last layer of your model depending on the type of problem**.

| Type of problem | Last layer activation function |
| --- | --- |
| Binary classification | sigmoid |
| Multiclass classification, single label | softmax |
| Multiclass classification, multilabel | sigmoid |
| Regression to arbitrary values | none (indeed!) |
| Regression to values between 0 and 1 (probabilities) | sigmoid |

Of course, **this table** is not an absolute rule, but it is **a guide for most cases**.

**More complex** activation functions are **available**. They are called “**advanced activation layers**”. The best known are the **PReLU** and **LeakyReLU** functions, which can be found in the module *tf.keras.layers.advanced_activations*.

If you want to **know more about multiclass, multilabel** classification, and especially the difference between a label and a class…

… feel free to follow this article! 😉

Photo sources:

- Ashim D’Silva on Unsplash
- Dimitri Simon on Unsplash
- Mark Basarab on Unsplash