What is Quantization and how to use it with TensorFlow

In this article, we’ll look at what quantization is and how you can use it with TensorFlow to improve and accelerate your models.

Since the release of ChatGPT, researchers have been striving to shrink the ever-growing storage space used by neural networks.

Numerous optimization techniques have been discussed and experimented with to reduce the size of LLMs while preserving their performance.

A major technique has emerged and caught the eye of the Artificial Intelligence world: quantization.

Here, I invite you to discover what quantization is and learn how to use it in TensorFlow.

What is Quantization

The principle of Quantization

Quantization in Deep Learning is the process of reducing the numerical precision of a model's weights in order to shrink its size and speed up its predictions.

In other words, it allows us to obtain a faster model by reducing the space needed to store it.

The weights of a Deep Learning model are generally stored as 32-bit numbers. To reduce them, and optimize their storage, it is possible to convert them to 16 bits, 8 bits or even fewer.

For example, the value 0.123456789 can be represented on 32 bits, but not on 16 bits: there simply aren't enough bits to keep all of its digits.

To represent this value on 16 bits, some precision has to be given up. The 16-bit version of 0.123456789 is approximately 0.1235.

This value represents 0.123456789 less precisely, but it takes up half the storage space. The value is then said to be “quantized”.
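If you want to see this loss of precision for yourself, here's a quick sketch using NumPy's float32 and float16 types (NumPy is only used here for illustration, the idea is the same for model weights):

import numpy as np

value = 0.123456789

# The same value stored with two different precisions.
print(np.float32(value))  # 0.12345679 -> 32 bits, almost exact
print(np.float16(value))  # 0.1235     -> 16 bits, less precise but half the size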

Pros & cons of Quantization

Applied to the weights of a neural network, quantization reduces its size.

Quantization reduces the size of a neural network

Thanks to this reduced size, the model is faster. Indeed, operations to make predictions require less time in 16-bit than in 32-bit.

With the TensorFlow API's default settings, the model size is divided by four (the 32-bit weights become 8-bit integers). What's more, we generally see an improvement in CPU latency of 1.5 to 4 times.
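As a rough back-of-the-envelope illustration of that factor of four (the exact figures depend on the model, its layer types and the metadata stored in the file; the 1 million weights below are purely hypothetical):

n_weights = 1_000_000  # a hypothetical model with 1 million weights

print("float32:", n_weights * 4 / 2**20, "MB")  # 4 bytes per weight -> ~3.8 MB
print("int8:   ", n_weights * 1 / 2**20, "MB")  # 1 byte per weight  -> ~0.95 MB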

However, this reduction also comes at a cost: the accuracy of predictions is impacted.

So, while a quantized model will be faster, it will often be less accurate.

Note: even if the model will be less accurate after quantization, the impact is usually negligible.

Model          | Non-quantized Top-1 Accuracy | 8-Bit Quantized Accuracy
Nasnet-Mobile  | 74%                          | 73%
Resnet-v2 50   | 75.6%                        | 75%

Example of the impact of quantization on the accuracy of TensorFlow models – source

When deploying the model, you'll then need to decide which factor matters most for your use case: precision or speed.

If speed is the priority, quantization is a simple way to accelerate a Deep Learning model.

The two types of quantization

Post-Training Quantization

Post-Training Quantization (PTQ) consists in quantizing the weights of a model that has already been trained.

Indeed, it's possible to reduce the size of a model after its weights have already been fitted to an objective.

For example, let's say you have an application using a Deep Learning model. If you want to increase its speed, you can extract the model's weights and quantize them.

Reducing AI app execution time through quantization

An already-trained model can easily be quantized. However, this approach is not optimal.

Another way is to transform a model before training, so that it learns to cope with quantization.

Quantization Aware Training

Quantization Aware Training (QAT) makes it possible to integrate quantization into the model learning process.

During training, QAT mimics the quantization process that will occur during inference.

This means that the model is trained while simulating the conditions it will encounter once quantized.
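To give an intuition of what “simulating quantization” means, here is a minimal sketch (not TensorFlow's actual implementation) of the fake-quantization step that QAT inserts into the forward pass: values are rounded to 8-bit integers and immediately converted back to floats, so the model learns with the rounding error already present:

import numpy as np

def fake_quantize(x, num_bits=8):
  # Quantize to num_bits integers, then dequantize back to floats,
  # so the rest of the network "feels" the rounding error.
  qmin, qmax = 0, 2 ** num_bits - 1
  scale = (x.max() - x.min()) / (qmax - qmin)
  zero_point = -x.min() / scale
  quantized = np.clip(np.round(x / scale + zero_point), qmin, qmax)
  return (quantized - zero_point) * scale

weights = np.random.randn(5).astype(np.float32)
print(weights)
print(fake_quantize(weights))  # close to the original weights, but slightly off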

The aim of QAT is to prepare a model in such a way that the tools used later for quantization can easily and efficiently convert this model into a quantized version.

Thus, models using QAT are generally more accurate than those using PTQ. QAT is therefore the preferred approach.

Indeed, with PTQ, the model is not specifically optimized to work with reduced accuracy. Quantization after training can therefore lead to a greater loss of performance than with QAT.

QAT prepares Deep Learning models for quantization. If you want to quantize a model and have both options open to you, then I recommend using QAT.

When is a Deep Learning model quantized?

During QAT, the model is not actually quantized. More precisely, training only takes the future quantization into account.

With TensorFlow, quantization takes place when the model is converted to TensorFlow Lite format.

Note: TensorFlow Lite is TensorFlow's framework for deploying Machine Learning models on mobile and edge devices.

During training, the model is therefore not quantized.

To perform quantization, you’ll need to convert the model from TensorFlow format to TensorFlow Lite format.

I’ll show you how in the next section!

How to use Quantization

Converting a Basic Model into a Quantized Model – PTQ

In this article, we’ll concentrate on QAT, as it offers a better basis of comparison for assessing quantization performance (see next section).

However, if you’d like to use PTQ, here’s the code you’ll need:

import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quant_model = converter.convert()

Note: saved_model_dir is the path to the folder where your model is stored.

Here, we use the TFLiteConverter module to convert the TensorFlow model to TensorFlow Lite format. Then converter.optimizations = [tf.lite.Optimize.DEFAULT] is used to quantize our model.

That’s how to transform a basic model into a quantized model in just 4 lines of code.
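If you want to keep the result, you can then write the converted model to a .tflite file (the filename below is just an example):

# Save the quantized model to disk.
with open('quantized_model.tflite', 'wb') as f:
  f.write(tflite_quant_model)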

Now let’s move on to the core tutorial of this article: Quantization Aware Training.

Training a Model from Scratch with Quantization – QAT

Here, I’m going to show you how to quantize a model from scratch.

To do this, we’re going to train a Deep Learning model on data using QAT. Then we’ll quantize it and compare its performance with the base model.

First, we load some data (for this example I’m using the MNIST dataset) and define the structure of our model:

import tensorflow as tf

# Load MNIST dataset
mnist = tf.keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Normalize the input images so that each pixel value is between 0 and 1.
train_images = train_images / 255.0
test_images = test_images / 255.0

# Define the model architecture.
model = tf.keras.Sequential([
  tf.keras.layers.InputLayer(input_shape=(28, 28)),
  tf.keras.layers.Reshape(target_shape=(28, 28, 1)),
  tf.keras.layers.Conv2D(filters=12, kernel_size=(3, 3), activation='relu'),
  tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
  tf.keras.layers.Flatten(),
  tf.keras.layers.Dense(10)
])

Next, we can install the tensorflow-model-optimization library, which will enable us to quantize our model:

pip install -q tensorflow-model-optimization

Note: if you want to use this command directly in a Python notebook, simply add a ! at the beginning of the line.

Next, we can make our model quantization aware with tfmot.quantization.keras.quantize_model:


import tensorflow_model_optimization as tfmot

quantize_model = tfmot.quantization.keras.quantize_model

q_aware_model = quantize_model(model)

Now we have two models: model and q_aware_model.

I suggest we train them and compare their performance.

Let’s start with the model (non-quantized):

model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy']
              )

model.fit(
  train_images,
  train_labels,
  epochs=10,
  validation_split=0.1,
)

Output:
...
Epoch 10/10
1688/1688 [===] - 17s 10ms/step - loss: 0.0335 - accuracy: 0.9900 - val_loss: 0.0560 - val_accuracy: 0.9862

Then we can move on to the q_aware_model:

q_aware_model.compile(optimizer='adam',
                      loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                      metrics=['accuracy']
                      )

q_aware_model.fit(
  train_images,
  train_labels,
  epochs=10,
  validation_split=0.1,
)

Output:
...
Epoch 10/10
1688/1688 [===] - 20s 12ms/step - loss: 0.0350 - accuracy: 0.9894 - val_loss: 0.0575 - val_accuracy: 0.9855

Note: a quantization aware model can be used in the same way as a non-quantized model.

Now that our two models are trained, we can compare their performance:

_, baseline_model_accuracy = model.evaluate(
    test_images, test_labels, verbose=0)

_, q_aware_model_accuracy = q_aware_model.evaluate(
   test_images, test_labels, verbose=0)

print('Baseline test accuracy:', baseline_model_accuracy)
print('Quant test accuracy:', q_aware_model_accuracy)

Output:
Baseline test accuracy: 0.9821000099182129
Quant test accuracy: 0.9833999872207642

Here, the quantization aware model achieves better accuracy than the base model.

This won’t always be the case.

As explained above, quantization is generally expected to cost some accuracy. However, since the two models undergo a different training process, a higher accuracy can sometimes be achieved with the quantization aware model.

Nevertheless, most of the time, the quantization aware model will achieve a lower accuracy.

Quantization

Finally, we can quantize the model.

To do this, simply convert it to TensorFlow Lite format using the TFLiteConverter module. Then specify converter.optimizations = [tf.lite.Optimize.DEFAULT].

In the following code, both models are converted to TensorFlow Lite format. One will be non-quantized, the other will be quantized:

# Create float TFLite model.
float_converter = tf.lite.TFLiteConverter.from_keras_model(model)
float_tflite_model = float_converter.convert()

# Create quant TFLite model.
converter = tf.lite.TFLiteConverter.from_keras_model(q_aware_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
quantized_tflite_model = converter.convert()

Note: here I’m using default quantization, but you can experiment with other approaches by reading the TensorFlow documentation.
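For instance, if you also want the activations to be quantized (full integer quantization), the converter needs a representative dataset to calibrate their ranges. Here is a sketch of what that could look like with our MNIST data, applied to the float model and following the converter options described in the TensorFlow documentation:

import numpy as np

# Sketch of full integer quantization: the representative dataset lets the
# converter calibrate the value ranges of the activations.
def representative_data_gen():
  for image in train_images[:100]:
    yield [np.expand_dims(image, axis=0).astype(np.float32)]

int8_converter = tf.lite.TFLiteConverter.from_keras_model(model)
int8_converter.optimizations = [tf.lite.Optimize.DEFAULT]
int8_converter.representative_dataset = representative_data_gen
int8_tflite_model = int8_converter.convert()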

Size performance

To compare the two models, we can save them in .tflite format and display their size:

import os

# Measure sizes of models.
with open('model.tflite', 'wb') as f:
  f.write(float_tflite_model)

with open('quant_model.tflite', 'wb') as f:
  f.write(quantized_tflite_model)

model_size = os.path.getsize('model.tflite') / float(2**20)
quant_model_size = os.path.getsize('quant_model.tflite') / float(2**20)

print("Float model in Mb:", model_size)
print("Quantized model in Mb:", quant_model_size)

Output:
Float model size in MB: 0.08089065551757812
Quantized model size in MB: 0.0238037109375

Quantization has reduced the size of the model by a factor of about 3.4, close to the expected 4x (part of the file is graph metadata that isn't quantized)!

This achievement is even more striking when displayed on a graph:

Storage space comparison – Basic model vs. Quantized model

The model is now quantized.

But questions remain: after quantization, what is the model’s performance? Has it been drastically reduced?

Let’s find out.

Accuracy performance

In this code, I create a function to evaluate a TensorFlow Lite model (the process is different from that of a classic TensorFlow model):

import numpy as np
import time

def evaluate_tflite_model(model, test_data, test_labels):
  interpreter = tf.lite.Interpreter(model_content=model)
  interpreter.allocate_tensors()

  # The model has a single input and a single output.
  input_details = interpreter.get_input_details()[0]
  output_details = interpreter.get_output_details()[0]

  prediction_digits = []

  start = time.time()

  for sample in test_data:
    interpreter.set_tensor(input_details['index'], np.expand_dims(sample, axis=0).astype(np.float32))
    interpreter.invoke()
    output = interpreter.get_tensor(output_details['index'])
    prediction_digits.append(np.argmax(output[0]))

  inference_time = time.time() - start

  prediction_digits = np.array(prediction_digits)
  accuracy = (prediction_digits == test_labels).mean()

  return inference_time, accuracy



We can then use the function with the quantized model and compare the result with that of the quantization aware model:

res_quant_model = evaluate_tflite_model(quantized_tflite_model, test_images, test_labels)

print('Quant Aware TF test accuracy:', q_aware_model_accuracy)
print('Quant TFLite test accuracy:', res_quant_model[1])

Output:
Quant Aware TF test accuracy: 0.9833999872207642
Quant TFLite test accuracy: 0.9832

We can see that the model lost a tiny amount of accuracy after quantization (0.9834 vs. 0.9832), a negligible difference.

Speed performance

Now we can compare the speed performance of the base model with that of the quantized model.

Here, I run the full test set through each model five times and measure the inference time of each run:

model_infs_time = []
quant_model_infs_time = []
infs_nbr = []

# Run the whole test set through each TFLite model five times.
for i in range(1, 6):
  model_inf_time, _ = evaluate_tflite_model(float_tflite_model, test_images, test_labels)
  quant_model_inf_time, _ = evaluate_tflite_model(quantized_tflite_model, test_images, test_labels)

  model_infs_time.append(model_inf_time)
  quant_model_infs_time.append(quant_model_inf_time)
  infs_nbr.append('Inference_' + str(i))

Next, we can display the inference time of the two models to determine which is faster:

import matplotlib.pyplot as plt
import numpy as np

left = np.arange(len(model_infs_time))
width = 0.3

fig = plt.figure(figsize=(6, 4))
fig.patch.set_alpha(1)

plt.bar(left, model_infs_time, color='#d1aa00', width=width, label="Base Model", align="center")
plt.bar(left + width, quant_model_infs_time, color='#066b8b', width=width, label="Quantized Model", align="center")

plt.xticks(left + width / 2, infs_nbr)
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.ylabel("Inference time in s")
plt.title("Inference performance")
plt.show()

Inference speed comparison – Basic model vs. Quantized model

Note: quantization can improve the inference speed of Deep Learning models, but there are also techniques for improving the training speed of traditional Machine Learning models. If you’d like to read my article on the subject, click here.

The quantized model is faster than the basic model on every inference run.

The average time saved on this example is 1.34 seconds. A valuable time saving for many Deep Learning projects.
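For reference, that average saving can be computed directly from the timings collected above:

# Average time saved per run by the quantized model.
time_saved = np.mean(np.array(model_infs_time) - np.array(quant_model_infs_time))
print("Average time saved in s:", time_saved)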

Quantization makes it possible to reduce the size of a Deep Learning model.

While it comes at the cost of a (usually negligible) drop in accuracy, this technique greatly improves execution time.

Optimizing a Deep Learning model is an important step in any project. In addition to quantization, it’s important that your model is structured correctly to achieve good performance.

If you want to deepen your knowledge in the field, you can access my Action plan to Master Neural networks.

A program of 7 free courses that I’ve prepared to guide you on your journey to master Deep Learning.

If you’re interested, click here:

GET MY ACTION PLAN


Tom Keldenich

Artificial Intelligence engineer and data enthusiast!

Founder of the website Inside Machine Learning
