Extract PDF table to Excel - Best Python Tutorial Quick

In this article, we’ll look at how to quickly extract a table from a PDF to Excel using the Python language.

For this tutorial you will need two Python libraries :

tabula-py
pandas

To install them, go to your terminal/shell and type these lines of code:

pip install tabula-py
pip install pandas

If you use Google Colab, you can install these libraries directly on it. You just have to add an exclamation mark “!” in front of it, like this:

!pip install tabula-py
!pip install pandas

PDF to Excel (one table only) – Python

PDF containing several tables

[smartslider3 slider=”10″]

Photo by Aurelien Romain on Unsplash

PDF to Excel (one table only) – Python

First we load the libraries into our text editor :

import tabula
import pandas as pd

Then, we will read the pdf with the read_pdf() function of the tabula library.

This function automatically detects the tables in a pdf and converts them into DataFrames. Ideal for converting them into Excel files!

df = tabula.read_pdf('file_path/file.pdf', pages = 'all')[0]

We can then check that the table has the expected shape.

df.head()

Then convert it to an Excel file !

df.to_excel('file_path/file.xlsx')

The entire code :

By the way, if your goal is to master Deep Learning - I've prepared the Action plan to Master Neural networks. for you.

7 days of free advice from an Artificial Intelligence engineer to learn how to master neural networks from scratch:

Plan your training
Structure your projects
Develop your Artificial Intelligence algorithms

I have based this program on scientific facts, on approaches proven by researchers, but also on my own techniques, which I have devised as I have gained experience in the field of Deep Learning.

To access it, click here :

GET MY ACTION PLAN

Now we can get back to what I was talking about earlier.

import tabula import pandas as pd
df = tabula.read_pdf('file_path/file.pdf', pages = 'all')[0]
df.to_excel('file_path/file.xlsx')

PDF containing several tables

We load the libraries in our text editor :

import tabula
import pandas as pd

Then, we will read the pdf with the read_pdf() function of the tabula library.

This function automatically detects the tables in a pdf and converts them into DataFrames. Ideal to convert them then in Excel file !

Here, the variable df will be in fact a list of DataFrame. The first element corresponds to the first table, the second to the second table, etc.

df = tabula.read_pdf('file_path/file.pdf', pages = 'all')

To save these tables separately, you will have to use a for loop that will save each table in an Excel file.

for i in range(len(df)):
 df[i].to_excel('file_'+str(i)+'.xlsx')

The entire code :

import tabula
import pandas as pd
df = tabula.read_pdf('file_path/file.pdf', pages = 'all')

for i in range(len(df)):
 df[i].to_excel('file_'+str(i)+'.xlsx')

Using Artificial Intelligence, you’ve managed to extract a table from a PDF into Excel.

But that’s only an ounce of what this technology can achieve!

Today, it’s thanks to Deep Learning that tech leaders can create the most powerful AIs.

If you want to deepen your knowledge in the field, you can access my Action plan to Master Neural networks.

A program of 7 free courses that I’ve prepared to guide you on your journey to learn Deep Learning.

If you’re interested, click here:

GET MY ACTION PLAN

sources:

Medium
Photo by Birger Strahl on Unsplash

One last word, if you want to go further and learn about Deep Learning - I've prepared for you the Action plan to Master Neural networks. for you.

7 days of free advice from an Artificial Intelligence engineer to learn how to master neural networks from scratch:

Plan your training
Structure your projects
Develop your Artificial Intelligence algorithms

To access it, click here :

GET MY ACTION PLAN

3 Comments

Rafael Maffei

20 July 2021 / 22h28 Reply

E para salvar as tabelas no mesmo arquivo .xlsx ??

And to save the tables in the same .xlsx file ??
- Tom Keldenich
  
  21 July 2021 / 18h37 Reply
  
  Hi Rafael,
  
  In order to save multiple .xlsx in the same file you can use the XlsxWriter library !
  
  To install it :
  
  !pip install XlsxWriter
  
  And to you use it, here’s the code :
  
  import tabula import pandas as pd df = tabula.read_pdf('file_path/file.pdf', pages = 'all')
  writer = pd.ExcelWriter('file_multiple_df.xlsx', engine='xlsxwriter') for i in range(len(df)): df[i].to_excel(writer, sheet_name='Sheet'+str(i))
  writer.save()
  
  You’ll have an excel file with multiple Sheet, each containing a dataframe ! 🙂
Kanhu

16 January 2022 / 7h55 Reply

Why the read_pdf is not giving me the same number of columns as the pdf for the following file? If I am right, multiple columns are merged together.
https://drive.google.com/file/d/1VRwU8C6I8KzMVz2IVKPkoqrncCcGZHh-/view?usp=sharing

Extract PDF table to Excel – Best Python Tutorial Quick

PDF to Excel (one table only) – Python

PDF containing several tables

Tom Keldenich

3 Comments

Leave a ReplyCancel Reply

PDF to Excel (one table only) – Python

PDF containing several tables

Tom Keldenich

Related Posts

Decision Tree How to Use It and Its Hyperparameters

Featurization – How to use it to boost your model

How to make Bounding Boxes in Python – Best function

3 Comments

Leave a ReplyCancel Reply