Extract PDF table to Excel – Best Python Tutorial Quick

In this article, we’ll look at how to quickly extract a table from a PDF to Excel using the Python language.

For this tutorial you will need two Python libraries :

  • tabula-py
  • pandas

To install them, go to your terminal/shell and type these lines of code:

pip install tabula-py
pip install pandas

If you use Google Colab, you can install these libraries directly on it. You just have to add an exclamation mark “!” in front of it, like this:

!pip install tabula-py
!pip install pandas

[smartslider3 slider=”10″]

Photo by Aurelien Romain on Unsplash

PDF to Excel (one table only) – Python

First we load the libraries into our text editor :

import tabula
import pandas as pd

Then, we will read the pdf with the read_pdf() function of the tabula library.

This function automatically detects the tables in a pdf and converts them into DataFrames. Ideal for converting them into Excel files!

df = tabula.read_pdf('file_path/file.pdf', pages = 'all')[0]

We can then check that the table has the expected shape.

df.head()

Then convert it to an Excel file !

df.to_excel('file_path/file.xlsx')

The entire code :

By the way, if your goal is to master Deep Learning - I've prepared the Action plan to Master Neural networks. for you.

7 days of free advice from an Artificial Intelligence engineer to learn how to master neural networks from scratch:

  • Plan your training
  • Structure your projects
  • Develop your Artificial Intelligence algorithms

I have based this program on scientific facts, on approaches proven by researchers, but also on my own techniques, which I have devised as I have gained experience in the field of Deep Learning.

To access it, click here :

GET MY ACTION PLAN

GET MY ACTION PLAN

Now we can get back to what I was talking about earlier.

import tabula import pandas as pd
df = tabula.read_pdf('file_path/file.pdf', pages = 'all')[0]
df.to_excel('file_path/file.xlsx')
Photo by Darius Cotoi on Unsplash

PDF containing several tables

We load the libraries in our text editor :

import tabula
import pandas as pd

Then, we will read the pdf with the read_pdf() function of the tabula library.

This function automatically detects the tables in a pdf and converts them into DataFrames. Ideal to convert them then in Excel file !

Here, the variable df will be in fact a list of DataFrame. The first element corresponds to the first table, the second to the second table, etc.

df = tabula.read_pdf('file_path/file.pdf', pages = 'all')

To save these tables separately, you will have to use a for loop that will save each table in an Excel file.

for i in range(len(df)):
 df[i].to_excel('file_'+str(i)+'.xlsx')

The entire code :

import tabula
import pandas as pd
df = tabula.read_pdf('file_path/file.pdf', pages = 'all')

for i in range(len(df)):
 df[i].to_excel('file_'+str(i)+'.xlsx')

Using Artificial Intelligence, you’ve managed to extract a table from a PDF into Excel.

But that’s only an ounce of what this technology can achieve!

Today, it’s thanks to Deep Learning that tech leaders can create the most powerful AIs.

If you want to deepen your knowledge in the field, you can access my Action plan to Master Neural networks.

A program of 7 free courses that I’ve prepared to guide you on your journey to learn Deep Learning.

If you’re interested, click here:

GET MY ACTION PLAN

sources:

One last word, if you want to go further and learn about Deep Learning - I've prepared for you the Action plan to Master Neural networks. for you.

7 days of free advice from an Artificial Intelligence engineer to learn how to master neural networks from scratch:

  • Plan your training
  • Structure your projects
  • Develop your Artificial Intelligence algorithms

I have based this program on scientific facts, on approaches proven by researchers, but also on my own techniques, which I have devised as I have gained experience in the field of Deep Learning.

To access it, click here :

GET MY ACTION PLAN

GET MY ACTION PLAN

Tom Keldenich
Tom Keldenich

Artificial Intelligence engineer and data enthusiast!

Founder of the website Inside Machine Learning

3 Comments

  1. E para salvar as tabelas no mesmo arquivo .xlsx ??

    And to save the tables in the same .xlsx file ??

    • Hi Rafael,

      In order to save multiple .xlsx in the same file you can use the XlsxWriter library !

      To install it :

      !pip install XlsxWriter

      And to you use it, here’s the code :

      import tabula
      import pandas as pd
      df = tabula.read_pdf('file_path/file.pdf', pages = 'all')

      writer = pd.ExcelWriter('file_multiple_df.xlsx', engine='xlsxwriter')

      for i in range(len(df)):
      df[i].to_excel(writer, sheet_name='Sheet'+str(i))

      writer.save()

      You’ll have an excel file with multiple Sheet, each containing a dataframe ! 🙂

Leave a Reply

Your email address will not be published. Required fields are marked *

This page will not stay online forever

Enter your email to receive for free

The PANE method for Deep Learning

* indicates required

 

You will receive one email per day for 7 days – then you will receive my newsletter.
Your information will never be given to third parties.

You can unsubscribe in 1 click from any of my emails.



Entre ton email pour recevoir gratuitement
la méthode PARÉ pour faire du Deep Learning


Tu recevras un email par jour pendant 7 jours - puis tu recevras ma newsletter.
Tes informations ne seront jamais cédées à des tiers.

Tu peux te désinscrire en 1 clic depuis n'importe lequel de mes emails.