Quickly Extract a Table from PDF to Excel with Python

In this article we will see how to quickly extract a table from a PDF to Excel.

For this tutorial you will need two Python libraries:

  • tabula-py
  • pandas

To install them, open your terminal/shell and run these commands:

pip install tabula-py
pip install pandas

If you use Google Colab, you can install these libraries directly in a notebook cell. You just have to add an exclamation mark “!” in front of each command, like this:

!pip install tabula-py
!pip install pandas

PDF to Excel (one table only)

First, we import the libraries in our script:

import tabula
import pandas as pd

Then, we read the PDF with the read_pdf() function of the tabula library.

This function automatically detects the tables in a PDF and returns them as a list of DataFrames, which is ideal for converting them into Excel files. Here we keep only the first table, with [0]:

df = tabula.read_pdf('file_path/file.pdf', pages = 'all')[0]
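
Since read_pdf() returns a list of tables, indexing it with [0] raises an IndexError when no table is detected in the PDF. Here is a minimal sketch of a guard against that case. The helper names first_table and read_first_table are hypothetical and are not part of tabula-py:

```python
def first_table(tables):
    """Return the first extracted table, or fail with a readable message."""
    if not tables:
        raise ValueError("No table was detected in the PDF")
    return tables[0]


def read_first_table(pdf_path):
    """Read a PDF with tabula-py and return its first table as a DataFrame."""
    # Imported here so that first_table() can be used without tabula installed
    import tabula

    tables = tabula.read_pdf(pdf_path, pages="all")
    return first_table(tables)
```

You could then write df = read_first_table('file_path/file.pdf') instead of indexing the list directly.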

We can then check that the table has the expected shape.

df.head()
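
Depending on the PDF, an extracted table may contain rows or columns that are entirely empty. This is an assumption about typical extraction output, not something guaranteed by tabula. A small sketch of how such a table could be cleaned with pandas, using a hand-made DataFrame as a stand-in for the extracted one (drop_empty is a hypothetical helper name):

```python
import pandas as pd


def drop_empty(df):
    """Remove rows and columns in which every cell is missing."""
    cleaned = df.dropna(axis=0, how="all")       # drop fully empty rows
    cleaned = cleaned.dropna(axis=1, how="all")  # drop fully empty columns
    return cleaned.reset_index(drop=True)


# Stand-in for a table returned by tabula.read_pdf()
raw = pd.DataFrame({
    "name": ["a", None, "b"],
    "value": [1.0, None, 2.0],
    "empty": [None, None, None],
})

clean = drop_empty(raw)
print(clean.shape)  # (2, 2)
```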

Then convert it to an Excel file!

df.to_excel('file_path/file.xlsx')
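
By default, to_excel() also writes the DataFrame index (0, 1, 2, ...) as an extra first column. If you do not want it in the Excel file, you can pass index=False. The sketch below uses a hand-made DataFrame and a temporary folder as stand-ins, and assumes the openpyxl package is installed, since pandas relies on it to write and read .xlsx files:

```python
import os
import tempfile

import pandas as pd

# Stand-in for a table returned by tabula.read_pdf()
df = pd.DataFrame({"name": ["a", "b"], "value": [1, 2]})

# Temporary folder used only for this illustration
out_path = os.path.join(tempfile.mkdtemp(), "file.xlsx")

# index=False leaves the row labels out of the Excel file
df.to_excel(out_path, index=False)

# Read the file back to confirm that only the original columns were written
check = pd.read_excel(out_path)
print(list(check.columns))
```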

The entire code:

import tabula
import pandas as pd
df = tabula.read_pdf('file_path/file.pdf', pages = 'all')[0]
df.to_excel('file_path/file.xlsx')

PDF containing several tables

We import the libraries in our script:

import tabula
import pandas as pd

Then, we read the PDF with the read_pdf() function of the tabula library.

This function automatically detects the tables in a PDF and converts them into DataFrames, which is ideal for converting them into Excel files.

Here, the variable df is in fact a list of DataFrames. The first element corresponds to the first table, the second to the second table, and so on.

df = tabula.read_pdf('file_path/file.pdf', pages = 'all')
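
Before saving anything, it can help to look at how many tables were found and how large each one is. A minimal sketch, using a hand-made list of DataFrames as a stand-in for the list returned by read_pdf() (describe_tables is a hypothetical helper name):

```python
import pandas as pd


def describe_tables(tables):
    """Return one (position, rows, columns) tuple per extracted table."""
    return [(i, t.shape[0], t.shape[1]) for i, t in enumerate(tables)]


# Stand-in for the list returned by tabula.read_pdf(..., pages='all')
tables = [
    pd.DataFrame({"a": [1, 2, 3]}),
    pd.DataFrame({"x": [1], "y": [2]}),
]

for position, n_rows, n_cols in describe_tables(tables):
    print(f"table {position}: {n_rows} rows x {n_cols} columns")
```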

To save these tables separately, you will have to use a for loop that will save each table in an Excel file.

for i in range(len(df)):
    df[i].to_excel('file_'+str(i)+'.xlsx')
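
The same loop can be wrapped in small functions so that the output folder and the file prefix are easy to change. This is only a sketch: table_paths and save_tables are hypothetical helper names, and the folder and prefix parameters are assumptions added for illustration:

```python
from pathlib import Path


def table_paths(n_tables, folder=".", prefix="file_"):
    """Build one .xlsx path per table: file_0.xlsx, file_1.xlsx, ..."""
    return [Path(folder) / f"{prefix}{i}.xlsx" for i in range(n_tables)]


def save_tables(tables, folder=".", prefix="file_"):
    """Write each DataFrame of the list to its own Excel file."""
    paths = table_paths(len(tables), folder, prefix)
    for table, path in zip(tables, paths):
        table.to_excel(path, index=False)  # index=False skips the row labels
    return paths
```

Calling save_tables(df) on the list returned by read_pdf() would produce the same file names as the loop above.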

The entire code:

import tabula
import pandas as pd
df = tabula.read_pdf('file_path/file.pdf', pages = 'all')

for i in range(len(df)):
    df[i].to_excel('file_'+str(i)+'.xlsx')

Tom Keldenich

Data Engineer & passionate about Artificial Intelligence!

Founder of the website Inside Machine Learning

3 Comments

  1. And to save the tables in the same .xlsx file?

    • Hi Rafael,

      To save multiple tables in the same .xlsx file, you can use the XlsxWriter library!

      To install it:

      !pip install XlsxWriter

      And to use it, here’s the code:

      import tabula
      import pandas as pd
      df = tabula.read_pdf('file_path/file.pdf', pages = 'all')

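      # One workbook; each table is written to its own sheet
      writer = pd.ExcelWriter('file_multiple_df.xlsx', engine='xlsxwriter')

      for i in range(len(df)):
          df[i].to_excel(writer, sheet_name='Sheet'+str(i))

      # close() writes the file to disk (writer.save() was removed in pandas 2.0)
      writer.close()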

      You’ll have an Excel file with multiple sheets, each containing a DataFrame! 🙂
