Quickly extract Table from PDF to Excel with Python

In this article we will see how to quickly extract a table from a PDF to Excel.

For this tutorial you will need two Python libraries :

  • tabula-py
  • pandas

To install them, go to your terminal/shell and type these lines of code:

pip install tabula-py
pip install pandas

If you use Google Colab, you can install these libraries directly on it. You just have to add an exclamation mark “!” in front of it, like this:

!pip install tabula-py
!pip install pandas

[smartslider3 slider=”10″]

Photo by Aurelien Romain on Unsplash

PDF to Excel (one table only)

First we load the libraries into our text editor :

import tabula
import pandas as pd

Then, we will read the pdf with the read_pdf() function of the tabula library.

This function automatically detects the tables in a pdf and converts them into DataFrames. Ideal for converting them into Excel files!

df = tabula.read_pdf('file_path/file.pdf', pages = 'all')[0]

We can then check that the table has the expected shape.

df.head()

Then convert it to an Excel file !

df.to_excel('file_path/file.xlsx')

The entire code :

import tabula import pandas as pd
df = tabula.read_pdf('file_path/file.pdf', pages = 'all')[0]
df.to_excel('file_path/file.xlsx')
Photo by Darius Cotoi on Unsplash

PDF containing several tables

We load the libraries in our text editor :

import tabula
import pandas as pd

Then, we will read the pdf with the read_pdf() function of the tabula library.

This function automatically detects the tables in a pdf and converts them into DataFrames. Ideal to convert them then in Excel file !

Here, the variable df will be in fact a list of DataFrame. The first element corresponds to the first table, the second to the second table, etc.

df = tabula.read_pdf('file_path/file.pdf', pages = 'all')

To save these tables separately, you will have to use a for loop that will save each table in an Excel file.

for i in range(len(df)):
 df[i].to_excel('file_'+str(i)+'.xlsx')

The entire code :

import tabula
import pandas as pd
df = tabula.read_pdf('file_path/file.pdf', pages = 'all')

for i in range(len(df)):
 df[i].to_excel('file_'+str(i)+'.xlsx')

sources:

Tom Keldenich
Tom Keldenich

Data Engineer & passionate about Artificial Intelligence !

Founder of the website Inside Machine Learning

3 Comments

  1. E para salvar as tabelas no mesmo arquivo .xlsx ??

    And to save the tables in the same .xlsx file ??

    • Hi Rafael,

      In order to save multiple .xlsx in the same file you can use the XlsxWriter library !

      To install it :

      !pip install XlsxWriter

      And to you use it, here’s the code :

      import tabula
      import pandas as pd
      df = tabula.read_pdf('file_path/file.pdf', pages = 'all')

      writer = pd.ExcelWriter('file_multiple_df.xlsx', engine='xlsxwriter')

      for i in range(len(df)):
      df[i].to_excel(writer, sheet_name='Sheet'+str(i))

      writer.save()

      You’ll have an excel file with multiple Sheet, each containing a dataframe ! 🙂

Leave a Reply

Your email address will not be published.

Beginner, expert or just curious?Discover our latest news and articles on Machine Learning

Explore Machine Learning, browse our most recent notebooks and stay up to date with the latest practices and technologies!