In this article we will see how to quickly extract a table from a PDF to Excel.
For this tutorial you will need two Python libraries:
- tabula-py
- pandas
To install them, open your terminal and run these commands:
pip install tabula-py
pip install pandas
If you use Google Colab, you can install these libraries directly in a notebook cell. Just add an exclamation mark “!” in front of each command, like this:
!pip install tabula-py
!pip install pandas
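One thing worth checking after installation: tabula-py is a wrapper around the Java tool tabula-java, so it needs a Java runtime on your machine. Here is a minimal sketch of such a check, using only the standard library (the function name java_available is ours for illustration, not part of tabula):

```python
import shutil

def java_available() -> bool:
    """Return True if a `java` executable can be found on the PATH."""
    return shutil.which("java") is not None

if java_available():
    print("Java found: tabula-py should be able to run.")
else:
    print("Java not found: install a Java runtime before calling tabula.read_pdf().")
```

This only tells you whether a java command is visible on the PATH. It does not check the Java version.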
Photo by Aurelien Romain on Unsplash
PDF to Excel (one table only)
First, we import the libraries:
import tabula
import pandas as pd
Then we read the PDF with the read_pdf() function from the tabula library.
This function automatically detects the tables in a PDF and converts them into DataFrames, which are easy to export to Excel. It returns a list of DataFrames, so we take the first one with [0]:
df = tabula.read_pdf('file_path/file.pdf', pages = 'all')[0]
We can then check that the table has the expected shape.
df.head()
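If the preview shows rows or columns that are completely empty, which can happen depending on the layout of the PDF, you can drop them before exporting. Here is a minimal sketch, using a made-up DataFrame in place of the real one, since the actual content depends on your file (drop_empty is a hypothetical helper name):

```python
import pandas as pd

# Stand-in for the table returned by tabula.read_pdf(...)[0]; the values are made up.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", None],
    "Score": [12.0, 15.5, None],
    "Unnamed: 2": [None, None, None],
})

def drop_empty(table: pd.DataFrame) -> pd.DataFrame:
    """Remove rows and columns that contain no values at all."""
    return table.dropna(axis=0, how="all").dropna(axis=1, how="all")

cleaned = drop_empty(df)
print(df.shape, "->", cleaned.shape)
```

Only rows and columns that are entirely empty are removed. Cells that are missing here and there are left untouched.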
Then we convert it to an Excel file:
df.to_excel('file_path/file.xlsx')
The entire code:
import tabula
import pandas as pd
df = tabula.read_pdf('file_path/file.pdf', pages = 'all')[0]
df.to_excel('file_path/file.xlsx')

PDF containing several tables
We import the libraries:
import tabula
import pandas as pd
Then we read the PDF with the read_pdf() function from the tabula library.
This function automatically detects the tables in a PDF and converts them into DataFrames, ready to be exported to Excel.
Here, the variable df is actually a list of DataFrames: the first element corresponds to the first table, the second to the second table, and so on.
df = tabula.read_pdf('file_path/file.pdf', pages = 'all')
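Before saving anything, it can help to see how many tables were detected and how large each one is. Here is a small sketch, with a made-up list of DataFrames standing in for the result of read_pdf() (the names tables and summarize are ours, for illustration only):

```python
import pandas as pd

# Stand-in for the list returned by tabula.read_pdf(..., pages='all'); values are made up.
tables = [
    pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}),
    pd.DataFrame({"X": ["a", "b"]}),
]

def summarize(tables):
    """Return one (index, rows, columns) tuple per table."""
    return [(i, t.shape[0], t.shape[1]) for i, t in enumerate(tables)]

for i, rows, cols in summarize(tables):
    print(f"table {i}: {rows} rows x {cols} columns")
```

A table with an unexpected size is a hint that the detection did not go as planned for that part of the PDF.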
To save these tables separately, use a for loop that writes each table to its own Excel file:
for i in range(len(df)):
    df[i].to_excel('file_'+str(i)+'.xlsx')
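The same loop can be wrapped in a small function. The sketch below is one possible variant, not the only way to do it: it builds the file names first, then writes each table with index=False so that the DataFrame index does not end up as an extra first column in Excel. The names output_paths and save_tables, and the out_dir and prefix parameters, are assumptions made for this example.

```python
from pathlib import Path

def output_paths(n_tables, out_dir=".", prefix="file_"):
    """Build one .xlsx path per table: file_0.xlsx, file_1.xlsx, ..."""
    return [Path(out_dir) / f"{prefix}{i}.xlsx" for i in range(n_tables)]

def save_tables(tables, out_dir="."):
    """Write each DataFrame to its own Excel file, without the index column."""
    for path, table in zip(output_paths(len(tables), out_dir), tables):
        table.to_excel(path, index=False)

print([p.name for p in output_paths(3)])
# ['file_0.xlsx', 'file_1.xlsx', 'file_2.xlsx']
```

With the list returned by read_pdf(), you would call save_tables(df) to write every table at once.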
The entire code:
import tabula
import pandas as pd
df = tabula.read_pdf('file_path/file.pdf', pages = 'all')
for i in range(len(df)):
    df[i].to_excel('file_'+str(i)+'.xlsx')
And how do I save the tables in the same .xlsx file?
Hi Rafael,
To save multiple tables in the same .xlsx file, you can use the XlsxWriter library!
To install it:
!pip install XlsxWriter
And to use it, here’s the code:
import tabula
import pandas as pd
df = tabula.read_pdf('file_path/file.pdf', pages = 'all')
writer = pd.ExcelWriter('file_multiple_df.xlsx', engine='xlsxwriter')
for i in range(len(df)):
    df[i].to_excel(writer, sheet_name='Sheet'+str(i))
writer.close()  # writer.save() was removed in pandas 2.0
You’ll have an Excel file with multiple sheets, each containing a DataFrame! 🙂
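If you would rather give the sheets descriptive names than Sheet0, Sheet1 and so on, keep in mind that Excel limits sheet names to 31 characters and rejects the characters [ ] : * ? / \. Here is a small sketch of a helper for that (safe_sheet_name is a hypothetical name, not part of pandas or XlsxWriter):

```python
import re

def safe_sheet_name(name: str, index: int) -> str:
    """Replace characters Excel rejects in sheet names and cap the length at 31."""
    cleaned = re.sub(r"[\[\]:*?/\\]", "_", name).strip()
    # Fall back to a generic name when nothing usable is left.
    return (cleaned or f"Sheet{index}")[:31]

print(safe_sheet_name("Sales: Q1/Q2", 0))  # Sales_ Q1_Q2
print(safe_sheet_name("", 3))              # Sheet3
```

You would then pass sheet_name=safe_sheet_name(your_title, i) to to_excel(), where your_title is whatever label you want for table i.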
Why is read_pdf not giving me the same number of columns as the PDF for the following file? If I am right, multiple columns are merged together.
https://drive.google.com/file/d/1VRwU8C6I8KzMVz2IVKPkoqrncCcGZHh-/view?usp=sharing