In this article, we’ll look at how to quickly extract a table from a PDF to Excel using the Python language.
For this tutorial you will need two Python libraries :
- tabula-py
- pandas
To install them, go to your terminal/shell and type these lines of code:
pip install tabula-py
pip install pandas
If you use Google Colab, you can install these libraries directly on it. You just have to add an exclamation mark “!” in front of it, like this:
!pip install tabula-py
!pip install pandas
[smartslider3 slider=”10″]
Photo by Aurelien Romain on Unsplash
PDF to Excel (one table only) – Python
First we load the libraries into our text editor :
import tabula
import pandas as pd
Then, we will read the pdf with the read_pdf() function of the tabula library.
This function automatically detects the tables in a pdf and converts them into DataFrames. Ideal for converting them into Excel files!
df = tabula.read_pdf('file_path/file.pdf', pages = 'all')[0]
We can then check that the table has the expected shape.
df.head()
Then convert it to an Excel file !
df.to_excel('file_path/file.xlsx')
The entire code :
By the way, if your goal is to master Deep Learning - I've prepared the Action plan to Master Neural networks. for you.
7 days of free advice from an Artificial Intelligence engineer to learn how to master neural networks from scratch:
- Plan your training
- Structure your projects
- Develop your Artificial Intelligence algorithms
I have based this program on scientific facts, on approaches proven by researchers, but also on my own techniques, which I have devised as I have gained experience in the field of Deep Learning.
To access it, click here :
Now we can get back to what I was talking about earlier.
import tabula import pandas as pd
df = tabula.read_pdf('file_path/file.pdf', pages = 'all')[0]
df.to_excel('file_path/file.xlsx')
PDF containing several tables
We load the libraries in our text editor :
import tabula
import pandas as pd
Then, we will read the pdf with the read_pdf() function of the tabula library.
This function automatically detects the tables in a pdf and converts them into DataFrames. Ideal to convert them then in Excel file !
Here, the variable df will be in fact a list of DataFrame. The first element corresponds to the first table, the second to the second table, etc.
df = tabula.read_pdf('file_path/file.pdf', pages = 'all')
To save these tables separately, you will have to use a for loop that will save each table in an Excel file.
for i in range(len(df)):
df[i].to_excel('file_'+str(i)+'.xlsx')
The entire code :
import tabula
import pandas as pd
df = tabula.read_pdf('file_path/file.pdf', pages = 'all')
for i in range(len(df)):
df[i].to_excel('file_'+str(i)+'.xlsx')
Using Artificial Intelligence, you’ve managed to extract a table from a PDF into Excel.
But that’s only an ounce of what this technology can achieve!
Today, it’s thanks to Deep Learning that tech leaders can create the most powerful AIs.
If you want to deepen your knowledge in the field, you can access my Action plan to Master Neural networks.
A program of 7 free courses that I’ve prepared to guide you on your journey to learn Deep Learning.
If you’re interested, click here:
sources:
- Medium
- Photo by Birger Strahl on Unsplash
One last word, if you want to go further and learn about Deep Learning - I've prepared for you the Action plan to Master Neural networks. for you.
7 days of free advice from an Artificial Intelligence engineer to learn how to master neural networks from scratch:
- Plan your training
- Structure your projects
- Develop your Artificial Intelligence algorithms
I have based this program on scientific facts, on approaches proven by researchers, but also on my own techniques, which I have devised as I have gained experience in the field of Deep Learning.
To access it, click here :
E para salvar as tabelas no mesmo arquivo .xlsx ??
And to save the tables in the same .xlsx file ??
Hi Rafael,
In order to save multiple .xlsx in the same file you can use the XlsxWriter library !
To install it :
!pip install XlsxWriter
And to you use it, here’s the code :
import tabula
import pandas as pd
df = tabula.read_pdf('file_path/file.pdf', pages = 'all')
writer = pd.ExcelWriter('file_multiple_df.xlsx', engine='xlsxwriter')
for i in range(len(df)):
df[i].to_excel(writer, sheet_name='Sheet'+str(i))
writer.save()
You’ll have an excel file with multiple Sheet, each containing a dataframe ! 🙂
Why the read_pdf is not giving me the same number of columns as the pdf for the following file? If I am right, multiple columns are merged together.
https://drive.google.com/file/d/1VRwU8C6I8KzMVz2IVKPkoqrncCcGZHh-/view?usp=sharing