Quickly extract a table from a website in Python

How to extract a table from a website in a single line of Python code? It's easy with this Pandas function!

If you work in Data Science, you have surely come across the Pandas library!

It's the standard when working with Big Data: Pandas lets you easily manipulate large data sets.

But did you know that you can also extract tables directly from a web page?

Extract a table from a site

Pandas isn't just a simple data manipulation library.

Indeed, it also lets you do Web Scraping: extracting information from web pages.

How?

You simply have to use the read_html() function, passing it the URL of the targeted web page.

This function looks for every table in the web page and creates a DataFrame for each of them.

In the example below, we extract information about the economy of the United States from Wikipedia:

import pandas as pd

# read_html() returns a list of DataFrames, one per <table> found on the page
df = pd.read_html("https://en.wikipedia.org/wiki/Economy_of_the_United_States")

Then we can display the result:

df[3]

We directly get a DataFrame containing the table from the Wikipedia page!
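
To check the result, you can first look at how many tables were extracted, then preview the one we kept (a minimal sketch; the exact number of tables may change as the Wikipedia page evolves):

print(len(df))       # how many tables were found on the page
print(df[3].head())  # preview the first rows of the table we kept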

What to know before using it

Notice that we specified index '3' to display the DataFrame.

Indeed, the read_html() function looks for all the <table> HTML tags and extracts the information from each of them.

Thus, we don't retrieve just one table, but all the tables contained in the page.

In our case, the table we are interested in is at index '3'.

So feel free to browse the DataFrames returned by the read_html() function to find where your table is located!
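
For example, a small loop like this one (a minimal sketch reusing the df list from above) prints the index, shape and first columns of each extracted table, so you can spot the right one:

for i, table in enumerate(df):
    print(i, table.shape, list(table.columns)[:3])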

Sometimes web pages aren't up to standard and the extracted data comes out messy. So expect to do some data cleaning after calling this function.

Fortunately for us, in our example the data was already clean!

The reason is that on the internet's mainstay sites, like Wikipedia, pages are fully structured.
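
If your table does need cleaning, a few standard Pandas calls usually go a long way. This is only a generic sketch; the "Value" column in the comment is hypothetical and must be adapted to your own table:

table = df[3].copy()

# Drop rows and columns that are completely empty
table = table.dropna(how="all").dropna(axis=1, how="all")

# Hypothetical example: strip footnote markers like "[1]" and convert the column to numbers
# table["Value"] = pd.to_numeric(
#     table["Value"].astype(str).str.replace(r"\[.*\]", "", regex=True),
#     errors="coerce",
# )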

Pandas isn't the only library that allows you to do Web Scraping.

BeautifulSoup is a library specialized in this field and enables extraction of any kind of information from a web page, from tables to unstructured data!

We use it in depth in this article, where we analyze Elon Musk's tweets with Artificial Intelligence.
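
To give you an idea, here is a minimal sketch of how the same page could be fetched and parsed with BeautifulSoup (assuming the requests and beautifulsoup4 packages are installed):

import requests
from bs4 import BeautifulSoup

# Download the page and parse its HTML
html = requests.get("https://en.wikipedia.org/wiki/Economy_of_the_United_States").text
soup = BeautifulSoup(html, "html.parser")

# List every <table> tag found on the page
tables = soup.find_all("table")
print(len(tables))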

Tom Keldenich

Data Engineer & passionate about Artificial Intelligence!

Founder of the website Inside Machine Learning
