
Scraping HTML Tables using Python with lxml.html and Requests

I’m going to be looking at using lxml.html along with Requests to scrape data from HTML tables on the internet.

I personally find this useful as I do some bike racing and the league tables are available online. However, due to the number of riders these can be difficult to copy and paste into my own tables. This is where scraping the HTML tables comes in super handy. I get the data from the web page and usually export it to a CSV file using Pandas for manipulating later.

Let’s begin with our imports. These will include lxml.html, Requests and Pandas for loading the data into a DataFrame.

The Imports for Python

import requests
import lxml.html as html
import pandas as pd

Webpage of HTML Table

Next we will use some previous racing data I have for our HTML table. We will store this in a variable called url.

url = "http://results.smartiming.co.uk/view-race/cxnederwentsideccfoxberry2020senmv40/"


Storing the HTML Content of the Page

We now need to use Requests to get the page from that link. After that we will use lxml.html to parse the HTML of the page we want to scrape.

page = requests.get(url)
content = html.fromstring(page.content)

Finding the HTML Table

So what most tables online have in common are <tr> tags. These are the rows of the table. We will then use this knowledge to scan our HTML code and look for these rows.

tr_elements = content.xpath('//tr')
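As a quick sanity check, the same XPath works on any HTML snippet, so you can try it without hitting the network. Here is a minimal, self-contained sketch (the table markup below is made up for illustration, not the real results page):

```python
import lxml.html as html

# A made-up table standing in for the real results page
sample = """
<table>
  <tr><th>Pos</th><th>Name</th></tr>
  <tr><td>1</td><td>Alice</td></tr>
  <tr><td>2</td><td>Bob</td></tr>
</table>
"""

sample_rows = html.fromstring(sample).xpath('//tr')
print(len(sample_rows))  # 3: one header row plus two data rows
```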

Storing the Header Columns of the Table

The data we will pull first from our table is the headers of the columns. We will store these in a list called column_headers, which we create empty first. Since the headers are found at the top, our first item, we will use an index of 0 for our tr_elements.

column_headers = []

for column in tr_elements[0]:
    name = column.text_content()
    column_headers.append((name, []))

What we’re doing above is getting each header cell’s text (text_content()) and then appending it, paired with an empty list for that column’s data, to our column_headers list.

Storing the Remaining Columns of the Table

Now it’s time to get the remaining data into our column headers. We will more or less be doing the same as above, but with an additional for loop inside. This is because the first for loop will go through all the rows, while the second will go through all of the columns.

Let’s begin our row for loop. Since we have already got the headers using the above for loop, we no longer need that row. So we will use the range function to skip it by starting at 1. Remember that 0 is the first row. The range then runs up to the length of our tr_elements variable.

for row in range(1, len(tr_elements)):

Next we need to add an additional for loop so that it cycles through our columns. But before this we will store the current row in a variable as well as set up a counter.

table_tr = tr_elements[row]
column_count = 0

for column in table_tr.iterchildren():
    data = column.text_content()
    column_headers[column_count][1].append(data)
    column_count += 1

So the above code is taking each row and then going along its columns using our column_count variable. The data from each column is then appended to our column_headers list from earlier.
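As an aside, the manual counter and range-based indexing can also be written with list slicing and Python’s enumerate. This sketch does the same thing on a small made-up table (the row data here is invented for illustration):

```python
import lxml.html as html

# Made-up rows standing in for tr_elements (illustration only)
rows = html.fromstring(
    "<table>"
    "<tr><th>Pos</th><th>Name</th></tr>"
    "<tr><td>1</td><td>Alice</td></tr>"
    "<tr><td>2</td><td>Bob</td></tr>"
    "</table>"
).xpath('//tr')

headers = [(cell.text_content(), []) for cell in rows[0]]

for tr in rows[1:]:  # slicing skips the header row
    # enumerate replaces the manual column_count counter
    for i, cell in enumerate(tr.iterchildren()):
        headers[i][1].append(cell.text_content())

print(headers)  # [('Pos', ['1', '2']), ('Name', ['Alice', 'Bob'])]
```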

Saving the Data into a DataFrame

Now that we have all of the table’s data stored in our column_headers list, we can now add it to a DataFrame.

dictionary = {title: column for (title, column) in column_headers}
df = pd.DataFrame(dictionary)

That’s pretty much all there is to it. From here you could use the to_csv() function in Pandas to export it for use later.
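For example, the export could look like the sketch below (the DataFrame contents and filename here are made up; with a real scrape you would call to_csv() on the df built above):

```python
import pandas as pd

# A small made-up DataFrame standing in for the scraped results
df = pd.DataFrame({'Pos': ['1', '2'], 'Name': ['Alice', 'Bob']})

# index=False keeps the DataFrame's row index out of the file;
# passing no path returns the CSV as a string instead of writing a file
csv_text = df.to_csv(index=False)
print(csv_text)

# To write to disk instead:
# df.to_csv('results.csv', index=False)
```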

Full Code Example

import requests
import lxml.html as html
import pandas as pd

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 100)

url = "http://results.smartiming.co.uk/view-race/cxnederwentsideccfoxberry2020senmv40/"
page = requests.get(url)
content = html.fromstring(page.content)
tr_elements = content.xpath('//tr')

column_headers = []

for column in tr_elements[0]:
    name = column.text_content()
    column_headers.append((name, []))

for row in range(1, len(tr_elements)):
    table_tr = tr_elements[row]
    
    column_count = 0
    
    for column in table_tr.iterchildren():
        data = column.text_content()
        column_headers[column_count][1].append(data)
        column_count += 1

dictionary = {title: column for (title, column) in column_headers}
df = pd.DataFrame(dictionary)