Scraping football data (soccer, in the US) is a great way to build a comprehensive dataset for stats dashboards, cross-analysis, and insights for sports journalism or fantasy leagues.
Whatever your goal is, scraping football data can help you gather all the information you need to make it a reality. In this tutorial, we’re going to build a simple football data scraper with Axios and Cheerio in Node.js.
A Few Things to Consider Before Starting
Although you’ll be able to follow along with this tutorial without any prior knowledge, you’ll need a little experience in web scraping to make the most out of it and use this information in other contexts.
Building a Football Dataset with Web Scraping
1. Picking a Data Source
Although there are fundamental principles that apply to every web scraping project, every website is a unique puzzle to be solved. So is choosing where you’ll get your data from and how exactly you’ll use it.
Here are a few sources you could use to obtain football data:
For simplicity’s sake, we’ll scrape the BBC’s Premier League table page, since all the information we need sits directly inside the HTML document. A lot of the data on these websites also comes in the form of tables, which makes it crucial to learn how to scrape them effectively.
2. Understanding HTML Tables
Tables are a great way to organize content and give a high-level view of a dataset, making it easier for us humans to understand.
At first glance, every table is made up of two major elements: columns and rows. However, the HTML structure of tables is a little more complex than that.
Tables start with a <table> tag, telling the browser to render everything inside of it as a table. To define a row we use the <tr> tag, with a <td> tag for each table cell inside a <tr>. Something else we’ll see a lot are headers, which are represented by a <th> tag within the rows.
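To make that nesting concrete, the sketch below assembles a tiny table string in JavaScript. The buildTableHtml helper and the team data are made up for illustration; this is not the markup of any real page.

```javascript
// Hypothetical helper: builds an HTML table string to show how the tags nest.
function buildTableHtml(headers, rows) {
  // One <th> per column heading, all inside a single header row.
  const headerCells = headers.map((heading) => `<th>${heading}</th>`).join("");
  // One <tr> per record, with one <td> per cell.
  const bodyRows = rows
    .map((cells) => `<tr>${cells.map((cell) => `<td>${cell}</td>`).join("")}</tr>`)
    .join("");
  return `<table><tr>${headerCells}</tr>${bodyRows}</table>`;
}

// Made-up sample data, for illustration only.
const tableHtml = buildTableHtml(
  ["Team", "Played", "Points"],
  [
    ["Team A", 3, 9],
    ["Team B", 3, 6],
  ]
);
console.log(tableHtml);
```

Reading the output from the outside in gives the same order a scraper follows: find the table, walk its rows, then read each cell.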
Here’s an example from W3Schools:
Let’s explore our target table with this structure in mind by inspecting the page.
The code is a little messy, but this is the kind of HTML file you’ll find in the real world. Despite the mess, it still respects the <table>, <tr>, and <td> structure we discussed above.
So what’s the best approach to scraping the football data in this table? If we can create an array containing all rows, we can then iterate through each of them and grab the text from every cell. That’s the hypothesis, anyway. To confirm it, we’ll need to test a few things in our browser’s console.
3. Using Chrome’s Console for Testing
There are a couple of things we want to test. The first thing to test is whether the data resides in the HTML file, or if it’s being injected into it. We’ve already told you it does reside in the HTML file, but you’ll want to learn how to verify it on your own for future football scraping projects on other sites.
The easiest way to see whether or not the data is being injected from elsewhere is to copy some of the text from the table – in our case, we’ll copy the first team – and look for it in the page source.
Do the same thing with a few more elements to be sure. The page source is the HTML file before any rendering happens, so you can see the initial state of the page. If the element is not there, that means the data is being injected from elsewhere and you’ll need to find another solution to scrape it.
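The same check can be scripted once you have the raw HTML as a string. In this sketch, findMissingSnippets is a hypothetical helper and the sample HTML and team names are placeholders; in a real run the string would come from the body of an HTTP response (for example, response.data when using Axios).

```javascript
// Hypothetical helper: returns the snippets that do NOT appear in the raw HTML.
// An empty result suggests the data is present in the initial page source.
function findMissingSnippets(rawHtml, snippets) {
  return snippets.filter((snippet) => !rawHtml.includes(snippet));
}

// Placeholder HTML standing in for a downloaded page source.
const sampleSource =
  "<table><tbody><tr><td>Team A</td><td>9</td></tr></tbody></table>";

const missing = findMissingSnippets(sampleSource, ["Team A", "Team B"]);
console.log(missing); // lists any snippet that was not found in the source
```

If text you can see in the browser shows up in this list, the data is probably being injected after the page loads. Keep in mind that a plain substring match can miss text containing HTML entities such as &amp;amp;, so pick snippets without special characters.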
The second thing we want to test before coding our scraper is our selectors. For this, we can use the browser’s console to select elements using the .querySelectorAll() method, using the element and class we want to scrape.
The first thing we want to do is select the table itself.
Finally, we’ll select all the <tr> elements in the table: document.querySelectorAll("table > tbody > tr")
Awesome, 20 nodes! It matches the number of rows we want to scrape, so we now know how to select them with our scraper.
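One way to run this check from the console, and to see which position each cell holds, is sketched below. It is wrapped in a function that takes the document as a parameter so it can also be tried outside the browser; describeRows is a hypothetical name.

```javascript
// Hypothetical helper: counts the table rows and lists the text of each cell
// in the first row, together with its zero-based position.
function describeRows(doc) {
  const rows = doc.querySelectorAll("table > tbody > tr");
  const firstRowCells = rows.length > 0 ? rows[0].querySelectorAll("td") : [];
  return {
    rowCount: rows.length,
    // Array.from works on a NodeList as well as on a plain array.
    cells: Array.from(firstRowCells, (cell, index) => ({
      index,
      text: cell.textContent.trim(),
    })),
  };
}

// In the browser console:
// console.table(describeRows(document).cells);
```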
Note: Remember that when we have a node list, the count starts at 0 instead of 1.
The only thing missing is learning the position of each cell within a row. Spoiler alert: it goes from 2 to 10.
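Putting the row selector and the cell positions together, the extraction step could be sketched as below. The function expects $ to be the object returned by Cheerio’s cheerio.load(html). The field names, and the way positions 2 to 10 are mapped onto them, are assumptions about the table’s column order, so check them against the live page.

```javascript
// Assumed mapping of cell positions 2 to 10 onto column names.
const COLUMNS = [
  "team", "played", "won", "drawn", "lost",
  "goalsFor", "goalsAgainst", "goalDifference", "points",
];
const FIRST_CELL_INDEX = 2;

// `$` is expected to be the result of cheerio.load(html).
function extractRows($) {
  const results = [];
  $("table > tbody > tr").each((rowIndex, rowElement) => {
    const cells = $(rowElement).find("td");
    const record = {};
    COLUMNS.forEach((name, offset) => {
      // Wrap the raw element again to read its text content.
      record[name] = $(cells[FIRST_CELL_INDEX + offset]).text().trim();
    });
    results.push(record);
  });
  return results;
}

// Typical use, assuming cheerio is installed and `html` holds the page source:
// const cheerio = require("cheerio");
// const rows = extractRows(cheerio.load(html));
```

Each row becomes one plain object, which keeps the later export step simple.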
Awesome, now we’re finally ready to go to our code editor.
4. Getting Our Environment Ready
To begin the project, create a new directory/folder and open it in VS Code or your preferred code editor. Install Node.js and NPM first if you don’t already have them, then open your terminal and start your project with the command npm init -y.
Then, we’ll install our dependencies using NPM:
5. Sending the Initial Request
We’ll build our scraper as an async function so we can use the await keyword to wait for the response before working with it. Let’s open our function and send the initial request using Axios by passing the URL to it and storing the result in a variable called response. To test if it’s working, log the status of the response.
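A minimal sketch of this step follows. The HTTP client is passed in as a parameter (in the real script that would be the axios module) so the function can be tried with a stand-in. The function name and the URL in the usage comment are assumptions, not taken from this tutorial.

```javascript
// Sends a GET request with the supplied client and logs the HTTP status.
// `httpClient` is expected to expose an axios-style get(url) method.
async function sendInitialRequest(httpClient, url) {
  const response = await httpClient.get(url);
  console.log(response.status); // 200 means the request succeeded
  return response;
}

// Typical use, assuming axios is installed (npm install axios) and the
// placeholder URL is replaced with the page you want to scrape:
// const axios = require("axios");
// sendInitialRequest(axios, "https://example.com/league-table")
//   .catch((error) => console.error(error.message));
```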
Run it with node index.js, and in less than a second, you’ll have a CSV file ready to use.
You can use the same process to scrape virtually any HTML table you want and grow a huge football dataset for analytics, result forecasting, and more.
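The CSV step can be sketched with Node’s built-in fs module alone. This is one possible approach, not necessarily the one used to produce the file described above; the toCsv and saveCsv helpers, the field names, and the output filename are illustrative assumptions.

```javascript
const fs = require("node:fs");

// Wraps a value in quotes when it contains a comma, quote, or line break.
function escapeCsvValue(value) {
  const text = String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Converts an array of flat objects into CSV text, using the keys of the
// first object as the header row.
function toCsv(records) {
  if (records.length === 0) return "";
  const headers = Object.keys(records[0]);
  const lines = records.map((record) =>
    headers.map((key) => escapeCsvValue(record[key] ?? "")).join(",")
  );
  return [headers.join(","), ...lines].join("\n") + "\n";
}

// Writes the records to disk; the default filename is a placeholder.
function saveCsv(records, filePath = "./football.csv") {
  fs.writeFileSync(filePath, toCsv(records), "utf8");
}

// Made-up sample rows, for illustration only.
const sampleCsv = toCsv([
  { team: "Team A", played: "3", points: "9" },
  { team: 'Team "B", FC', played: "3", points: "6" },
]);
console.log(sampleCsv);
```

Quoting values that contain commas matters here, since a club name with a comma would otherwise shift every column after it.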
Integrate ScraperAPI into Your Football Web Scraper
Let’s add a layer of protection to our scraper by sending our request through ScraperAPI. This will help us scrape several pages without risking getting our IP address blocked or being permanently banned from those pages.
Our request may take a little longer, but in return, it will rotate our IP address automatically for every request, use years of statistical analysis and machine learning to determine the best header combination, and handle any CAPTCHA that might appear.
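A minimal sketch of routing the request through ScraperAPI follows, assuming its query-string endpoint (http://api.scraperapi.com with api_key and url parameters); confirm the exact format against ScraperAPI’s current documentation. YOUR_API_KEY and the target URL are placeholders, and buildScraperApiUrl is a hypothetical helper.

```javascript
// Builds the proxied request URL; searchParams percent-encodes the target URL.
function buildScraperApiUrl(apiKey, targetUrl) {
  const endpoint = new URL("http://api.scraperapi.com/");
  endpoint.searchParams.set("api_key", apiKey);
  endpoint.searchParams.set("url", targetUrl);
  return endpoint.toString();
}

const proxiedUrl = buildScraperApiUrl(
  "YOUR_API_KEY",
  "https://example.com/league-table"
);
console.log(proxiedUrl);

// The scraper would then request `proxiedUrl` instead of the target URL, e.g.
// const response = await axios.get(proxiedUrl);
```

Because only the URL changes, the rest of the scraper can stay as it is.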