How To: Use Python to Collect Data from the Web – Part 1: Parsing Data

While conducting research, it is often the case that the data we need for a particular analysis is available online. Unfortunately, not all data repositories have APIs to make extraction easy. When that is the case, it is tempting to collect the data in a “brute force” manner, such as copying and pasting from a web page into a spreadsheet. When the number of observations you need to collect is large, however, this method is extremely time-consuming and error-prone. A better approach is to develop a script that collects and saves the data for you, and Python’s powerful web tools make this straightforward. In keeping with previous ZIA Python how-to’s, the following tutorial explains how to find, collect, and parse data from the web using a combination of Python packages.

The tutorial assumes a general understanding of both Python and HTML, so if you are unfamiliar with either of these languages I suggest referring to these respective introductions. If you are ready to proceed, you will need to download and install the html5lib package, which we will use to parse the HTML and extract the data. (Note: I have previously used BeautifulSoup for these tasks, but while developing the code for this tutorial I found that BeautifulSoup is no longer being actively supported and the community is moving to html5lib. Those of you still using BeautifulSoup for HTML parsing should consider transitioning to this more robust parsing platform.) All other packages used in this tutorial ship with the Python base installation. Let’s begin; here is the setup:

I am an avid NFL fan, and in particular a New York Giants fan. After the draft this spring, I wanted to collect some basic information on all of the Giants’ draft picks, e.g., height, weight, D.O.B., and the college where they played. I went to NFL.com and created the following comma-delimited spreadsheet:
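A minimal illustration of the layout (one row per pick, the player’s name in the first column, and empty columns to be filled in later; the exact column order here is an assumption):

```
Player,Height,Weight,DOB,College
Travis Beckum,,,,
Andre Brown,,,,
Rhett Bomar,,,,
```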

The individual player data are only available on each player’s profile page; rather than navigate to each page by hand, I decided to write a script that performs the following tasks:

  1. Use the player name data from the spreadsheet to retrieve the unique player profile URL with a search query
  2. Download each player’s profile web page into Python
  3. Parse and clean the data, then store all of it as a Python dict type

In part 2 of the tutorial I cover how to take the parsed data and save it back into the spreadsheet; in the meantime, let’s begin:

The first step is to get the player names out of the spreadsheet and into Python so that we can manipulate them as part of our search string at NFL.com. Because the names are stored in a CSV file, we will use Python’s built-in csv module to do this. I have saved the spreadsheet on my hard drive as NYG_DRAFT_PICKS.csv, so I will create a function to open the file, retrieve the names, and store them as a list.
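A minimal sketch of such a function (written for Python 2, like the rest of this tutorial, and assuming the names sit in the first column beneath a header row):

```python
import csv

def get_names(file_path="NYG_DRAFT_PICKS.csv"):
    """Open the draft-pick spreadsheet and return the player names as a list."""
    players = []
    reader = csv.reader(open(file_path, "rb"))  # Python 2: open CSVs in binary mode
    reader.next()  # skip the header row (an assumption about the file)
    for row in reader:
        players.append(row[0])  # assumes names are in the first column
    return players

players = get_names()
```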

We now have all of the player names stored as strings in the list players. Next we search for these players at NFL.com to retrieve their individual profile URLs, which is where the data we actually want is located. This intermediate step is very common when collecting data from the web, as repositories will nearly always have some unique means of identifying their data. Unfortunately, it is often not clear how these identification schemes relate to what you want, so you may have to spend some time thinking of an efficient way to gather the data before you can begin. In our case, NFL.com assigns each player a unique ID string, along with a URL based on his name. For example, Rhett Bomar’s player profile URL is:

http://www.nfl.com/players/rhettbomar/profile?id=BOM404041

As the collector, you have to be cognizant of the pattern used by the web page you are pulling data from and build that pattern into your script. To accomplish this task, I use Python’s built-in web interface module, urllib2. First, I append each player’s name to NFL.com’s search URL, then locate the hyperlink tag on the returned page to extract the player’s profile URL.
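Here is a sketch of that search step. The exact NFL.com search URL, its team parameter, and the markup of the results page are not reproduced in this tutorial (and have likely changed since), so the query string and href pattern below are placeholders:

```python
import urllib2

# Hypothetical search URL: %s takes the player's name, and team=NYG is
# the team ID specifier discussed below.
SEARCH_URL = "http://www.nfl.com/players/search?query=%s&team=NYG"

def get_profile_urls(players):
    """Search NFL.com for each player and pull his profile URL from the results."""
    profile_urls = []
    for name in players:
        try:
            search_page = urllib2.urlopen(SEARCH_URL % name.replace(" ", "+")).read()
        except UnicodeError:
            # Names with non-ASCII characters may fail to encode into the
            # URL; this handling is discussed below.
            continue
        # Simple string matching: find the hyperlink tag pointing at a
        # profile page (the /profile?id= pattern shown above).
        for line in search_page.splitlines():
            if '/profile?id=' in line and 'href="' in line:
                start = line.index('href="') + len('href="')
                profile_urls.append(line[start:line.index('"', start)])
                break
    return profile_urls

profile_urls = get_profile_urls(players)
```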

Note that the search URL contains a team ID specifier. As it happens, each player on this list has a unique name among NFL players, so this addition is not strictly necessary; that may not have been the case, however, so I added it to ensure that the profile URL returned was correct. These types of details are critical when extracting data from the web, especially when your task is large: you always want to be as specific as possible when defining how your data is extracted. Also note that in this example I extracted the hyperlink data with a simple string-matching loop rather than the HTML parser used in the next step. I could have used the parser here as well, but I included this approach to highlight that it is sometimes easier, especially when you are only retrieving a single, specific piece of data.

A final important thing to note from this step is the error handling for UnicodeErrors. Data collected from the web will nearly always be in Unicode format, and Python has very specific ways of dealing with it. If you are unfamiliar with Unicode, or with how to handle it in Python, I highly recommend perusing this tutorial before proceeding, as many of the data cleaning steps in this tutorial are left to the reader.

Our final step is to collect and parse the data we were interested in to begin with. To accomplish this, we will loop through each URL collected in the previous step, download the data, parse and clean it, then store it in a dict indexed by each player’s name and the data point of interest. This step is often the most involved, and as the collector you must become very good at recognizing patterns in the HTML that can be exploited to extract the data you are looking for.
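Below is a sketch of that loop. It assumes the html5lib release current at the time of writing, which still shipped a "beautifulsoup" tree builder (later releases removed it), and clean_player_data is a hypothetical wrapper around the cleaning helpers shown further down:

```python
import urllib2
import html5lib
from html5lib import treebuilders

def get_player_data(players, profile_urls):
    """Download each profile page, parse it, and store the cleaned data in a dict."""
    # Declare the parser with a BeautifulSoup tree type (see the note below)
    parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("beautifulsoup"))
    player_data = {}
    for name, url in zip(players, profile_urls):
        soup = parser.parse(urllib2.urlopen(url))
        # The data of interest are the only contents of <p> tags on the
        # profile page, so pull every <p> node from the parse tree...
        paragraphs = soup.findAll("p")
        # ...then keep only the entries holding the data (the slice 2:5)
        # and reduce each node to its text.
        raw = ["".join(p.findAll(text=True)) for p in paragraphs[2:5]]
        player_data[name] = clean_player_data(raw)
    return player_data

player_data = get_player_data(players, profile_urls)
```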

The first thing you will note is the use of a BeautifulSoup tree type in the parser declaration. This is purely a matter of personal convenience, because I am familiar with this tree type from previous work; html5lib has several other tree types that may be of more use to you. We are fortunate in this case because the data of interest are the only pieces of the parse tree inside a p HTML tag (again, examining the HTML for patterns is critical), so I simply tell the parser to return all child nodes indexed by this tag, and the results are my data. I do not need everything that is returned, so I then copy from the data list only what I need, in this case the slice 2:5.

At this point we have everything we need, but there is still a great deal of data cleaning to complete before we are done. Each piece of data (height, weight, D.O.B., and college) has a slightly different pattern of surrounding characters that must be eliminated, so I create three small helper functions to clean the data. Each of these functions returns only the data I am interested in and stores it back into a sub-dict indexed by the data point of interest.
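The exact label text surrounding each field is not reproduced in this tutorial, so the string patterns assumed below are illustrative; a sketch of the helpers, plus the hypothetical clean_player_data wrapper used above:

```python
def clean_height_weight(text):
    """Return height (e.g. u'6-3') and integer weight from a string
    assumed to look like u'Height: 6-3  Weight: 243'."""
    parts = text.split()
    return parts[1], int(parts[3])

def clean_dob(text):
    """Strip the label from a string assumed to look like u'Born: 1/24/1987'."""
    return text.split()[1]

def clean_college(text):
    """Strip the label from a string assumed to look like u'College: Wisconsin'."""
    return text.split(":")[1].strip()

def clean_player_data(raw):
    """Run the helpers over the raw <p> strings and build the sub-dict."""
    height, weight = clean_height_weight(raw[0])
    return {"height": height, "weight": weight,
            "dob": clean_dob(raw[1]), "college": clean_college(raw[2])}
```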

The resulting dict is organized like this:

```python
{'Travis Beckum': {'dob': u'1/24/1987', 'college': u'Wisconsin',
                   'weight': 243, 'height': u'6-3'},
 'Andre Brown': {'dob': u'12/15/1986', 'college': u'North Carolina State',
                 'weight': 224, 'height': u'6-0'},
 ...}
```

We are done (with collecting the data, that is)! We have downloaded all of the player statistics we were interested in and stored them in a Python dict, which we will use later to save the data back into the spreadsheet. In part 2 of this tutorial I review how to accomplish this and finish our data collection task. I hope you have found this tutorial informative and will use it to build confidence in using Python to extract data from the web. As always, I welcome your questions and comments. All of the code for this tutorial is available in the ZIA code repository as web_scrapper_howto.py; feel free to download and use it.

Photo: New York Super Blog


