Mining The Old Bailey, Part I
I've spent the last few days brushing up the first piece of code I'll be using in my current text-mining project, reshaping my last post's beetle into an effective spider. At this point, though, I should mention that a fair bit of good old-fashioned leg-work went into this stage of the project before I even started writing any code. After all, how can you tell your program what to look for if you aren't sure of it yourself? So I poked around The Proceedings of the Old Bailey online and searched for my topic ("animal theft") by hand. This gave me a sense of how search results are displayed, as well as the syntax of the URLs for the search display pages and the trial records. Search results are displayed in sets of ten hits, and each search page's place in the overall set of hits is reflected in the last few characters of its URL, which range from zero up through 2,720. I had hoped that I'd be able to view all hits on one page (making it way easier to pull out links to trial records), but no such luck. And since it's always good to start small before going big, I decided to limit my working body of records to only 20: large enough to provide some diversity of content, but small enough to handle quickly and easily. Once everything is up and running smoothly, I plan to expand this body to all 2,726 records housed on the website.
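To make that offset arithmetic concrete, here's a minimal sketch of how the search-page URLs could be generated; the base URL and the "start" parameter are placeholders standing in for the site's actual query syntax, not a transcription of it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Search hits come ten to a page, and each page's offset appears at the end
# of its URL (0, 10, ..., 2720). The base URL and parameter name below are
# placeholders, not the Old Bailey site's real query syntax.
my $base = 'http://www.oldbaileyonline.org/search?terms=animal+theft&start=';

my @search_pages = map { $base . $_ * 10 } 0 .. 272;
print scalar(@search_pages), " search pages in all\n";    # 273 pages

# For now, only the first two pages (20 hits) make up the working set:
my @working_set = @search_pages[ 0 .. 1 ];
```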
Downloading and storing each search result page as an HTML file was fairly simple using the Perl module LWP. The next step was to find and follow all the links that lead to trial records. I devised a rather lengthy regular expression that recognizes the HTML "a href=" tags leading to these records (as opposed to those leading to the "home page" or "next results page", for example) based on the syntax of their URLs. (This was in lieu of using the Perl Mech module I mentioned in my last post, which, as it turns out, doesn't follow links after all...) It took me a few tries to get the regex just right, but once it recognized all the relevant links, I used LWP again to get and store these trial records as HTML files.
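Roughly, the find-and-follow step looks something like the sketch below. This isn't my exact code: the saved filenames and the "trial_id" marker in the regex are purely illustrative stand-ins for whatever distinguishes the real trial-record URLs, and LWP::Simple stands in for the LWP interface used.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(getstore is_success);

my @saved_search_pages = ( 'search_00.html', 'search_10.html' );   # fetched earlier

my %trial_urls;    # hash keys de-duplicate repeated links
foreach my $file (@saved_search_pages) {
    open my $fh, '<', $file or die "Can't open $file: $!";
    my $html = do { local $/; <$fh> };    # slurp the whole page
    close $fh;

    # Grab the target of every <a href="..."> whose URL looks like a trial
    # record; "trial_id" is an invented marker, not the site's real syntax.
    while ( $html =~ m{<a\s+href="([^"]*trial_id[^"]*)"}gi ) {
        $trial_urls{$1} = 1;
    }
}

# Fetch and store each trial record as its own HTML file. (If the links are
# relative, the site's base URL would have to be prepended first.)
my $count = 0;
foreach my $url ( sort keys %trial_urls ) {
    my $status = getstore( $url, sprintf( 'trial_%02d.html', $count++ ) );
    warn "Failed to fetch $url (status $status)\n" unless is_success($status);
}
```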
At this point, I have a nice local repository of HTML files on my hard drive. The next stage of the game will be to parse out the HTML tags from the trial content to make for more efficient text analysis (the fun stuff!) later on. Look forward to more on parsing and Perl modules in my next post...
Tags: the old bailey | digital history | perl | perl modules | LWP