
Past Matters

Friday, June 30, 2006

Mining The Old Bailey, Part I

I've spent the last few days brushing up the first piece of code I'll be using in my current text-mining project, reshaping my last post's beetle into an effective spider. At this point, though, I should mention that a fair bit of good old-fashioned leg-work went into this stage of the project before I even started writing any code. After all, how can you tell your program what to look for if you aren't sure of it yourself? So I poked around The Proceedings of the Old Bailey online and searched for my topic ("animal theft") by hand. This gave me a sense of how search results are displayed, as well as the syntax of the URLs for both the search result pages and the trial records. Search results are displayed in sets of ten hits, and each search page's place in the overall set of hits is reflected in the last few characters of its URL, which range from zero up through 2,720. I had hoped that I'd be able to view all hits on one page (making it much easier to pull out links to trial records) but no such luck. And since it's always good to start small before going big, I decided to limit my working body of records to only 20--large enough to provide some diversity of content, but small enough to handle quickly and easily. Once everything is up and running smoothly, I plan to expand this body to all 2,726 records housed on the website.
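The paging arithmetic described above can be sketched in a few lines. This is a minimal illustration in Python rather than the Perl used for the project, and the base URL is a made-up placeholder, since the post does not give the real search URL syntax. It assumes only what the post states: ten hits per page and an offset at the end of each URL.

```python
def page_offsets(total_hits, page_size=10):
    """Return the starting offset of every results page."""
    return list(range(0, total_hits, page_size))


def search_page_urls(base_url, total_hits, page_size=10):
    """Append each offset to a base URL (the base URL here is hypothetical)."""
    return [f"{base_url}{offset}" for offset in page_offsets(total_hits, page_size)]


# Placeholder address, NOT the real Old Bailey search URL.
BASE_URL = "https://example.org/search?start="

small_run = search_page_urls(BASE_URL, 20)    # the 20-record working set
full_run = page_offsets(2726)                 # all "animal theft" records

print(small_run)
print(len(full_run), full_run[0], full_run[-1])
```

For the 20-record working set this yields just two result pages; for all 2,726 records it yields 273 pages, the last one starting at offset 2,720, which matches the range seen in the URLs.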
Downloading and storing each search result page as an HTML file was fairly simple using the Perl module LWP. The next step was to find and follow all the links that lead to trial records. I devised a rather lengthy regular expression that recognizes HTML "a href=" tags leading to these records (as opposed to those leading to the "home page" or "next results page", for example) based on the syntax of their URLs. (This was in lieu of using the WWW-Mechanize module I mentioned in my last post which, as it turns out, doesn't follow links after all...) It took me a few tries to get the regex just right, but once it recognized all the relevant links, I used LWP again to get and store these trial records as HTML files.
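The link-filtering step might look something like the sketch below. It is written in Python rather than Perl, and the `record?id=` URL shape and the sample HTML are invented for illustration; the actual regular expression and the real trial-record URL syntax are not given in the post.

```python
import re

# Hypothetical pattern: assumes trial-record links contain "record?id=".
# The real site's URL syntax would need to be substituted here.
TRIAL_LINK = re.compile(r'<a\s+[^>]*href="([^"]*record\?id=[^"]+)"', re.IGNORECASE)


def trial_links(html):
    """Return trial-record hrefs in page order, skipping duplicates."""
    seen = set()
    links = []
    for href in TRIAL_LINK.findall(html):
        if href not in seen:
            seen.add(href)
            links.append(href)
    return links


# Invented sample of a search results page.
sample_page = """
<a href="/index.html">Home page</a>
<a href="/record?id=t0001-1">Trial 1</a>
<a class="hit" href="/record?id=t0001-2">Trial 2</a>
<a href="/search?start=10">Next results page</a>
<a href="/record?id=t0001-1">Trial 1 (repeat)</a>
"""

print(trial_links(sample_page))
```

Only the two record links survive; the home page and next-page links are ignored because their URLs do not match the pattern. A regex like this is brittle if the markup changes, so a proper HTML parser is often the safer choice for larger crawls.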
At this point, I have a nice local repository of HTML files on my hard drive. The next stage of the game will be to parse out the HTML tags from the trial content to make for more efficient text analysis (the fun stuff!) later on. Look forward to more on parsing and Perl modules in my next post...
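As a rough preview of that parsing stage, here is one possible way to strip tags from a saved page. It is a sketch in Python using the standard library's `html.parser`, not the Perl approach the project will actually use, and the sample markup is invented.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, ignoring anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def strip_tags(html):
    """Return the page's text with tags removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


# Invented sample markup standing in for a stored trial record.
sample_record = (
    "<html><body><h1>Trial</h1>"
    "<p>The prisoner was <b>indicted</b> for stealing a sheep.</p>"
    "</body></html>"
)
print(strip_tags(sample_record))
```

Joining the text chunks with spaces keeps words from adjacent elements from running together, at the cost of occasionally adding a space before punctuation that follows a closing tag.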


Tuesday, June 27, 2006

Insects and the Internet

As a Digital History Intern, I'm spending the summer months learning how, as an historian, to use the web effectively as a resource for scholarship. So far, this has primarily consisted of completing a few small-scale projects centered on understanding basic HTML and Perl, and exploring some of the vast possibilities for digital research and presentation. Recently, I've taken the first steps towards a more ambitious project involving text mining and the fantastic online repository of trial records at The Proceedings of the Old Bailey. This is a complete collection of over 100,000 criminal trials held at the London court between 1674 and 1834. My goal is to download and scrape the 2,726 records categorized as "Animal Theft."

But, as I expected, this is more easily said than done. For one thing, I can't seem to stop reinventing the wheel. Finding it difficult to import existing complex code and modules into my own work, I end up starting from scratch, doing more than might be necessary and in a much less effective or elegant manner. Part of this is because my very modest expertise in Perl (if it can even be called that) lies somewhere between the very basic web resources and introductory literature available to Perl novices, and the more advanced instructional works like the well-known O'Reilly series on programming. So for instance, I know from Bill Turkel's Digital History Hacks archives that there is a Perl module called WWW-Mechanize which allows one to spider through links on a given web page or site. Predictably, though, I haven't installed it properly (despite trying several times), and so am stymied by my own ineptitude. While I await guidance, I've tried to circumvent this obstacle by approximating a spider using several search results pages from the Old Bailey web database. Although this has returned many of the relevant trial identification numbers I'm interested in, I've had to do a fair bit of coding by hand that the Mech module would have done for me. As it stands, my creation is much closer to an inelegant beetle trapped on its back waving a set of useless legs than to a true spider daintily traversing a trail of links...


Wednesday, June 21, 2006

Ta da...!

It's a little late to be jumping on the blog bandwagon, I know--"Past Matters" was about the fiftieth name I came up with (the rest of my top choices were already taken). But I'm of the mind that it's never too late to start blogging about learning history! Having recently had a very brief introduction to blogging (among other digital topics) at the Center for History and New Media's Doing Digital History Workshop, I thought it was about time to start my own blog. I'm hoping that "Past Matters" will serve as a forum for working through research questions, issues, and conundrums as I embark in the coming months on the long road towards earning a PhD in the history of science (and related topics). For the next few months, though, my posts will focus on my forays into the web as a Digital History Intern. Stay tuned for my next post on mining the Old Bailey...