This page will give you a brief overview of how to use the LOCKSS Plugin Tool. See the Plugin Tool page for download information.

Before You Start Clicking

When the generator starts you will see a window which contains the fields needed to define the plugin, like this.

The pull-down menus for this window are:

File -> New, Open, Save, Save As, Exit
These allow you to load and save the XML files defining the plugin.
Edit -> Cut, Copy, Paste, Delete
These operate on editable text fields.
Plugin -> Expert Mode, Test Crawl Rules, Test Filters, Validate Plugin
These are described below.
Help -> About
Brings up version and copyright info.

Expert Mode allows more control over the plugin; its details are documented separately. In most cases you will not need it. Instead, you define your plugin by filling in the fields in the window in turn. They are:

Plugin Name
The human-readable name that the plugin will appear under in the configuration menu of the LOCKSS administrative user interface. Spaces are OK here.
Plugin ID
The Java class name for the plugin.
Plugin Version
A version number that distinguishes successive versions of the same plugin. It should start at 1.
Configuration Parameters
This defines the set of parameters that the plugin will use to identify and differentiate between each AU that uses the plugin.
Plugin Notes
Textual notes that are attached to the plugin.
Start URL Template
This template tells the plugin where on the publisher's web site to find the publisher manifest page for each AU of the journal.
AU Name Template
This template tells the plugin how to use the Configuration Parameters to create the name under which each AU using the plugin appears in the LOCKSS administrative UI.
Crawl Rules
These rules define the boundaries of an archival unit (AU) in the journal's web site. An AU is normally a year's run or a volume of the journal.
Pause Time Between Fetches
The time for which the LOCKSS daemon waits after fetching each page from the publisher's web site.
New Content Crawl Interval
The time between attempts by the LOCKSS daemon to find new content on the publisher's web site.

Example Journal

We will use the real journal Disputatio as an example. Its home page is at http://disputatio.com. Its publisher manifest page for 2004 is at http://disputatio.com/lockss2004.html. A typical article from 2004 is at http://disputatio.com/articles/016-3.pdf - it is the third article in issue number 16 of the journal. That's all the information we need to get started.

Defining a Basic Plugin

  • We will call the Disputatio plugin Disputatio Plugin. Type that into the Plugin Name field.
  • The Plugin ID for the plugin will be the reverse of your institution's DNS domain, followed by plugin, followed by a class name. For example, ours will be called org.lockss.plugin.DisputatioPlugin. Click on the UNKNOWN and type it, substituting your reversed DNS name for org.lockss.
  • We are defining version 1 of the Disputatio plugin, so make sure the version field is 1.
  • Now click on the row of dots beside Configuration Parameters. A window pops up looking like this. In this case, the only parameters we need the administrator to define are the base_url, which is always required so that the plugin can find the journal (it will be http://disputatio.com/), and the year (it will be 2004, etc.), so it can find the publisher manifest pages (at http://disputatio.com/lockss2004.html, etc.). Select year and click Add.
  • The window should look like this. Now click OK.
  • Now we use these parameters to define the Start URL Template, which tells the plugin where to find the publisher manifest page. Click on the NONE beside it. A template editor window pops up, looking like this.
  • Choose Base URL from the pull-down menu and click Insert Parameter.
  • Choose String Literal and click Insert Parameter; type lockss into the popup then click OK.
  • Choose Year from the pull-down menu and click Insert Parameter.
  • Choose String Literal and click Insert Parameter; type .html into the popup then click OK.
  • The Editor View tab shows you the resulting format string into which the parameters in blue are substituted to obtain, in our case, http://disputatio.com/lockss2004.html (etc.). Click Save.
  • Next we specify the name that each AU captured by this plugin will have in the LOCKSS Administrative user interface. Click on the NONE beside AU Name Template to get a template editor window that looks like this.
  • Suppose we want the name of the 2004 AU to be Disputatio 2004 (http://disputatio.com), i.e. the word Disputatio followed by a space followed by the year followed by a space followed by a parenthesis followed by the base URL followed by a parenthesis.
  • Choose String Literal from the pull-down menu, click Insert Parameter, enter Disputatio (i.e. the word Disputatio followed by a space) in the popup window, then click OK.
  • Choose Year from the pull-down menu, click Insert Parameter.
  • Choose String Literal from the pull-down menu, click Insert Parameter, enter ( (i.e. a space followed by a parenthesis), then click OK.
  • Choose Base URL from the pull-down menu, click Insert Parameter.
  • Choose String Literal from the pull-down menu, click Insert Parameter, type ) in the popup, then click OK.
  • The window looks like this.
  • Click Save.
  • Next we need to specify a series of crawl rules that the crawler applies in order to each URL, starting with the publisher manifest page's URL and continuing with the URLs found in it, and the URLs found in those pages, and so on. Click on the row of dots beside Crawl Rules to get a window that looks like this. The left column contains actions that are taken if the URL matches the pattern in the right column. Two rules are pre-specified for you; they will be needed in almost every case. The first says that anything that doesn't start with base_url (in our case, http://disputatio.com/) will be excluded. The second says that the publisher manifest page will be included. The rules are in the form of a "printf" format string which has the configuration parameters for the AU substituted into it before being used as a regular expression pattern.
  • Regular expressions are full of pitfalls for the unwary. They use many meta-characters with special meanings; the ^ at the beginning of the first rule is an example, forcing the pattern to match only at the beginning of the URL. So the combination of format string and regular expression would be hard to type by hand. The tool makes this easy by building them up in stages. If you really want the gory details, look here.
  • Now we want to make sure that any articles linked from the manifest page are included. A typical URL we want to include is http://disputatio.com/articles/016-3.pdf -- so we want the pattern to match the base_url followed by the text articles/ followed by anything followed by .pdf.
  • Click Add to get a default rule to edit. It will have the Include action, which is what we want, and the NONE pattern, which we edit by clicking on it to get a window that looks like this.
  • Choose Base URL from the Insert Parameter pull-down menu and click on Insert Parameter to start the pattern with base_url, like this.
  • Now pull down the menu by Insert Match and choose String Literal. Click Insert Match and a box pops up into which you type the text you want to match, in our case articles/.
  • Click OK and the pattern editor window looks like this.
  • Next we want to match anything, so pull down the menu by Insert Match, choose Anything and click Insert Match. The pattern editor window looks like this.
  • Note the .* you just inserted. This is a regular expression made of two meta-characters. The period matches any single character; the star means zero or more repetitions of whatever precedes it, which here is the period. So .* matches any string of characters, specifically in our case strings like 016-3.
  • Next we want to match .pdf. Pull down the menu by Insert Match, choose String Literal and click Insert Match.
  • A box pops up into which you type .pdf.
  • Click OK. The pattern editor window now looks like this.
  • Note the backslash (\) before the period before pdf. Period is a regular expression meta-character, so it must be escaped with a backslash beforehand if it is to match an actual period. The tool generates these "escape" meta-characters automatically.
  • Now we're done with this rule, so click Save. The Crawl Rule window now looks like this.
  • We're done for now, so click OK. The base plugin editor window now looks like this.
  • We are now ready to start testing the plugin. Pull down the Plugin menu and choose Test Crawl Rules....
  • A window pops up that allows you to provide the configuration parameters you decided earlier that were needed to identify an AU. Type values for the configuration parameters, in our case base_url is http://disputatio.com/ and year is 2004. Note the trailing / on the base_url.
  • Click Check AU.
  • Now a window looking like this pops up with pre-loaded fields for the publisher manifest page you want to start testing from (in our case http://disputatio.com/lockss2004.html), the Test Depth (a file linked from the manifest page has a depth of 1, a page linked from that page has a depth of 2, and so on) and Fetch Delay (the number of seconds to pause between fetching pages).
  • Click Check URL to start the test. The test results appear in the scrolling text pane.
  • In our case the results are:
Checking http://disputatio.com/lockss2004.html
crawl depth: 1     crawl delay: 6000 ms.

Depth 1

Fetching http://disputatio.com/lockss2004.html
Type: text/html, extracting Urls

Included Urls: (5 new, 0 old)
http://disputatio.com/articles/016-1.pdf
http://disputatio.com/articles/016-2.pdf
http://disputatio.com/articles/016-3.pdf
http://disputatio.com/articles/016-4.pdf
http://disputatio.com/articles/016-5.pdf

Excluded Urls: (5 new, 0 old)
http://disputatio.com/footer.js
http://disputatio.com/header.js
http://disputatio.com/images/lockss-logo-small.gif
http://disputatio.com/style.css
http://lockss.stanford.edu/


Summary for starting Url: http://disputatio.com/lockss2004.html and depth: 1

Urls fetched: 1    Urls extracted: 10

Depth  Fetched  Parsed  New URLs
    1        1       1         5

Remaining unfetched: 5

Elapsed Time: 0 secs.    Fetch Rate: 0 p/m
  • You will see that our rules correctly included all the articles and correctly excluded http://lockss.stanford.edu/, but there were some files excluded that should have been included. We need to add rules including base_url followed by anything followed by .js, base_url followed by anything followed by .css and base_url followed by images/ followed by anything followed by .gif.
  • Click on the row of dots by Crawl Rules and follow the same process you used to insert the articles/ rule to insert these rules. At the end, the Crawl Rule window should look like this.
  • Click OK and test this again. The results pane should contain this:
Checking http://disputatio.com/lockss2004.html
crawl depth: 1     crawl delay: 6000 ms.

Depth 1

Fetching http://disputatio.com/lockss2004.html
Type: text/html, extracting Urls

Included Urls: (9 new, 0 old)
http://disputatio.com/articles/016-1.pdf
http://disputatio.com/articles/016-2.pdf
http://disputatio.com/articles/016-3.pdf
http://disputatio.com/articles/016-4.pdf
http://disputatio.com/articles/016-5.pdf
http://disputatio.com/footer.js
http://disputatio.com/header.js
http://disputatio.com/images/lockss-logo-small.gif
http://disputatio.com/style.css

Excluded Urls: (1 new, 0 old)
http://lockss.stanford.edu/


Summary for starting Url: http://disputatio.com/lockss2004.html and depth: 1

Urls fetched: 1    Urls extracted: 10

Depth  Fetched  Parsed  New URLs
    1        1       1         9

Remaining unfetched: 9

Elapsed Time: 0 secs.    Fetch Rate: 0 p/m
  • We have correctly included the articles and the other files.
  • Disputatio publishes two issues per year, so checking every three weeks is excessive. Set the New Content Crawl Interval to 13 weeks.
  • Finally, save your plugin as an XML file using Save or Save As. Submit the resulting file to the LOCKSS Team by email for testing and inclusion in the distribution.
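
The template and crawl-rule steps above can be sketched in a few lines of Python. This is only an illustration of the idea (printf-style substitution followed by regular expression matching), not the tool's actual implementation: the exact strings the tool stores, the helper name literal, and the choice to escape base_url before substitution are all assumptions made for this sketch.

```python
import re

# AU configuration parameters from the walkthrough.
base_url = "http://disputatio.com/"
year = 2004

# printf-style templates (the exact stored form is an assumption).
start_url = "%slockss%d.html" % (base_url, year)
au_name = "Disputatio %d (%s)" % (year, base_url)

def literal(text):
    """Model of a String Literal: escape regex meta-characters such as '.'."""
    return re.escape(text)

ANYTHING = ".*"  # what the Anything match inserts

# Include-rule templates; %s stands for base_url, as in the pattern editor.
include_templates = [
    "%s" + literal("articles/") + ANYTHING + literal(".pdf"),
    "%s" + ANYTHING + literal(".js"),
    "%s" + ANYTHING + literal(".css"),
    "%s" + literal("images/") + ANYTHING + literal(".gif"),
]
# Substitute the (escaped) parameter to obtain the final regex patterns.
include_patterns = [t % literal(base_url) for t in include_templates]
# The pre-specified rule that includes the publisher manifest page.
include_patterns.append("^" + literal(start_url) + "$")

def is_included(url):
    # First pre-specified rule: exclude anything not starting with base_url.
    if not url.startswith(base_url):
        return False
    return any(re.search(p, url) for p in include_patterns)

# The ten URLs extracted from the manifest page in the test run above.
extracted = [
    "http://disputatio.com/articles/016-1.pdf",
    "http://disputatio.com/articles/016-2.pdf",
    "http://disputatio.com/articles/016-3.pdf",
    "http://disputatio.com/articles/016-4.pdf",
    "http://disputatio.com/articles/016-5.pdf",
    "http://disputatio.com/footer.js",
    "http://disputatio.com/header.js",
    "http://disputatio.com/images/lockss-logo-small.gif",
    "http://disputatio.com/style.css",
    "http://lockss.stanford.edu/",
]
included = [u for u in extracted if is_included(u)]
excluded = [u for u in extracted if not is_included(u)]
print(start_url)
print(len(included), "included,", len(excluded), "excluded")
```

Because base_url carries a trailing slash, the AU name produced by this sketch also ends in a slash inside the parentheses.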

Improving the Design of the Example Plugin

Although the plugin we just defined includes and excludes the correct files, it does so by explicitly specifying the files on the publisher's site to include. In most cases it will be better to explicitly specify the files to exclude, and include everything else. This approach is more likely to keep working as the publisher tweaks their site. Which files on the Disputatio site do we know should be excluded? The only ones we are sure should be excluded are the "home page" (we know this will change with every new issue) and the publisher manifest pages for other AUs (e.g. the 2004 AU should not include the 2003 publisher manifest page). A better plugin design would have rules that:

  • Exclude everything not starting with the base_url http://disputatio.com/ (the first default rule).
  • Include the publisher manifest page for this AU (the second default rule).
  • Exclude the publisher manifest pages from other AUs, in our case everything matching the base_url followed by the string literal lockss followed by a number followed by .html. Although this will match the publisher manifest page for this AU, that page has already been included so this exclusion rule will not affect it.
  • Exclude the home page under both its aliases. First as http://disputatio.com/, i.e. a pattern starting with base_url followed by the end of the string. Second as http://disputatio.com/index.html, i.e. a pattern starting with base_url followed by the string literal index.html.
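
The reason the ordering in this list matters can be sketched as a small first-match-wins evaluator. This is a simplified model, not the LOCKSS daemon's code: the action names, the decide function, and the use of [0-9]+ for "a number" are assumptions of the sketch. The rule that a URL matching no rule is excluded is the tool's default behaviour.

```python
import re

BASE = re.escape("http://disputatio.com/")
YEAR = 2004

# Ordered rules as (action, pattern); action names are hypothetical.
RULES = [
    ("exclude_if_no_match", "^" + BASE),                   # first default rule
    ("include", "^" + BASE + r"lockss%d\.html$" % YEAR),   # this AU's manifest
    ("exclude", "^" + BASE + r"lockss[0-9]+\.html$"),      # other AUs' manifests
    ("exclude", "^" + BASE + "$"),                         # home page as "/"
    ("exclude", "^" + BASE + r"index\.html$"),             # home page as index.html
]

def decide(url, rules=RULES):
    """Apply the rules in order; the first rule that fires decides."""
    for action, pattern in rules:
        matched = re.search(pattern, url) is not None
        if action == "exclude_if_no_match":
            if not matched:
                return "exclude"
        elif matched:
            return action
    return "exclude"  # default when no rule matches

print(decide("http://disputatio.com/lockss2004.html"))  # this AU's manifest
print(decide("http://disputatio.com/lockss2003.html"))  # another AU's manifest

# Swapping the manifest include and the manifest exclusion shows why order
# matters: the broader exclusion would now catch this AU's manifest too.
SWAPPED = [RULES[0], RULES[2], RULES[1]] + RULES[3:]
print(decide("http://disputatio.com/lockss2004.html", SWAPPED))
```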

Restart the plugin tool and fill in the fields as before until you get to the Crawl Rules. As before, when you click on the row of dots by Crawl Rules, the window that pops up will have the first two rules pre-defined. Now add the extra exclusion rules outlined above:

  • Exclude the publisher manifest pages for other AUs by clicking Add, pulling down the menu from Include and choosing Exclude. Click on NONE and choose:
    • Start, Insert Match
    • Base URL, Insert Parameter
    • String Literal, Insert Match, lockss, OK
    • Any Number, Insert Match
    • String Literal, Insert Match, .html, OK
    • End, Insert Match
    • Save
  • Exclude the home page under its http://disputatio.com/ alias by clicking Add, pulling down the menu from Include and choosing Exclude. Click on NONE and choose:
    • Start, Insert Match
    • Base URL, Insert Parameter
    • End, Insert Match
    • Save
  • Exclude the home page under its http://disputatio.com/index.html alias by clicking Add, pulling down the menu from Include and choosing Exclude. Click on NONE and choose:
    • Start, Insert Match
    • Base URL, Insert Parameter
    • String Literal, Insert Match, index.html, OK
    • End, Insert Match
    • Save
  • The Crawl Rule window will look like this. Note the backslashes before the periods in the .html string literals in the patterns -- the tool automatically inserts these to prevent the period being interpreted as a regular expression meta-character.
  • The result of testing this set of rules is:
Checking http://disputatio.com/lockss2004.html
crawl depth: 1     crawl delay: 6000 ms.

Depth 1

Fetching http://disputatio.com/lockss2004.html
Type: text/html, extracting Urls

Excluded Urls: (10 new, 0 old)
http://disputatio.com/articles/016-1.pdf
http://disputatio.com/articles/016-2.pdf
http://disputatio.com/articles/016-3.pdf
http://disputatio.com/articles/016-4.pdf
http://disputatio.com/articles/016-5.pdf
http://disputatio.com/footer.js
http://disputatio.com/header.js
http://disputatio.com/images/lockss-logo-small.gif
http://disputatio.com/style.css
http://lockss.stanford.edu/


Summary for starting Url: http://disputatio.com/lockss2004.html and depth: 1

Urls fetched: 1    Urls extracted: 10

Depth  Fetched  Parsed  New URLs
    1        1       1         0

Remaining unfetched: 0

Elapsed Time: 0 secs.    Fetch Rate: 0 p/m
  • Only the publisher manifest page is included, because the default action if no rules are matched is to exclude the URL. We need to add a rule to include everything that isn't being explicitly excluded. Click Add.
  • Click on NONE and choose:
    • Start, Insert Match
    • Base URL, Insert Parameter
    • Anything, Insert Match
  • Now the Crawl Rule window will look like this. You may need to expand the window to see it all.
  • The result of testing this set of crawl rules is:
Checking http://disputatio.com/lockss2004.html
crawl depth: 1     crawl delay: 6000 ms.

Depth 1

Fetching http://disputatio.com/lockss2004.html
Type: text/html, extracting Urls

Included Urls: (9 new, 0 old)
http://disputatio.com/articles/016-1.pdf
http://disputatio.com/articles/016-2.pdf
http://disputatio.com/articles/016-3.pdf
http://disputatio.com/articles/016-4.pdf
http://disputatio.com/articles/016-5.pdf
http://disputatio.com/footer.js
http://disputatio.com/header.js
http://disputatio.com/images/lockss-logo-small.gif
http://disputatio.com/style.css

Excluded Urls: (1 new, 0 old)
http://lockss.stanford.edu/


Summary for starting Url: http://disputatio.com/lockss2004.html and depth: 1

Urls fetched: 1    Urls extracted: 10

Depth  Fetched  Parsed  New URLs
    1        1       1         9

Remaining unfetched: 9

Elapsed Time: 0 secs.    Fetch Rate: 0 p/m
  • This is correct, and less likely than the original example to break as the publisher changes their site.
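
To make the robustness claim concrete, the following sketch compares the two designs on a file type the original rules never anticipated. It is a simplified model under stated assumptions: the decide function and action names are hypothetical, the patterns approximate what the tool generates, and the cover.png URL is invented purely for illustration.

```python
import re

BASE = re.escape("http://disputatio.com/")

def decide(url, rules):
    """First rule that fires decides; a URL matching no rule is excluded."""
    for action, pattern in rules:
        matched = re.search(pattern, url) is not None
        if action == "exclude_if_no_match":
            if not matched:
                return "exclude"
        elif matched:
            return action
    return "exclude"

# The two pre-specified rules shared by both designs.
COMMON = [
    ("exclude_if_no_match", "^" + BASE),
    ("include", "^" + BASE + r"lockss2004\.html$"),
]
# Original design: list what to include.
INCLUDE_DESIGN = COMMON + [
    ("include", BASE + r"articles/.*\.pdf"),
    ("include", BASE + r".*\.js"),
    ("include", BASE + r".*\.css"),
    ("include", BASE + r"images/.*\.gif"),
]
# Improved design: list what to exclude, then include everything else.
EXCLUDE_DESIGN = COMMON + [
    ("exclude", "^" + BASE + r"lockss[0-9]+\.html$"),
    ("exclude", "^" + BASE + "$"),
    ("exclude", "^" + BASE + r"index\.html$"),
    ("include", "^" + BASE + ".*"),
]

article = "http://disputatio.com/articles/016-3.pdf"
new_asset = "http://disputatio.com/images/cover.png"  # hypothetical URL

for name, rules in (("include-based", INCLUDE_DESIGN),
                    ("exclude-based", EXCLUDE_DESIGN)):
    print(name, decide(article, rules), decide(new_asset, rules))
```

Both designs agree on the URLs seen in the test runs; they differ only when the publisher introduces something new, which the include-based rules silently drop.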

A More Complex Example Journal

Warning
The Early Modern Literary Studies website is in the process of changing to a new server (extra.shu.ac.uk instead of www.shu.ac.uk). The instructions below are valid in principle, if you replace references to the old server by references to the new server. However, not all links have been fixed yet so the crawls will not succeed as described below. This section will be updated when the EMLS website is fixed to reflect their new URLs.

We will use Early Modern Literary Studies (EMLS) as an example of a somewhat more complex journal and thus a somewhat more complex plugin. Its home page, which is also the base_url, is at http://www.shu.ac.uk/emls/. The publisher manifest page for Volume 9 is at http://www.shu.ac.uk/emls/lockss-volume9.html. EMLS normally publishes every 4 months but it also publishes "special editions" on an irregular schedule. Each issue has a table of contents page, for example http://www.shu.ac.uk/emls/09-3/09-3toc.htm. The articles in the issue are at URLs like http://www.shu.ac.uk/emls/09-3/finntabl.htm, i.e. the directory (folder) emls/09-3 contains the pages for volume 9 number 3. We will pretend that this plugin is being developed at Sheffield Hallam University (http://www.shu.ac.uk).
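
The URL layout just described uses the volume number in two different forms: unpadded in the manifest page name (lockss-volume9.html) and zero-padded to two digits in the issue directories (09-3). A minimal sketch of that difference in printf terms follows; the variable names and the use of [0-9]+ for the issue number are assumptions of the sketch, and nothing is fetched from the site.

```python
import re

base_url = "http://www.shu.ac.uk/emls/"
volume = 9

# Manifest page: the volume is inserted without padding (%d).
manifest_url = "%slockss-volume%d.html" % (base_url, volume)

# Issue directories: the volume is zero-padded to a width of 2 (%02d),
# followed by "-", an issue number, "/", and anything up to the end.
issue_pattern = re.compile(
    "^%s%02d-[0-9]+/.*$" % (re.escape(base_url), volume)
)

samples = [
    "http://www.shu.ac.uk/emls/09-3/09-3toc.htm",    # issue table of contents
    "http://www.shu.ac.uk/emls/09-3/finntabl.htm",   # article
    "http://www.shu.ac.uk/emls/si-13/si-13toc.htm",  # special issue: no match
    "http://www.shu.ac.uk/emls/emlshome.html",       # home page: no match
]
print(manifest_url)
for url in samples:
    print(bool(issue_pattern.search(url)), url)
```

Special issues live under a different directory naming scheme, so a pattern like this one does not cover them on its own.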

  • Start the tool. Set the Plugin Name to Early Modern Literary Studies Plugin and the Plugin ID to uk.ac.shu.plugin.EMLSPlugin.
  • Pop up the Configuration Parameters dialog. Add volume to the set of configuration parameters; we need it because it is part of the publisher manifest page URL.
  • Pop up the Start URL dialog. Insert Base URL, type lockss-volume, insert Volume, type .html.
  • When you insert Volume you will be asked to set the padding. In this case we need "do not pad" to get, in our case, just the 9.
  • Pop up the AU Name Template dialog. Type Early Modern Literary Studies Volume (with a space at the end), then insert Volume and set the padding to "do not pad", then type ( (with a space just before the parenthesis), then insert Base URL, then type ). The AU name will look like Early Modern Literary Studies Volume 9 (http://www.shu.ac.uk/emls/).
  • Pop up the Crawl Rules dialog. Include everything that matches Start followed by Base URL followed by Volume padded with zeros to a field width of 2 followed by a literal string of - followed by Any Number followed by a string literal of / followed by Anything followed by End.
  • Now test this crawl rule. Configure the AU with the Base URL http://www.shu.ac.uk/emls/ and the Volume 9. Start with the default depth of 1.
  • The results of this test are as follows:
Checking http://www.shu.ac.uk/emls/lockss-volume9.html
crawl depth: 1     crawl delay: 6000 ms.

Depth 1

Fetching http://www.shu.ac.uk/emls/lockss-volume9.html
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (3 new, 0 old)
http://www.shu.ac.uk/emls/09-1/09-1toc.htm
http://www.shu.ac.uk/emls/09-2/09-2toc.htm
http://www.shu.ac.uk/emls/09-3/09-3toc.htm

Excluded Urls: (6 new, 0 old)
http://www.shu.ac.uk/emls/LOCKSS.logo.1%27.gif
http://www.shu.ac.uk/emls/button.gif
http://www.shu.ac.uk/emls/emlscopy.html
http://www.shu.ac.uk/emls/emlshome.html
http://www.shu.ac.uk/emls/header2.gif
http://www.shu.ac.uk/emls/si-13/si-13toc.htm


Summary for starting Url: http://www.shu.ac.uk/emls/lockss-volume9.html and depth: 1

Urls fetched: 1    Urls extracted: 9

Depth  Fetched  Parsed  New URLs
    1        1       1         3

Remaining unfetched: 3

Elapsed Time: 1 secs.    Fetch Rate: 54 p/m
  • Note that the rules have included the table of contents pages for the three normal issues of Volume 9, but not the table of contents page for the special issue 13. They have also excluded some decorations that should be collected. This test did not collect any articles, but we're not yet sure whether the rules are to blame. The articles are linked from a page linked from the publisher manifest page, so they will only be collected at depth 2 or more.
  • Pop up the Crawl Rules dialog and add a rule that includes the special issue's table of contents by including anything that matches Start followed by base_url followed by the string literal si- followed by Any Number followed by the string literal / followed by Anything followed by End.
  • Now test these crawl rules. Configure the AU with the base_url http://www.shu.ac.uk/emls/ and the volume 9. Set the depth to 2. The test will take a while as it waits the 6 seconds between fetching each page.
  • Here are the results:
Checking http://www.shu.ac.uk/emls/lockss-volume9.html
crawl depth: 2     crawl delay: 6000 ms.

Depth 1

Fetching http://www.shu.ac.uk/emls/lockss-volume9.html
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (4 new, 0 old)
http://www.shu.ac.uk/emls/09-1/09-1toc.htm
http://www.shu.ac.uk/emls/09-2/09-2toc.htm
http://www.shu.ac.uk/emls/09-3/09-3toc.htm
http://www.shu.ac.uk/emls/si-13/si-13toc.htm

Excluded Urls: (5 new, 0 old)
http://www.shu.ac.uk/emls/LOCKSS.logo.1%27.gif
http://www.shu.ac.uk/emls/button.gif
http://www.shu.ac.uk/emls/emlscopy.html
http://www.shu.ac.uk/emls/emlshome.html
http://www.shu.ac.uk/emls/header2.gif

Depth 2

Fetching http://www.shu.ac.uk/emls/09-1/09-1toc.htm
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (27 new, 0 old)
http://www.shu.ac.uk/emls/09-1/abstracts.htm
http://www.shu.ac.uk/emls/09-1/aunerev.html
http://www.shu.ac.uk/emls/09-1/bestrev.html
http://www.shu.ac.uk/emls/09-1/button.gif
http://www.shu.ac.uk/emls/09-1/clarkrev.html
http://www.shu.ac.uk/emls/09-1/coolarev.html
http://www.shu.ac.uk/emls/09-1/coriorev.html
http://www.shu.ac.uk/emls/09-1/daemsrev.html
http://www.shu.ac.uk/emls/09-1/edwamaps.html
http://www.shu.ac.uk/emls/09-1/edwarev.html
http://www.shu.ac.uk/emls/09-1/emlscopy.html
http://www.shu.ac.uk/emls/09-1/fitzprev.html
http://www.shu.ac.uk/emls/09-1/grosmyre.html
http://www.shu.ac.uk/emls/09-1/hamlcary.html
http://www.shu.ac.uk/emls/09-1/header2.gif
http://www.shu.ac.uk/emls/09-1/ivicrev.html
http://www.shu.ac.uk/emls/09-1/jonesrev.html
http://www.shu.ac.uk/emls/09-1/leahmulc.html
http://www.shu.ac.uk/emls/09-1/mcraerev.html
http://www.shu.ac.uk/emls/09-1/murphrev.html
http://www.shu.ac.uk/emls/09-1/parkscot.html
http://www.shu.ac.uk/emls/09-1/revbook.htm
http://www.shu.ac.uk/emls/09-1/revenrev.html
http://www.shu.ac.uk/emls/09-1/ristdead.html
http://www.shu.ac.uk/emls/09-1/smythrev.html
http://www.shu.ac.uk/emls/09-1/wagnblaz.htm
http://www.shu.ac.uk/emls/09-1/yimtemp.html

Excluded Urls: (2 new, 0 old)
http://purl.oclc.org/emls/emlshome.html
http://www.shu.ac.uk/emls/emlsmast.html

Fetching http://www.shu.ac.uk/emls/09-2/09-2toc.htm
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (22 new, 0 old)
http://www.shu.ac.uk/emls/09-2/abstracts.htm
http://www.shu.ac.uk/emls/09-2/ardonort.html
http://www.shu.ac.uk/emls/09-2/ayli1rev.html
http://www.shu.ac.uk/emls/09-2/ayli2rev.html
http://www.shu.ac.uk/emls/09-2/button.gif
http://www.shu.ac.uk/emls/09-2/ediirev.html
http://www.shu.ac.uk/emls/09-2/eganwood.html
http://www.shu.ac.uk/emls/09-2/emlscopy.html
http://www.shu.ac.uk/emls/09-2/fishball.html
http://www.shu.ac.uk/emls/09-2/gromyrev.html
http://www.shu.ac.uk/emls/09-2/header2.gif
http://www.shu.ac.uk/emls/09-2/henvrev.html
http://www.shu.ac.uk/emls/09-2/kaysep.html
http://www.shu.ac.uk/emls/09-2/kidthom.html
http://www.shu.ac.uk/emls/09-2/michmonu.html
http://www.shu.ac.uk/emls/09-2/mitsfood.html
http://www.shu.ac.uk/emls/09-2/nevipost.html
http://www.shu.ac.uk/emls/09-2/revbook.htm
http://www.shu.ac.uk/emls/09-2/richthir.html
http://www.shu.ac.uk/emls/09-2/roseroch.html
http://www.shu.ac.uk/emls/09-2/stegbest.html
http://www.shu.ac.uk/emls/09-2/tamerrev.html

Excluded Urls: (0 new, 2 old)

Fetching http://www.shu.ac.uk/emls/09-3/09-3toc.htm
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (19 new, 0 old)
http://www.shu.ac.uk/emls/09-3/abstract2.htm
http://www.shu.ac.uk/emls/09-3/bestintr.htm
http://www.shu.ac.uk/emls/09-3/finntabl.htm
http://www.shu.ac.uk/emls/09-3/forsplay.html
http://www.shu.ac.uk/emls/09-3/galedizz.htm
http://www.shu.ac.uk/emls/09-3/hopewhit.htm
http://www.shu.ac.uk/emls/09-3/massrede.htm
http://www.shu.ac.uk/emls/09-3/measrev.html
http://www.shu.ac.uk/emls/09-3/myerrev.html
http://www.shu.ac.uk/emls/09-3/rasmgild.htm
http://www.shu.ac.uk/emls/09-3/revandr.htm
http://www.shu.ac.uk/emls/09-3/revbook.htm
http://www.shu.ac.uk/emls/09-3/revclar.htm
http://www.shu.ac.uk/emls/09-3/revcurra.htm
http://www.shu.ac.uk/emls/09-3/revosto.htm
http://www.shu.ac.uk/emls/09-3/revroth.htm
http://www.shu.ac.uk/emls/09-3/revspil.htm
http://www.shu.ac.uk/emls/09-3/revtarg.htm
http://www.shu.ac.uk/emls/09-3/shrewrev.html

Excluded Urls: (2 new, 3 old)
http://www.shu.ac.uk/emls/07-2/button.gif
http://www.shu.ac.uk/emls/07-2/header2.gif

Fetching http://www.shu.ac.uk/emls/si-13/si-13toc.htm
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (6 new, 3 old)
http://www.shu.ac.uk/emls/si-13/billing/index.htm
http://www.shu.ac.uk/emls/si-13/carson/index.htm
http://www.shu.ac.uk/emls/si-13/egan/index.htm
http://www.shu.ac.uk/emls/si-13/fitzpatrick/index.htm
http://www.shu.ac.uk/emls/si-13/intro.htm
http://www.shu.ac.uk/emls/si-13/technical-note.htm

Excluded Urls: (0 new, 2 old)


Summary for starting Url: http://www.shu.ac.uk/emls/lockss-volume9.html and depth: 2

Urls fetched: 5    Urls extracted: 89

Depth  Fetched  Parsed  New URLs
    1        1       1         4
    2        4       4        74

Remaining unfetched: 74

Elapsed Time: 24 secs.    Fetch Rate: 12 p/m
  • Note that we are still excluding some .gif files that should be included. Pop up the Crawl Rules dialog and add a rule that matches Start followed by base_url followed by Anything followed by the string literal .gif followed by End.
  • Now test these crawl rules again. Configure the AU with the base_url http://www.shu.ac.uk/emls/ and the volume 9. Set the depth to 3. The test will take 5 minutes or so as it waits the 6 seconds between fetching each page.
  • These are the results:
Checking http://www.shu.ac.uk/emls/lockss-volume9.html
crawl depth: 3     crawl delay: 6000 ms.

Depth 1

Fetching http://www.shu.ac.uk/emls/lockss-volume9.html
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (7 new, 0 old)
http://www.shu.ac.uk/emls/09-1/09-1toc.htm
http://www.shu.ac.uk/emls/09-2/09-2toc.htm
http://www.shu.ac.uk/emls/09-3/09-3toc.htm
http://www.shu.ac.uk/emls/LOCKSS.logo.1%27.gif
http://www.shu.ac.uk/emls/button.gif
http://www.shu.ac.uk/emls/header2.gif
http://www.shu.ac.uk/emls/si-13/si-13toc.htm

Excluded Urls: (2 new, 0 old)
http://www.shu.ac.uk/emls/emlscopy.html
http://www.shu.ac.uk/emls/emlshome.html

Depth 2

Fetching http://www.shu.ac.uk/emls/09-1/09-1toc.htm
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (27 new, 0 old)
http://www.shu.ac.uk/emls/09-1/abstracts.htm
http://www.shu.ac.uk/emls/09-1/aunerev.html
http://www.shu.ac.uk/emls/09-1/bestrev.html
http://www.shu.ac.uk/emls/09-1/button.gif
http://www.shu.ac.uk/emls/09-1/clarkrev.html
http://www.shu.ac.uk/emls/09-1/coolarev.html
http://www.shu.ac.uk/emls/09-1/coriorev.html
http://www.shu.ac.uk/emls/09-1/daemsrev.html
http://www.shu.ac.uk/emls/09-1/edwamaps.html
http://www.shu.ac.uk/emls/09-1/edwarev.html
http://www.shu.ac.uk/emls/09-1/emlscopy.html
http://www.shu.ac.uk/emls/09-1/fitzprev.html
http://www.shu.ac.uk/emls/09-1/grosmyre.html
http://www.shu.ac.uk/emls/09-1/hamlcary.html
http://www.shu.ac.uk/emls/09-1/header2.gif
http://www.shu.ac.uk/emls/09-1/ivicrev.html
http://www.shu.ac.uk/emls/09-1/jonesrev.html
http://www.shu.ac.uk/emls/09-1/leahmulc.html
http://www.shu.ac.uk/emls/09-1/mcraerev.html
http://www.shu.ac.uk/emls/09-1/murphrev.html
http://www.shu.ac.uk/emls/09-1/parkscot.html
http://www.shu.ac.uk/emls/09-1/revbook.htm
http://www.shu.ac.uk/emls/09-1/revenrev.html
http://www.shu.ac.uk/emls/09-1/ristdead.html
http://www.shu.ac.uk/emls/09-1/smythrev.html
http://www.shu.ac.uk/emls/09-1/wagnblaz.htm
http://www.shu.ac.uk/emls/09-1/yimtemp.html

Excluded Urls: (2 new, 0 old)
http://purl.oclc.org/emls/emlshome.html
http://www.shu.ac.uk/emls/emlsmast.html

Fetching http://www.shu.ac.uk/emls/09-2/09-2toc.htm
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (22 new, 0 old)
http://www.shu.ac.uk/emls/09-2/abstracts.htm
http://www.shu.ac.uk/emls/09-2/ardonort.html
http://www.shu.ac.uk/emls/09-2/ayli1rev.html
http://www.shu.ac.uk/emls/09-2/ayli2rev.html
http://www.shu.ac.uk/emls/09-2/button.gif
http://www.shu.ac.uk/emls/09-2/ediirev.html
http://www.shu.ac.uk/emls/09-2/eganwood.html
http://www.shu.ac.uk/emls/09-2/emlscopy.html
http://www.shu.ac.uk/emls/09-2/fishball.html
http://www.shu.ac.uk/emls/09-2/gromyrev.html
http://www.shu.ac.uk/emls/09-2/header2.gif
http://www.shu.ac.uk/emls/09-2/henvrev.html
http://www.shu.ac.uk/emls/09-2/kaysep.html
http://www.shu.ac.uk/emls/09-2/kidthom.html
http://www.shu.ac.uk/emls/09-2/michmonu.html
http://www.shu.ac.uk/emls/09-2/mitsfood.html
http://www.shu.ac.uk/emls/09-2/nevipost.html
http://www.shu.ac.uk/emls/09-2/revbook.htm
http://www.shu.ac.uk/emls/09-2/richthir.html
http://www.shu.ac.uk/emls/09-2/roseroch.html
http://www.shu.ac.uk/emls/09-2/stegbest.html
http://www.shu.ac.uk/emls/09-2/tamerrev.html

Excluded Urls: (0 new, 2 old)

Fetching http://www.shu.ac.uk/emls/09-3/09-3toc.htm
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (21 new, 0 old)
http://www.shu.ac.uk/emls/07-2/button.gif
http://www.shu.ac.uk/emls/07-2/header2.gif
http://www.shu.ac.uk/emls/09-3/abstract2.htm
http://www.shu.ac.uk/emls/09-3/bestintr.htm
http://www.shu.ac.uk/emls/09-3/finntabl.htm
http://www.shu.ac.uk/emls/09-3/forsplay.html
http://www.shu.ac.uk/emls/09-3/galedizz.htm
http://www.shu.ac.uk/emls/09-3/hopewhit.htm
http://www.shu.ac.uk/emls/09-3/massrede.htm
http://www.shu.ac.uk/emls/09-3/measrev.html
http://www.shu.ac.uk/emls/09-3/myerrev.html
http://www.shu.ac.uk/emls/09-3/rasmgild.htm
http://www.shu.ac.uk/emls/09-3/revandr.htm
http://www.shu.ac.uk/emls/09-3/revbook.htm
http://www.shu.ac.uk/emls/09-3/revclar.htm
http://www.shu.ac.uk/emls/09-3/revcurra.htm
http://www.shu.ac.uk/emls/09-3/revosto.htm
http://www.shu.ac.uk/emls/09-3/revroth.htm
http://www.shu.ac.uk/emls/09-3/revspil.htm
http://www.shu.ac.uk/emls/09-3/revtarg.htm
http://www.shu.ac.uk/emls/09-3/shrewrev.html

Excluded Urls: (0 new, 3 old)

Fetching http://www.shu.ac.uk/emls/LOCKSS.logo.1%27.gif
Type: image/gif, not parsing

Fetching http://www.shu.ac.uk/emls/button.gif
Type: image/gif, not parsing

Fetching http://www.shu.ac.uk/emls/header2.gif
Type: image/gif, not parsing

Fetching http://www.shu.ac.uk/emls/si-13/si-13toc.htm
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (6 new, 3 old)
http://www.shu.ac.uk/emls/si-13/billing/index.htm
http://www.shu.ac.uk/emls/si-13/carson/index.htm
http://www.shu.ac.uk/emls/si-13/egan/index.htm
http://www.shu.ac.uk/emls/si-13/fitzpatrick/index.htm
http://www.shu.ac.uk/emls/si-13/intro.htm
http://www.shu.ac.uk/emls/si-13/technical-note.htm

Excluded Urls: (0 new, 2 old)

Depth 3

Fetching http://www.shu.ac.uk/emls/07-2/button.gif
Type: image/gif, not parsing

Fetching http://www.shu.ac.uk/emls/07-2/header2.gif
Type: image/gif, not parsing

Fetching http://www.shu.ac.uk/emls/09-1/abstracts.htm
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 4 old)

Excluded Urls: (0 new, 1 old)

Fetching http://www.shu.ac.uk/emls/09-1/aunerev.html
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 3 old)

Excluded Urls: (1 new, 1 old)
http://purl.oclc.org/emls/09-1/aunerev.html

Fetching http://www.shu.ac.uk/emls/09-1/bestrev.html
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 4 old)

Excluded Urls: (4 new, 1 old)
http://purl.oclc.org/emls/03-3/bestshak.html
http://purl.oclc.org/emls/09-1/bestrev.html
http://www.arts.uwa.edu.au/MotsPluriels/MP1901pf.html
http://www.shu.ac.uk/emls/03-3/bestshak.html

Fetching http://www.shu.ac.uk/emls/09-1/button.gif
Type: image/gif, not parsing

Fetching http://www.shu.ac.uk/emls/09-1/clarkrev.html
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 4 old)

Excluded Urls: (1 new, 1 old)
http://purl.oclc.org/emls/09-1/clarkrev.html

Fetching http://www.shu.ac.uk/emls/09-1/coolarev.html
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 3 old)

Excluded Urls: (1 new, 1 old)
http://purl.oclc.org/emls/09-1/coolarev.html

Fetching http://www.shu.ac.uk/emls/09-1/coriorev.html
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 3 old)

Excluded Urls: (1 new, 1 old)
http://purl.oclc.org/emls/09-1/coriorev.html

Fetching http://www.shu.ac.uk/emls/09-1/daemsrev.html
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 3 old)

Excluded Urls: (1 new, 1 old)
http://purl.oclc.org/emls/09-1/daemsrev.html

Fetching http://www.shu.ac.uk/emls/09-1/edwamaps.html
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 4 old)

Excluded Urls: (1 new, 1 old)
http://purl.oclc.org/emls/09-1/edwamaps.html

Fetching http://www.shu.ac.uk/emls/09-1/edwarev.html
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 3 old)

Excluded Urls: (2 new, 1 old)
http://purl.oclc.org/emls/09-1/edwarev.html
http://www.shu.ac.uk/emls/04-2/mcraonth.htm

Fetching http://www.shu.ac.uk/emls/09-1/emlscopy.html
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 2 old)

Excluded Urls: (0 new, 1 old)

Fetching http://www.shu.ac.uk/emls/09-1/fitzprev.html
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 3 old)

Excluded Urls: (1 new, 1 old)
http://purl.oclc.org/emls/09-1/fitzprev.html

Fetching http://www.shu.ac.uk/emls/09-1/grosmyre.html
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 3 old)

Excluded Urls: (1 new, 1 old)
http://purl.oclc.org/emls/09-1/grosmyre.html

Fetching http://www.shu.ac.uk/emls/09-1/hamlcary.html
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 4 old)

Excluded Urls: (1 new, 1 old)
http://purl.oclc.org/emls/09-1/hamlcary.html

Fetching http://www.shu.ac.uk/emls/09-1/header2.gif
Type: image/gif, not parsing

Fetching http://www.shu.ac.uk/emls/09-1/ivicrev.html
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 3 old)

Excluded Urls: (1 new, 1 old)
http://purl.oclc.org/emls/09-1/ivicrev.html

Fetching http://www.shu.ac.uk/emls/09-1/jonesrev.html
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 3 old)

Excluded Urls: (1 new, 1 old)
http://purl.oclc.org/emls/09-1/jonesrev.html

Fetching http://www.shu.ac.uk/emls/09-1/leahmulc.html
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 3 old)

Excluded Urls: (1 new, 1 old)
http://purl.oclc.org/emls/09-1/leahmulc.html

Fetching http://www.shu.ac.uk/emls/09-1/mcraerev.html
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 3 old)

Excluded Urls: (1 new, 1 old)
http://purl.oclc.org/emls/09-1/mcraerev.html

Fetching http://www.shu.ac.uk/emls/09-1/murphrev.html
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 3 old)

Excluded Urls: (1 new, 1 old)
http://purl.oclc.org/emls/09-1/murphrev.html

Fetching http://www.shu.ac.uk/emls/09-1/parkscot.html
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 3 old)

Excluded Urls: (1 new, 1 old)
http://purl.oclc.org/emls/09-1/parkscot.html

Fetching http://www.shu.ac.uk/emls/09-1/revbook.htm
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 4 old)

Excluded Urls: (1 new, 0 old)
file:/D%7c/EMLS/emlshome.html

Fetching http://www.shu.ac.uk/emls/09-1/revenrev.html
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 3 old)

Excluded Urls: (2 new, 1 old)
http://purl.oclc.org/emls/09-1/revenrev.html
http://www.revengerstragedy.com/

Fetching http://www.shu.ac.uk/emls/09-1/ristdead.html
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 4 old)

Excluded Urls: (1 new, 1 old)
http://purl.oclc.org/emls/09-1/ristdead.html

Fetching http://www.shu.ac.uk/emls/09-1/smythrev.html
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 3 old)

Excluded Urls: (1 new, 1 old)
http://purl.oclc.org/emls/09-1/smythrev.html

Fetching http://www.shu.ac.uk/emls/09-1/wagnblaz.htm
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 4 old)

Excluded Urls: (1 new, 1 old)
http://purl.oclc.org/emls/09-1/wagnblaz.htm

Fetching http://www.shu.ac.uk/emls/09-1/yimtemp.html
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 4 old)

Excluded Urls: (1 new, 1 old)
http://purl.oclc.org/emls/09-1/yimtemp.html

Fetching http://www.shu.ac.uk/emls/09-2/abstracts.htm
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 4 old)

Excluded Urls: (0 new, 1 old)

Fetching http://www.shu.ac.uk/emls/09-2/ardonort.html
Type: text/html; charset=ISO-8859-1, extracting Urls

Excluded Urls: (4 new, 1 old)
file:/D%7c/EMLS/09-2/button.gif
file:/D%7c/EMLS/09-2/emlscopy.html
file:/D%7c/EMLS/09-2/header2.gif
http://purl.oclc.org/emls/09-2/ardonort.html

Fetching http://www.shu.ac.uk/emls/09-2/ayli1rev.html
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 3 old)

Excluded Urls: (1 new, 1 old)
http://purl.oclc.org/emls/09-2/ayli1rev.html

Fetching http://www.shu.ac.uk/emls/09-2/ayli2rev.html
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 3 old)

Excluded Urls: (1 new, 1 old)
http://purl.oclc.org/emls/09-2/ayli2rev.html

Fetching http://www.shu.ac.uk/emls/09-2/button.gif
Type: image/gif, not parsing

Fetching http://www.shu.ac.uk/emls/09-2/ediirev.html
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 3 old)

Excluded Urls: (1 new, 1 old)
http://purl.oclc.org/emls/09-2/ediirev.html

Fetching http://www.shu.ac.uk/emls/09-2/eganwood.html
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 2 old)

Excluded Urls: (1 new, 2 old)
http://purl.oclc.org/emls/09-2/stegbest.html

Fetching http://www.shu.ac.uk/emls/09-2/emlscopy.html
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 2 old)

Excluded Urls: (0 new, 1 old)

Fetching http://www.shu.ac.uk/emls/09-2/fishball.html
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 4 old)

Excluded Urls: (2 new, 1 old)
http://purl.oclc.org/emls/02-3/razoball.html
http://purl.oclc.org/emls/09-2/fishball.html

Fetching http://www.shu.ac.uk/emls/09-2/gromyrev.html
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 3 old)

Excluded Urls: (1 new, 1 old)
http://purl.oclc.org/emls/09-2/gromyrev.html

Fetching http://www.shu.ac.uk/emls/09-2/header2.gif
Type: image/gif, not parsing

Fetching http://www.shu.ac.uk/emls/09-2/henvrev.html
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 3 old)

Excluded Urls: (1 new, 1 old)
http://purl.oclc.org/emls/09-2/henvrev.html

Fetching http://www.shu.ac.uk/emls/09-2/kaysep.html
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 4 old)

Excluded Urls: (1 new, 1 old)
http://purl.oclc.org/emls/09-2/kaysep.html

Fetching http://www.shu.ac.uk/emls/09-2/kidthom.html
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 3 old)

Excluded Urls: (1 new, 1 old)
http://purl.oclc.org/emls/09-2/kidthom.html

Fetching http://www.shu.ac.uk/emls/09-2/michmonu.html
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 4 old)

Excluded Urls: (10 new, 1 old)
http://emblem.libraries.psu.edu/catalog.htm
http://purl.oclc.org/emls/09-2/michmonu.html
http://www.alexjennings.com/
http://www.cambria.demon.co.uk/-theatr/critical/design.htm
http://www.chass.utoronto.ca/english/emed/-patterweb.html
http://www.gunpowder-plot.org/people/h_garnet.htm
http://www.mun.ca/alciato/whit/w030.html
http://www.mun.ca/alciato/whit/w131.html
http://www.mun.ca/alciato/whit/w193.html
http://www.mun.ca/alciato/whit/w229b.html

Fetching http://www.shu.ac.uk/emls/09-2/mitsfood.html
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 4 old)

Excluded Urls: (2 new, 1 old)
http://darkwing.uoregon.edu/%7Erbear/ren.htm%20
http://purl.oclc.org/emls/09-2/mitsfood.html

Fetching http://www.shu.ac.uk/emls/09-2/nevipost.html
Type: text/html; charset=ISO-8859-1, extracting Urls

Excluded Urls: (1 new, 4 old)
http://purl.oclc.org/emls/09-2/nevipost.html

Fetching http://www.shu.ac.uk/emls/09-2/revbook.htm
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 4 old)

Excluded Urls: (0 new, 1 old)

Fetching http://www.shu.ac.uk/emls/09-2/richthir.html
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 4 old)

Excluded Urls: (2 new, 1 old)
http://purl.oclc.org/emls/09-2/richthir.html
http://www.ccel.org/fathers2/ANF-04/anf04-44.htm

Fetching http://www.shu.ac.uk/emls/09-2/roseroch.html
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 4 old)

Excluded Urls: (1 new, 1 old)
http://purl.oclc.org/emls/09-2/roseroch.html

Fetching http://www.shu.ac.uk/emls/09-2/stegbest.html
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (2 new, 0 old)
http://www.shu.ac.uk/emls/09-2/image002.jpg
http://www.shu.ac.uk/emls/09-2/image004.jpg

Excluded Urls: (2 new, 5 old)
http://web.uvic.ca/shakespeare/
http://www.cict.co.uk/software

Fetching http://www.shu.ac.uk/emls/09-2/tamerrev.html
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 3 old)

Excluded Urls: (1 new, 1 old)
http://purl.oclc.org/emls/09-2/tamerrev.html

Fetching http://www.shu.ac.uk/emls/09-3/abstract2.htm
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 9 old)

Excluded Urls: (0 new, 2 old)

Fetching http://www.shu.ac.uk/emls/09-3/bestintr.htm
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (2 new, 1 old)
http://www.shu.ac.uk/emls/09-3/images/button.gif
http://www.shu.ac.uk/emls/09-3/images/header2.gif

Excluded Urls: (10 new, 2 old)
http://dewey.library.upenn.edu/sceti/furness/
http://dieoff.com/page95.htm
http://purl.oclc.org/emls/03-3/03-3toc.html
http://purl.oclc.org/emls/09-3/bestintr.htm
http://purl.oclc.org/emls/emlscopy.html
http://web.mala.bc.ca/hssfc/Final/Credibility.htm
http://web.mala.bc.ca/hssfc/Final/QuestionnaireR.htm
http://web.uvic.ca/shakespeare/Annex/DraftTxt/
http://web.uvic.ca/shakespeare/Library/Texts/
http://web.uvic.ca/shakespeare/index.html

Fetching http://www.shu.ac.uk/emls/09-3/finntabl.htm
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 3 old)

Excluded Urls: (5 new, 3 old)
http://purl.oclc.org/emls/09-3/finntabl.htm
http://web.uvic.ca/shakespeare/Library/SLT/index.html
http://www.iath.virginia.edu/rossetti/
http://www.iath.virginia.edu/rossetti/index.html
http://www.quodlibet.net/gift.shtml

Fetching http://www.shu.ac.uk/emls/09-3/forsplay.html
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 3 old)

Excluded Urls: (2 new, 2 old)
http://purl.oclc.org/emls/09-3/forsplay.html
http://www.wikipedia.org/wiki/Shakespeare

Fetching http://www.shu.ac.uk/emls/09-3/galedizz.htm
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (2 new, 3 old)
http://www.shu.ac.uk/emls/09-3/images/longS.gif
http://www.shu.ac.uk/emls/si-02/si-02toc.html

Excluded Urls: (8 new, 2 old)
http://purl.oclc.org/emls/09-3/galedizz.htm
http://web.uvic.ca/shakespeare/Annex/DraftTxt/Ado/index.html
http://web.uvic.ca/shakespeare/Annex/DraftTxt/Cor/Cor_FPages/Cor_Fcc2v.html
http://web.uvic.ca/shakespeare/Annex/DraftTxt/Ham/Ham_Q2/Ham_Q2Pages/Ham_Q2N2v.html
http://web.uvic.ca/shakespeare/Annex/DraftTxt/Pref/PrefPages/Pref.html
http://web.uvic.ca/shakespeare/Annex/DraftTxt/Tmp/Tmp_FPages/Tmp_FB3.html
http://www.tei-c.org/P4X/
http://www.tei-c.org/P4X/CO.html

Fetching http://www.shu.ac.uk/emls/09-3/hopewhit.htm
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (9 new, 3 old)
http://www.shu.ac.uk/emls/09-3/hopewhitmore_files/image001.jpg
http://www.shu.ac.uk/emls/09-3/hopewhitmore_files/image002.jpg
http://www.shu.ac.uk/emls/09-3/hopewhitmore_files/image003.jpg
http://www.shu.ac.uk/emls/09-3/hopewhitmore_files/image004.jpg
http://www.shu.ac.uk/emls/09-3/hopewhitmore_files/image005.jpg
http://www.shu.ac.uk/emls/09-3/hopewhitmore_files/image006.jpg
http://www.shu.ac.uk/emls/09-3/hopewhitmore_files/image007.jpg
http://www.shu.ac.uk/emls/09-3/hopewhitmore_files/image008.jpg
http://www.shu.ac.uk/emls/09-3/hopewhitmore_files/image009.jpg

Excluded Urls: (1 new, 2 old)
http://purl.oclc.org/emls/09-3/hopewhit.htm

Fetching http://www.shu.ac.uk/emls/09-3/massrede.htm
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 3 old)

Excluded Urls: (1 new, 2 old)
http://purl.oclc.org/emls/09-3/massrede.htm

Fetching http://www.shu.ac.uk/emls/09-3/measrev.html
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 2 old)

Excluded Urls: (1 new, 2 old)
http://purl.oclc.org/emls/09-3/measrev.html

Fetching http://www.shu.ac.uk/emls/09-3/myerrev.html
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 2 old)

Excluded Urls: (1 new, 2 old)
http://purl.oclc.org/emls/09-3/myerrev.html

Fetching http://www.shu.ac.uk/emls/09-3/rasmgild.htm
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 3 old)

Excluded Urls: (5 new, 2 old)
http://purl.oclc.org/emls/09-3/rasmgild.htm
http://web.uvic.ca/shakespeare/Library/Texts/Poems/Ven/Ven_QI/Ven_QIE4v.html
http://web.uvic.ca/shakespeare/Library/Texts/Poems/Ven/Ven_QI/Ven_QIH1v.html
http://web.uvic.ca/shakespeare/Library/Texts/Poems/Ven/Ven_QT/Ven_QPages/Ven_QE4v.html
http://web.uvic.ca/shakespeare/Library/Texts/Poems/Ven/index.html

Fetching http://www.shu.ac.uk/emls/09-3/revandr.htm
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (1 new, 2 old)
http://www.shu.ac.uk/emls/09-3/emlscopy.html

Excluded Urls: (1 new, 1 old)
http://purl.oclc.org/emls/09-3/revandr.htm

Fetching http://www.shu.ac.uk/emls/09-3/revbook.htm
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 4 old)

Excluded Urls: (0 new, 1 old)

Fetching http://www.shu.ac.uk/emls/09-3/revclar.htm
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 3 old)

Excluded Urls: (1 new, 1 old)
http://purl.oclc.org/emls/09-3/revclar.htm

Fetching http://www.shu.ac.uk/emls/09-3/revcurra.htm
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 2 old)

Excluded Urls: (2 new, 1 old)
file:/C%7c/Documents%20and%20Settings/scsms1/Desktop/emlscopy.html
http://purl.oclc.org/emls/09-3/revcurra.htm

Fetching http://www.shu.ac.uk/emls/09-3/revosto.htm
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 3 old)

Excluded Urls: (1 new, 1 old)
http://purl.oclc.org/emls/09-3/revosto.htm

Fetching http://www.shu.ac.uk/emls/09-3/revroth.htm
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 3 old)

Excluded Urls: (2 new, 1 old)
http://assets.cup.org/0521822556/sample/0521822556WS.pdf
http://purl.oclc.org/emls/09-3/revroth.htm

Fetching http://www.shu.ac.uk/emls/09-3/revspil.htm
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 3 old)

Excluded Urls: (1 new, 1 old)
http://purl.oclc.org/emls/09-3/revspil.htm

Fetching http://www.shu.ac.uk/emls/09-3/revtarg.htm
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 2 old)

Excluded Urls: (1 new, 2 old)
http://purl.oclc.org/emls/09-3/revtarg.htm

Fetching http://www.shu.ac.uk/emls/09-3/shrewrev.html
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 2 old)

Excluded Urls: (1 new, 2 old)
http://purl.oclc.org/emls/09-3/shrewrev.html

Fetching http://www.shu.ac.uk/emls/si-13/billing/index.htm
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (21 new, 3 old)
http://www.shu.ac.uk/emls/si-13/billing/pictures/barber2.wrl
http://www.shu.ac.uk/emls/si-13/billing/pictures/fig1.gif
http://www.shu.ac.uk/emls/si-13/billing/pictures/fig10.jpg
http://www.shu.ac.uk/emls/si-13/billing/pictures/fig10small.jpg
http://www.shu.ac.uk/emls/si-13/billing/pictures/fig1small.gif
http://www.shu.ac.uk/emls/si-13/billing/pictures/fig2.gif
http://www.shu.ac.uk/emls/si-13/billing/pictures/fig2small.gif
http://www.shu.ac.uk/emls/si-13/billing/pictures/fig3.gif
http://www.shu.ac.uk/emls/si-13/billing/pictures/fig3small.gif
http://www.shu.ac.uk/emls/si-13/billing/pictures/fig4.gif
http://www.shu.ac.uk/emls/si-13/billing/pictures/fig4small.gif
http://www.shu.ac.uk/emls/si-13/billing/pictures/fig5.gif
http://www.shu.ac.uk/emls/si-13/billing/pictures/fig5small.gif
http://www.shu.ac.uk/emls/si-13/billing/pictures/fig6.jpg
http://www.shu.ac.uk/emls/si-13/billing/pictures/fig6small.jpg
http://www.shu.ac.uk/emls/si-13/billing/pictures/fig7.jpg
http://www.shu.ac.uk/emls/si-13/billing/pictures/fig7small.jpg
http://www.shu.ac.uk/emls/si-13/billing/pictures/fig8.jpg
http://www.shu.ac.uk/emls/si-13/billing/pictures/fig8small.jpg
http://www.shu.ac.uk/emls/si-13/billing/pictures/fig9.jpg
http://www.shu.ac.uk/emls/si-13/billing/pictures/fig9small.jpg

Excluded Urls: (1 new, 2 old)
http://purl.oclc.org/emls/si-13/billing

Fetching http://www.shu.ac.uk/emls/si-13/carson/index.htm
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (14 new, 4 old)
http://www.shu.ac.uk/emls/si-13/carson/models/aldwych.wrl
http://www.shu.ac.uk/emls/si-13/carson/models/barbican.wrl
http://www.shu.ac.uk/emls/si-13/carson/models/cottesloe.wrl
http://www.shu.ac.uk/emls/si-13/carson/models/globe.wrl
http://www.shu.ac.uk/emls/si-13/carson/models/lyttleton.wrl
http://www.shu.ac.uk/emls/si-13/carson/models/olivier.wrl
http://www.shu.ac.uk/emls/si-13/carson/models/rst.wrl
http://www.shu.ac.uk/emls/si-13/carson/models/swan.wrl
http://www.shu.ac.uk/emls/si-13/carson/models/top_new.wrl
http://www.shu.ac.uk/emls/si-13/carson/models/top_old.wrl
http://www.shu.ac.uk/emls/si-13/carson/pictures/fig1small.jpg
http://www.shu.ac.uk/emls/si-13/carson/pictures/fig2small.jpg
http://www.shu.ac.uk/emls/si-13/carson/pictures/fig3small.jpg
http://www.shu.ac.uk/emls/si-13/carson/pictures/fig4small.jpg

Excluded Urls: (4 new, 2 old)
http://purl.oclc.org/emls/si-13/carson
http://purl.oclc.org/emls/si-13/egan
http://www.openstages.com/help/help.htm
http://www.pads.ahds.ac.uk/

Fetching http://www.shu.ac.uk/emls/si-13/egan/index.htm
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (12 new, 3 old)
http://www.shu.ac.uk/emls/si-13/egan/emlscopy.html
http://www.shu.ac.uk/emls/si-13/egan/pictures/Figure1.gif
http://www.shu.ac.uk/emls/si-13/egan/pictures/Figure2.gif
http://www.shu.ac.uk/emls/si-13/egan/pictures/Figure3.gif
http://www.shu.ac.uk/emls/si-13/egan/pictures/Figure4a.gif
http://www.shu.ac.uk/emls/si-13/egan/pictures/Figure4b.gif
http://www.shu.ac.uk/emls/si-13/egan/pictures/Figure5a.gif
http://www.shu.ac.uk/emls/si-13/egan/pictures/Figure5b.gif
http://www.shu.ac.uk/emls/si-13/egan/pictures/Figure6a.gif
http://www.shu.ac.uk/emls/si-13/egan/pictures/Figure6b.gif
http://www.shu.ac.uk/emls/si-13/egan/pictures/Figure7a.gif
http://www.shu.ac.uk/emls/si-13/egan/pictures/Figure7b.gif

Excluded Urls: (0 new, 2 old)

Fetching http://www.shu.ac.uk/emls/si-13/fitzpatrick/index.htm
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (15 new, 2 old)
http://www.shu.ac.uk/emls/si-13/fitzpatrick/pictures/fig1.jpg
http://www.shu.ac.uk/emls/si-13/fitzpatrick/pictures/fig10.gif
http://www.shu.ac.uk/emls/si-13/fitzpatrick/pictures/fig11.gif
http://www.shu.ac.uk/emls/si-13/fitzpatrick/pictures/fig12.gif
http://www.shu.ac.uk/emls/si-13/fitzpatrick/pictures/fig13.gif
http://www.shu.ac.uk/emls/si-13/fitzpatrick/pictures/fig1small.jpg
http://www.shu.ac.uk/emls/si-13/fitzpatrick/pictures/fig2.jpg
http://www.shu.ac.uk/emls/si-13/fitzpatrick/pictures/fig2small.jpg
http://www.shu.ac.uk/emls/si-13/fitzpatrick/pictures/fig3.gif
http://www.shu.ac.uk/emls/si-13/fitzpatrick/pictures/fig4.gif
http://www.shu.ac.uk/emls/si-13/fitzpatrick/pictures/fig5.gif
http://www.shu.ac.uk/emls/si-13/fitzpatrick/pictures/fig6.gif
http://www.shu.ac.uk/emls/si-13/fitzpatrick/pictures/fig7.gif
http://www.shu.ac.uk/emls/si-13/fitzpatrick/pictures/fig8.gif
http://www.shu.ac.uk/emls/si-13/fitzpatrick/pictures/fig9.gif

Excluded Urls: (1 new, 2 old)
http://purl.oclc.org/emls/si-13/fitzpatrick

Fetching http://www.shu.ac.uk/emls/si-13/intro.htm
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (0 new, 3 old)

Excluded Urls: (1 new, 1 old)
http://purl.oclc.org/emls/si-13/intro.htm

Fetching http://www.shu.ac.uk/emls/si-13/technical-note.htm
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (4 new, 3 old)
http://www.shu.ac.uk/emls/si-13/plug-ins/Cortona%20for%20Mac/cortvrml_ppc.bin
http://www.shu.ac.uk/emls/si-13/plug-ins/Cortona%20for%20Windows/cortvrml.exe
http://www.shu.ac.uk/emls/si-13/plug-ins/Cosmo%20for%20Mac/cosmo_mac.hqx
http://www.shu.ac.uk/emls/si-13/plug-ins/Cosmo%20for%20Windows/cosmo_win95nt_eng.exe

Excluded Urls: (0 new, 1 old)


Summary for starting Url: http://www.shu.ac.uk/emls/lockss-volume9.html and depth: 3

Urls fetched: 84    Urls extracted: 804

Depth  Fetched  Parsed  New URLs
    1        1       1         7
    2        7       4        76
    3       76      70        82

Remaining unfetched: 82

Elapsed Time: 498 secs.    Fetch Rate: 10 p/m
  • Reviewing this rather long list, it appears that the crawl rules are now including and excluding the correct files. Note that the included files are not only HTML pages: they also include .gif and .jpg images, .wrl (VRML) files, and programs for Windows and the Mac.
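Checking a list this long by eye is error-prone, so it can help to tally the output instead. The following is a minimal sketch in Python, not part of the Plugin Tool. It assumes the test output has been saved as text in the format shown above; the function name summarize_crawl_log is hypothetical. It counts newly included URLs by file extension and newly excluded URLs by host. The sample is the entry for 09-2/stegbest.html copied from the log above.

```python
import posixpath
import re
from collections import Counter
from urllib.parse import urlparse


def summarize_crawl_log(text):
    """Tally newly included URLs by extension and newly excluded URLs by host.

    Assumes the log layout shown above: an "Included Urls:" or
    "Excluded Urls:" header, followed by one URL per line, ended by a
    blank line. This is an illustrative helper, not a LOCKSS API.
    """
    included = Counter()
    excluded = Counter()
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Included Urls:"):
            section = "included"
        elif line.startswith("Excluded Urls:"):
            section = "excluded"
        elif not line or line.startswith(("Fetching ", "Type:", "Depth ", "Summary ")):
            # Any blank or non-URL log line ends the current URL list.
            section = None
        elif section and re.match(r"^[a-z]+:", line):
            parsed = urlparse(line)
            if section == "included":
                ext = posixpath.splitext(parsed.path)[1].lower() or "(none)"
                included[ext] += 1
            else:
                # file: links have no host, so fall back to the scheme.
                excluded[parsed.netloc or parsed.scheme + ":"] += 1
    return included, excluded


# Entry copied from the test output above.
sample = """Fetching http://www.shu.ac.uk/emls/09-2/stegbest.html
Type: text/html; charset=ISO-8859-1, extracting Urls

Included Urls: (2 new, 0 old)
http://www.shu.ac.uk/emls/09-2/image002.jpg
http://www.shu.ac.uk/emls/09-2/image004.jpg

Excluded Urls: (2 new, 5 old)
http://web.uvic.ca/shakespeare/
http://www.cict.co.uk/software
"""

included, excluded = summarize_crawl_log(sample)
print(dict(included))
print(dict(excluded))
```

Run over the whole log, an unexpected host in the included tally, or the journal's own host dominating the excluded tally, would be a hint that a crawl rule needs another look. The summary block can be checked the same way: the per-depth Fetched column adds up to the reported total (1 + 7 + 76 = 84), and 84 fetches in 498 seconds is about 10 per minute, which matches the reported fetch rate if "p/m" is read as pages per minute.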