Web Science and Digital Libraries Research Group: Composite Memento

Showing posts with label Composite Memento. Show all posts

Monday, January 8, 2018

2018-01-08: Introducing Reconstructive - An Archival Replay ServiceWorker Module

Web pages are generally composed of many resource such as images, style sheets, JavaScript, fonts, iframe widgets, and other embedded media. These embedded resources can be referenced in many ways (such as relative path, absolute path, or a full URL). When the same page is archived and replayed from a different domain under a different base path, these references may not resolve as intended, hence, may result in a damaged memento. For example, a memento (an archived copy) of the web page https://www.odu.edu/ can be seen at https://web.archive.org/web/20180107155037/https://www.odu.edu/. Note that domain name has changed from www.odu.edu to web.archive.org and some extra path segments are added to it. In order for this page to render properly, various resource references in it are rewritten, for example, images/logo-university.png in a CSS file is replaced with /web/20171225230642im_/http://www.odu.edu/etc/designs/odu/images/logo-university.png.

Traditionally, web archival replay systems rewrite link and resource references in HTML/CSS/JavaScript responses so that they resolve to their corresponding archival version. Failure to do so would result in a broken rendering of archived pages (composite mementos) as the embedded resource references might resolve to their live version or an invalid location. With the growing use of JavaScript in web applications, often resources are injected dynamically, hence rewriting such references is not possible from the server side. To mitigate this issue, some JavaScript is injected in the page that overrides the global namespace to modify the DOM and monitor all network activity. In JCDL17 and WADL17 we proposed a ServiceWorker-based solution to this issue that requires no server-side rewriting, but catches every network request, even those that were initiated due to dynamic resource injection. Read our paper for more details.

Sawood Alam, Mat Kelly, Michele C. Weigle and Michael L. Nelson, "Client-side Reconstruction of Composite Mementos Using ServiceWorker," In JCDL '17: Proceedings of the 17th ACM/IEEE-CS Joint Conference on Digital Libraries. June 2017, pp. 237-240.

Avoiding Zombies in Archival Replay Using ServiceWorker by Sawood Alam

URL Rewriting

There are primarily three ways to reference a resource from another resource, namely, relative path, absolute path, and absolute URL. All three have their own challenges when served from an archive (or from a different origin and/or path than the original). In the case of archival replay, both the origin and base paths are changed from the original while original origin and paths usually become part of the new path. Relative paths are often the easiest to replay as they are not tied to the origin or the root path, but they cannot be used for external resources. Absolute paths and absolute URLs on the other hand are resolved incorrectly or live-leaked when a primary resource is served from an archive, neither of these conditions are desired in archival replay. There is a fourth way of referencing a resource called schemeless (or protocol-relative) that starts with two forward slashes followed by a domain name and paths. However, usually web archives ignore the scheme part of the URI when canonicalizing URLs, so we can focus on just three main ways. The following table illustrates examples of each with their resolution issues.

Reference type	Example	Resolution after relocation
Relative path	images/logo.png	Potentially correct
Absolute path	/public/images/logo.png	Potentially incorrect
Absolute URL	http://example.com/public/images/logo.png	Potentially live leakage

Archival replay systems (such as OpenWayback and PyWB) rewrite responses before serving to the client in a way that various resource references point to their corresponding archival page. Suppose a page, originally located at http://example.com/public/index.html, has an image in it that is referenced as <img src="/public/images/logo.png">. When the same page is served from an archive at http://archive.example.org/<datetime>/http://example.com/public/index.html, the image reference needs to be rewritten as <img src="/<datetime>/http://example.com/public/images/logo.png"> in order for it to work as desired. However, URLs constructed by JavaScript, dynamically on the client-side are difficult to rewrite just by the static analysis of the code at server end. With the rising usage of JavaScript in web pages, it is becoming more challenging for the archival replay systems to correctly replay archived web pages.

ServiceWorker

ServiceWorker is a new web API that can be used to intercept all the network requests within its scope or originated from its scope (with a few exceptions such as an external iframe source). A web page first delivers a ServiceWorker script and installs it in the browser, which is registered to watch for all requests from a scoped path under the same origin. Once installed, it persists for a long time and intercepts all subsequent requests withing its scope. An active ServiceWorker sits in the middle of the client and the server as a proxy (which is built-in to the browser). It can change both requests and responses as necessary. The primary use-case of the API is to provide better offline experience in web apps by serving pages from a client-side cache when there is no network or populating/synchronizing the cache. However, we found it useful to solve an archival replay problem.

Reconstructive

We created Reconstructive, a ServiceWorker module for archival replay that sits on the client-side and intercepts every potential archival request to properly reroute it. This approach requires no rewrites from the server side. It is being used successfully in our IPFS-based archival replay system called InterPlanetary Wayback (IPWB). The main objective of this module is to help reconstruct (hence the name) a composite memento (from one or more archives) while preventing from any live-leaks (also known as zombie resources) or wrong URL resolutions.

Unicorns and zombies in #WebArchiving practice, from @ibnesayeed. On leaks from live web into archives. #jcdl2017 #wadl2017 pic.twitter.com/SpfI0ynMCN
— Ian Milligan (@ianmilligan1) June 22, 2017

The following figure illustrates an example where an external image reference in an archived web page would have leaked to the live-web, but due to the presence of Reconstructive, it was successfully rerouted to the corresponding archived copy instead.

In order to reroute requests to the URI of a potential archived copy (also known as Memento URI or URI-M) Reconstructive needs the request URL and the referrer URL, of which the latter must be a URI-M. It extracts the datetime and the original URI (or URI-R) of the referrer then combines them with the request URL as necessary to construct a potential URI-M for the request to be rerouted to. If the request URL is already a URI-M, it simply adds a custom request header X-ServiceWorker and fetches the response from the server. When necessary, the response is rewritten on the client-side to fix some quirks to make sure that the replay works as expected or to optionally add an archival banner. The following flowchart diagram shows what happens in every request/response cycle of a fetch event in Reconstructive.

We have also released an Archival Capture Replay Test Suite (ACRTS) to test the rerouting functionality in different scenarios. It is similar to our earlier Archival Acid Test, but more focused on URI references and network activities. The test suite comes with a pre-captured WARC file of a live test page. captured resources are all green while the live site has everything red. The WARC file can be replayed using any archival replay system to test how well the system is replaying archived resources. In the test suite a green box means properly rerouting, red box means a live-leakage, while white/gray means incorrectly resolving the reference.

Module Usage

The module is intended to be used by archival replay systems backed by Memento endpoints. It can be a web archive such as IPWB or a Memento aggregator such as MemGator. In order use the module, write a ServiceWorker script (say, serviceworker.js) with your own logic to register and update it. In that script, import reconstructive.js script (locally or externally) which will make the Reconstructive module available with all of its public members/functions. Then bind the fetch event listener to the publicly exposed Reconstructive.reroute function.

importScripts('https://oduwsdl.github.io/Reconstructive/reconstructive.js');
const rc = new Reconstructive();
self.addEventListener('fetch', rc.reroute);

This will start rerouting every request according to a default URI-M pattern while excluding some requests that match a default set of exclusion rules. However, URI-M pattern, exclusion rules, and many other configuration options can be customized. It even allows customization of the default response rewriting function and archival banner. The module can also be configured to only reroute a subset of the requests while letting the parent ServiceWorker script deal with the rest. For more details read the user documentation, example usage (registration process and sample ServiceWorker), or heavily documented module code.

Archival Banner

Reconstructive module has implemented a custom element named <reconstructive-banner> to provide an archival banner functionality. The banner element utilizes Shadow DOM to prevent any styles from the banner to leak into the page or the other way. Banner inclusion can be enabled by setting the showBanner configuration option to true when initializing Reconstructive module after which it will be added to every navigational page. Unlike many other archival banners in use, it does not use an iframe or stick to the top of the page. It floats at the bottom of the page, but goes out of the way when not needed. The banner element is currently in its early stage with very limited information and interactivity, but it is intended to be evolved in to a more functional component.

<script src="https://oduwsdl.github.io/Reconstructive/reconstructive-banner.js"></script>
<reconstructive-banner urir="http://example.com/" datetime="20180106175435"></reconstructive-banner>

Limitations

It is worth noting that we rely on some fairly new web APIs that might not have a very good and consistent support across all browsers and may potentially change in future. At the time of writing this post ServiceWorker support is available in about 74% active browsers globally. To help the server identify whether a request is coming from Reconstructive (to provide fallback of server-side rewriting), we add a custom request header X-ServiceWorker.

As per current specifications, there can be only one service worker active on a given scope. This means, if an archived page has its own ServiceWorker, it cannot work along with Reconstructive. However, in usual web apps ServiceWorkers are generally used for better user experience and gracefully degrade to remain functional (this is not guaranteed though). The best we can do in this case is to rewrite every ServiceWorker registration code (on client-side) in any archived page before serving the response to disable it so that Reconstructive continues to work.

Conclusions

We conceptualized an idea, experimented with it, published a peer-reviewed paper on it, implemented it in a more production-ready fashion, used it in a novel archival replay system, and made the code publicly available under the MIT License. We also released a test suite ACRTS that can be useful by itself. This work is supported in part by NSF grant III 1526700.

Resources

Reconstructive (code repository)
ACRTS (code repository)
IPWB (code repository)
MemGator (code repository)
Client-side Reconstruction of Composite Mementos Using ServiceWorker (paper)
Avoiding Zombies in Archival Replay Using ServiceWorker (presentation slides)

Update [Jan 22, 2018]: Updated usage example after converting the module to an ES6 Class and updated links after changing the repo name from "reconstructive" to "Reconstructive".

--
Sawood Alam

Tuesday, December 8, 2015

2015-12-08: Evaluating the Temporal Coherence of Composite Mementos

When an archived web page is viewed using the Wayback Machine, the archival datetime is easy to determine from the URI and the Wayback Machine's display. The archival datetime of embedded resources (images, CSS, etc.) is another story. And what stories their archival datetimes can tell. These stories are the topic of my recent research and Hypertext 2015 publication. This post introduces composite mementos, the evaluation of their temporal (in-)coherence, provides an overview of my research results.

What is a composite memento?

A Memento is an archived copy of web resource (RFC 7089) The datetime when the copy was archived is called its Memento-Datetime. A composite memento is a root resource such as an HTML web page and all of the embedded resources (images, CSS, etc.) required for a complete presentation. Composite mementos can be thought of as a tree structure. The root resource embeds other resources, which may themselves embed resources, etc. The figure below shows this tree structure and a composite memento of the ODU Computer Science home page as archived by the Internet Archive on 2005-05-14 01:36:08 GMT. Or does it?

Hints of Temporal Incoherence

Consider the following weather report that was captured 2004-12-09 19:09:26 GMT. The Memento-Datetime can be found in the URI and the December 9, 2004 capture date is clearly visible near the upper right. Look closely at description of Current Conditions and the radar image. How can there be no clouds on the radar when the current conditions are light drizzle? Something is wrong here. We have encountered temporal incoherence. This particular incoherence is caused by inherent delays of the capture process used by Heritrix and other crawler-based web archives. In this case, the radar image was captured much later (9 months!) than the web page itself. However, there is no indication of this condition.

A Framework for Evaluating Temporal Coherence

In order to study temporal coherence of composite mementos, a framework was needed. The framework details a series of patterns describing the relationships between root and embedded mementos and four coherence states. The four states and sample patterns are described below. The technical report describing the framework is available on arXiv.

Prima Facie Coherent

An embedded memento is prima facie coherent when evidence shows that it existed in its archived state at the time the root was captured. The figure below illustrates the most common case. Here the embedded memento was captured after the root but modified before the root. The importance of Last-Modified is discussed in my previous post on the importance of header replay.

Possibly Coherent

An embedded memento is possibly coherent when evidence shows that it might have existed in its archived state at the time the root was captured. The figure below illustrates this case. Here the embedded memento was captured before the root.

Probably Violative

An embedded memento is probably violative when evidence shows that it might not have existed in its archived state at the time the root was captured. The figure below illustrates this case. Here the embedded memento was captured after the root, but its Last-Modified datetime is unknown.

Prima Facie Violative

An embedded memento is probably violative when evidence shows that it did not exist in its archived state at the time the root was captured. The figure below illustrates this case. Here the embedded memento was captured after the root and was also modified after the root.

Only One in Five Archived Web Pages Existed as Presented

Using the framework, we evaluated the temporal coherence of 82,425 composite mementos. These contained 1,623,127 embedded URIs, of which 1,332,993 were available in a web archive. Composite mementos were recomposed using single and multiple archives and two heuristics: minimum distance and bracket.

Single and multiple archives: Composite mementos were recomposed from single and multiple archives. For single archives, all embedded mementos were selected from the same archive as the root. For multiple archives, embedded mementos were selected from any of the 15 archives included in the study.

Heuristics: The minimum distance (or nearest) heuristic selects between multiple captures for the same URI by choosing the memento with the Memento-Datetime nearest to the root's Memento-Datetime, and can be either before or after the root's. The bracket heuristic also takes Last-Modified datetime into account. When a memento's Last-Modified datetime and Memento-Datetime "bracket" the root's Memento-Datetime (as in Prima Facie Coherent above), it is selected even if it is not the closest.

We found that only 38.7% of web pages are temporally coherent and that only 17.9% (roughly 1 in 5) of web pages are temporally coherent and can be fully recomposed (i.e., they have no missing resources).

The paper can be downloaded from the ACM Digital Library or from my local copy. The slides from the Hypertext'15 talk follow.

Only One Out of Five Archived Web Pages Existed as Presented from ScottAinsworth

One last thing: I would like to thank Ted Goranson for presenting the slides at Hypertext 2015 when we could not attend.

-- Scott G. Ainsworth

Saturday, March 1, 2014

2014-03-01 Domains per page over time

A few days ago, I read an interesting blog post by Peter Bengtsson. Peter is sampling web pages and computing basic statistic on the number of domains (RFC 3986 Host) required to completely render the page. Not surprisingly, the mean is quite high: 33. Also not surprisingly, he has found pages that depend on more than 100 different domains.

This started me thinking about how this has changed over time. Over the course of my research I have acquired a corpus of composite mementos (archived web pages and all their embedded images, CSS, etc.) dating from 1996 to 2013. So, I did a little number crunching. What I suspected and confirmed is that the number of domains has increased over time and that the rate of increase has also increased. This is reflected in the median domains data show in Figure 1.

Note that the median shown (3) is a fraction of Peter's (25). I believe there are two major reasons for this. First, our current process for recomposing composite mementos from web archives does not run JavaScript, thus it only finds static URIs. Second, Peter's sample appears to be heavy on media sites, which tend to aggregate information, social media, and advertising from a multitude of other sites. On the other hand, our sample of 4,000 URIs and 82,425 composite mementos might be larger than Peter's sample and is probably more diverse. This difference is immaterial; direct comparability with Peter's results is not required to examine change in domains over time.

Figure 1 also shows the median resources (more precisely, the median unique URIs), required to recompose composite mementos. Median resources also increased over time. Furthermore, as shown in Figure 2, domains clearly increase as resources increase. This correlation seems to weaken as the number of resources increases. Note, however, that above 250 resources, the data is quite thin.

Another question that comes to mind is "what is the occurrence frequency of composite mementos at each resource level?" Figure 3 show an ECDF for the data. Although it is hard to tell from the figure, 99% (81,387) of our composite mementos use 100 resources or less and 90% use 43 or less. Indeed, only 34 (0.0412%) have more than 300 resources.

Also interesting is the distribution of composite mementos with respect to number of domains, which is shown in Figure 4. Here 97.5% of our composite mementos use at most 10 domains. It only takes 14 domains to cover 99% or our composite mementos.

Clearly, the number of resources and domains per web page has increased over time and the rate of increase has accelerated over time. These results are not directly comparable to Peter Bengtsson's, but I suspect were he to use a 17-year sample the same patterns would emerge. I was half tempted to plug the 4,000 URIs from our sample into Peter's Number of Domains page to see what happens, unfortunately I don't have the time available. Still, the results would be very interesting.

—Scott G. Ainsworth

March 2 update: minor grammatical corrections.

Web Science and Digital Libraries Research Group