HTML Resource Packages

Editor: Justin Lebar <justin.lebar@gmail.com>

Introduction

HTML resource packages help make webpages load faster by reducing the number of HTTP requests necessary to load a document.

A document specifies that it's using a resource package by adding a packages attribute to its <html> element. User agents which implement this specification will then try to load resources for that document out of the archive specified in the attribute instead of issuing separate HTTP requests for each resource.

This document is a formal specification of HTML resource packages. It is structured as a set of modifications to the HTML5 specification.

HTML resource packages were originally proposed by Alexander Limi in November 2009. This specification is largely a formalization of the original proposal, although it changes a few details and adds many others.

Status of this document

This document is currently a draft proposal. It's not endorsed by any standards body, and it may undergo large changes in response to feedback.

Please send comments to justin.lebar@gmail.com.

Intended audience

The primary audience of this document is web browser developers. As such, the focus of this document is on how to implement resource packages in a browser, not how to use resource packages effectively within a web page.

This document is a set of modifications to the HTML5 specification, and as such it assumes familiarity with basic concepts from that document, such as HTTP and the DOM.

Example

Below is a simple HTML document which makes use of resource packages:

<!DOCTYPE HTML>
<html packages='[pkg1.zip img1.png script.js styles/style.css]
                static/pkg2.zip'>
<head>
  <link rel='stylesheet' href='styles/style.css'>
</head>
<body>
  <script src='script.js'></script>
  <img src='img1.png'>
  <img src='static/img2.png'>
  <img src='img3.png'>
</body>
</html>

A browser which does not understand resource packages will ignore the packages attribute on the <html> element and load the page normally.

A browser which is aware of resource packages will download pkg1.zip and pkg2.zip. The browser will first examine pkg2.zip, since resource packages which appear later in the list override packages which appear earlier.

Since it's located inside the static directory, pkg2.zip can only serve files within that directory. If pkg2.zip contains a file named img2.png, the browser will use that file to display static/img2.png in the document. If not, the browser will request static/img2.png from the server.

The browser will then try to extract the files img1.png, script.js, and styles/style.css from pkg1.zip and use those files in the document instead of downloading them individually. If pkg1.zip is missing any of those files, the browser will download the missing file from the server using a separate HTTP request.

Since img3.png cannot possibly be served by either pkg1.zip or pkg2.zip, the browser doesn't need to wait until the resource packages have finished downloading to start downloading img3.png from the server.

Conventions

This document uses the following typographic conventions:

The packages attribute

This specification adds a new content attribute named packages to HTML5's definition of the html element.

The packages attribute gives the locations and possibly the contents of one or more resource packages.

Whenever the packages attribute on the root <html> element is changed (including when the document is first loaded, if the root <html> element has a packages attribute), the UA must run the "parse the packages attribute" algorithm.

This interacts badly with base tags. When we first load the document, we haven't seen any base tags yet, so we resolve the packages relative to the document's URL. Suppose our document has a base tag and we want to programmatically add a resource package. When we modify the packages attribute, we'll re-parse the whole thing and re-resolve all the links relative to the new base.

But suppose we said that resource packages were always resolved relative to the document's URL. Then we'd still have the same problem if the page used pushState to change its directory.

ZIP File Semantics

UAs must interpret zip files used as resource packages in a manner different from the official zip specification.

The purpose of these changes is to allow the UA to extract a zip file before the whole file has finished downloading. At issue is the fact that a zip file may contain multiple copies of a single file. Which copy is authoritative is indicated in a directory at the end of the file.

This feature was perhaps useful in the days of floppy disks and 14k modems because it allows a user to modify a small file inside a large archive simply by adding the new copy of the file to the end of the archive and updating the directory, but it has little value for our purposes.
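
To illustrate why the trailing directory matters, the following Python sketch walks a zip archive's local file headers from the front, using only the bytes received so far. It is an illustration of the motivation only, not the semantics this specification defines. The function name scan_local_entries is hypothetical, and the sketch simply gives up on entries whose sizes are stored in a trailing data descriptor.

```python
import struct

# Fixed-size part of a zip local file header (30 bytes, little-endian):
# signature, version, flags, method, time, date, crc32,
# compressed size, uncompressed size, name length, extra length.
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")


def scan_local_entries(data: bytes) -> list:
    """Return the names of entries that are completely present in data.

    Walks local file headers in order and never consults the central
    directory at the end of the archive.
    """
    names = []
    pos = 0
    while pos + LOCAL_HEADER.size <= len(data):
        (sig, _version, flags, _method, _time, _date, _crc,
         comp_size, _size, name_len, extra_len) = LOCAL_HEADER.unpack_from(data, pos)
        if sig != b"PK\x03\x04":
            break  # central directory or unrelated data: no more entries
        if flags & 0x08:
            break  # sizes live in a data descriptor; not handled in this sketch
        name_start = pos + LOCAL_HEADER.size
        end = name_start + name_len + extra_len + comp_size
        if end > len(data):
            break  # this entry has not been fully received yet
        names.append(data[name_start:name_start + name_len].decode("utf-8", "replace"))
        pos = end
    return names
```

Because the scan reports an entry as soon as its bytes have arrived, it cannot know whether a later copy of the same file or the final directory would have overridden it, which is the conflict the paragraphs above describe.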

UAs need to perform three operations on a zip archive in order to implement this specification. The semantics for these operations are described below, using terminology from section V of the zip specification:

Algorithms

This section defines the algorithms which a user agent supporting resource packages must implement.

Parse the packages attribute

This algorithm takes as input the value of the packages attribute on the document's root <html> element and constructs a list of package objects which it stores in the document.

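A package object describes a single resource package. It has three fields:

  • a package href, which is the absolute URL of the resource package,

  • a wildcard flag, which is either true or false, and

  • a contents list, which names files the package is declared to contain.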

We say that the resource pointed to by a package object's package href is its resource package.

The algorithm works as follows:

  1. Initialize an empty list of package objects.

  2. Let s be the string containing the value of the root <html> element's packages attribute.

  3. If s is empty, set the document's list of package objects to the empty list and exit this algorithm.

  4. Initialize a pointer i to the first character of s.

  5. While i does not point past the end of s, do the following:

    1. Skip whitespace.

    2. If i points to a "[" character (U+005B), do the following:

      1. Advance i forward one character.

      2. If i points past the end of s, exit from this loop.

      3. Collect a sequence of characters that are not "]" (U+005D) and split the resulting string on spaces. Let the resulting list be token list.

      4. If i does not point past the end of s, advance i forward one character.

      5. If token list has at least one element, do the following:

        1. Let u be token list's first element resolved relative to the document's base URL. If this fails, continue on to the next iteration of the loop.

        2. Create a new package object whose package href is u, whose wildcard flag is set to false, and whose contents are the remaining elements in the list. Add this object to the list of package objects.

    3. Otherwise, if i doesn't point to a "[" character, do the following:

      1. Collect a sequence of characters that are not whitespace and not "[".

      2. Let u be the resulting string resolved relative to the document's base URL. If this fails, continue on to the next iteration of the loop.

      3. Create a new package object whose package href is u, which has wildcard set to true, and which has an empty contents list. Add this object to the list of package objects.

  6. Let the document's list of package objects be the list of package objects constructed in this algorithm.
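
The steps above can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not a reference implementation: the names Package and parse_packages are hypothetical, urllib.parse.urljoin stands in for HTML5 URL resolution (so the "if this fails" branches are omitted), and str.split() stands in for splitting on spaces. Unlike a literal reading of step 5.3, the sketch stops when only trailing whitespace remains instead of resolving an empty string.

```python
from dataclasses import dataclass, field
from urllib.parse import urljoin

WHITESPACE = " \t\n\f\r"  # assumed to match HTML5's space characters


@dataclass
class Package:
    href: str                                     # absolute URL of the package
    wildcard: bool                                # True when no contents were listed
    contents: list = field(default_factory=list)  # declared file names


def parse_packages(s: str, base_url: str) -> list:
    packages = []
    i = 0
    while i < len(s):
        # Step 5.1: skip whitespace.
        while i < len(s) and s[i] in WHITESPACE:
            i += 1
        if i >= len(s):
            break  # assumption: trailing whitespace yields no package
        if s[i] == "[":
            i += 1                      # step 5.2.1
            if i >= len(s):
                break                   # step 5.2.2
            end = s.find("]", i)        # step 5.2.3: collect up to "]"
            if end == -1:
                end = len(s)
            tokens = s[i:end].split()
            i = min(end + 1, len(s))    # step 5.2.4: step over "]" if present
            if tokens:                  # step 5.2.5
                packages.append(Package(urljoin(base_url, tokens[0]), False, tokens[1:]))
        else:
            # Step 5.3: a bare URL names a wildcard package.
            start = i
            while i < len(s) and s[i] not in WHITESPACE and s[i] != "[":
                i += 1
            packages.append(Package(urljoin(base_url, s[start:i]), True))
    return packages
```

Applied to the packages attribute from the example earlier in this document, with a hypothetical document URL of http://example.com/index.html, the sketch yields a non-wildcard package for pkg1.zip with three declared files, followed by a wildcard package for static/pkg2.zip.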

A UA may begin downloading resource packages immediately after this algorithm finishes, or it may begin downloading them on demand. We recommend fetching immediately so as to minimize page load times.

It's important that UAs store the package object's package href as an absolute URL instead of storing a relative URL and resolving the package href when a resource is requested from the package.

Otherwise, a page, say http://evil.com, could do the following:

  • Specify a resource package as

    <html packages='pkg.zip'>
    
  • Cause the package to be downloaded. At the time of download, the package's absolute URL is http://evil.com/pkg.zip.

  • Change the page's base URL to http://bank.com.

Now when the page requests the resource at http://bank.com/logo.png, the (incorrect) algorithm would resolve the package href relative to the new base, http://bank.com, even though the package was downloaded from http://evil.com. Thus the UA would incorrectly allow http://bank.com/logo.png to be fetched from the package, and the user, upon asking her UA where the resource was downloaded from, might incorrectly believe that it came from http://bank.com instead of http://evil.com.
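
The difference between the two approaches can be illustrated with a short Python sketch. It assumes urllib.parse.urljoin as a stand-in for URL resolution, and the helper same_authority is hypothetical.

```python
from urllib.parse import urljoin, urlsplit

# Correct: resolved once, when the packages attribute is parsed
# (the base URL is still http://evil.com/).
stored_href = urljoin("http://evil.com/", "pkg.zip")

# Incorrect: resolved again at request time, after the base URL has changed.
late_href = urljoin("http://bank.com/", "pkg.zip")


def same_authority(a: str, b: str) -> bool:
    """Hypothetical helper: compare the authority components of two URLs."""
    return urlsplit(a).netloc == urlsplit(b).netloc


request = "http://bank.com/logo.png"
served_when_stored = same_authority(stored_href, request)   # False
served_when_late = same_authority(late_href, request)       # True
```

With the stored absolute href, the package from http://evil.com is never considered for the request to http://bank.com/logo.png.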

Fetching an absolute URL

The user agent must run this algorithm immediately before the main step of the HTML5 resource fetching algorithm for all requests for a resource within a document.

Define "within a document".

  1. Abort this algorithm and continue with the HTML5 resource fetching algorithm if any of the following are true:

    • the resource is to be obtained using a non-idempotent action (e.g. an HTTP POST),

    • the requested resource is not identified by an absolute URL (see note below), or

    • the requested resource is itself a resource package.

  2. Iterate in reverse order over the list of package objects created the last time we parsed the packages attribute. For each package object p, do the following:

    • Try to fetch the request from p.

    • If the call above succeeded, exit this algorithm and continue immediately after the main step of the HTML5 resource fetching algorithm.

  3. If we did not exit the algorithm during the loop above, execute the main step of the HTML5 resource fetching algorithm.
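
As a rough Python sketch of the control flow above: the function fetch_resource and its callback parameters are hypothetical, and a url of None is used here to model a resource that is not identified by an absolute URL. The return value only reports which path was taken.

```python
from typing import Callable, Optional, Sequence


def fetch_resource(
    url: Optional[str],
    packages: Sequence,
    try_fetch_from_package: Callable[[str, object], bool],
    fetch_normally: Callable[[Optional[str]], None],
    idempotent: bool = True,
    is_resource_package: bool = False,
) -> str:
    # Step 1: cases in which resource packages are never consulted.
    if not idempotent or url is None or is_resource_package:
        fetch_normally(url)
        return "normal"
    # Step 2: iterate in reverse so later packages override earlier ones.
    for p in reversed(packages):
        if try_fetch_from_package(url, p):
            return "package"
    # Step 3: no package could serve the request.
    fetch_normally(url)
    return "normal"
```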

We apply this algorithm before we check the user agent's cache to see if it contains the requested resource. Thus a UA must not fetch a resource from its cache unless it cannot fetch that resource from any of the document's resource packages.

Of course, the UA is free to cache the resource package itself.

A user agent may cache copies of resources extracted from resource packages, but these cache entries must be kept separate from the UA's regular cache, and use of these cached copies must obey the semantics of the algorithm above.

In particular, if a resource package has expired from the UA's cache, the UA must not use cached copies of files extracted from that resource package to fulfill requests within a document.

In general, any resource which the user agent can retrieve over HTTP will satisfy the condition that the resource is identified by an absolute URL. It's not necessary that the resource be specified in the document markup with an absolute URL in order for the resource to meet the condition.

For example, a request to load the resource specified by

  <img src="foo.jpg">

would satisfy the condition because the absolute URL identifying "foo.jpg" can be computed by resolving "foo.jpg" relative to the document's base URL.

Try to fetch a URL from a package object

This algorithm explains how to attempt to satisfy a request for a resource identified by the absolute URL url from a package object p. The algorithm returns either "success" or "failure".

When this algorithm is run, p's resource package may not yet have been requested, may be partially downloaded, or may be fully downloaded.

If at any point in this algorithm the UA encounters an error while trying to extract a file from a resource package (e.g. an internal checksum fails), the algorithm returns "failure".

  1. If the UA has received the full HTTP header for p's resource package and that header either does not specify a content-type for the package or specifies a content-type which is not supported by the UA, return "failure".

    Conforming UAs must support resource packages contained in zip files (MIME type application/zip), but UAs may support other formats in addition to zip.

    The extension of the resource package's filename is immaterial here. If a UA receives a resource package whose name ends with ".zip" but which was delivered without a content-type header, the UA must ignore that package.

    Similarly, if a UA receives a resource package whose filename doesn't end with ".zip" but which was delivered with content-type application/zip, the UA must treat the resource package as though it were a ZIP archive.

    If we don't return "failure" here and proceed beyond this step, it's meaningful for us to talk about extracting parts of the resource package which have downloaded, since we either haven't finished receiving the HTTP header and thus have no data to extract, or the package has a format the UA supports.

  2. Let rel be the path of url within p's package href, as computed by the "get the path of a URL within a package" algorithm. If this fails, return "failure".

  3. Set content to p's list of contents.

  4. Decode each element of content according to the percent-encoding scheme defined in RFC 3986, considering the escape sequences to be encoding UTF-8 sequences. If this fails for any element in content, remove it from the set.

  5. If p's resource package has finished downloading to the point that the user agent can extract from the package a complete set of the files contained in the package, do the following:

    • If p's wildcard flag is true, set content to the set of files in the archive.

    • Otherwise, set content to the intersection of content and the set of files contained in the package.

  6. If p's wildcard flag is set to false and rel is not contained in content (according to a case-sensitive search of content), return "failure".

    A UA must interpret the directory names "." and ".." in rel literally, rather than as a reference to the current and parent directories.

    RFC 3986 specifies that the directories "." and ".." do refer to the current and parent directories for the purposes of resolving URLs. Since a UA resolves a URL before it fetches it, this means that a UA will never request a URL explicitly containing a directory named "." or "..". Thus a UA will never load from a resource package a file which lives under a directory named "." or ".." in the package.

  7. If the UA has not started to download p's resource package file, run the "obtaining a resource" algorithm for that file. The request made by that algorithm must be asynchronous even if the request in this algorithm is synchronous.

  8. If this algorithm's request is synchronous, wait for the resource package to download to the point that either we can extract the file corresponding to rel from the package or we know that a file corresponding to rel cannot be contained in the package.

    If we can extract the file corresponding to rel, do so, and use that data to fulfill the request. Return "success".

    If a file corresponding to rel cannot be contained in the package, return "failure".

    A UA may begin fulfilling the request at any point after it has begun to receive data for rel.

    If the file format of the resource package supports it, a UA may incrementally extract rel from the resource package rather than wait for the entirety of rel to finish downloading before beginning to fulfill the request.

    Conversely, a UA may wait until the entirety of the resource package has completed before fulfilling any requests out of the resource package or use some other heuristic for determining when to begin fulfilling requests.

  9. If the request is not synchronous, do the following:

    • If p's resource package has finished downloading enough that the entry for rel is available, use that entry to fulfill the request and return "success".

      As in the previous step, a UA may begin fulfilling the request at any point after it has begun to receive data for rel.

    • If the package has finished downloading enough that we are sure that a file corresponding to rel will not be contained in the package, return "failure".

    • Otherwise, register a listener on the download of p's resource package which will call this algorithm again when more of the package has been received and return "success".
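
Steps 3 through 6 of the algorithm above (decoding the contents list and checking rel against it) can be sketched in Python as follows. The function names are hypothetical. The sketch uses urllib.parse.unquote with strict UTF-8 decoding, which rejects invalid UTF-8 but leaves malformed escapes such as "%zz" untouched, so it only approximates "if this fails".

```python
from typing import Iterable, Optional
from urllib.parse import unquote


def decode_contents(contents: Iterable[str]) -> list:
    """Step 4: percent-decode each element, dropping those that fail."""
    decoded = []
    for item in contents:
        try:
            decoded.append(unquote(item, encoding="utf-8", errors="strict"))
        except UnicodeDecodeError:
            pass  # escape sequences do not form valid UTF-8: drop the element
    return decoded


def passes_contents_check(
    rel: str,
    contents: Iterable[str],
    wildcard: bool,
    archive_names: Optional[Iterable[str]] = None,
) -> bool:
    """Return False exactly when step 6 would return "failure".

    archive_names is the complete file list of the package, or None while
    that list is not yet known (step 5 is then skipped).
    """
    content = set(decode_contents(contents))
    if archive_names is not None:
        names = set(archive_names)
        content = names if wildcard else content & names
    # Step 6: set membership on str values is a case-sensitive comparison.
    return wildcard or rel in content
```

A True result only means the request was not rejected at step 6; whether the file can actually be extracted is decided in steps 8 and 9.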

Get the path of a URL within a package

This algorithm takes two absolute URLs, url and pkg-url, and tries to return a suffix of url which, when resolved relative to pkg-url, yields the original value of url. If no such suffix exists, the algorithm fails.

This algorithm makes use of terminology defined in RFC 3986.

  1. If the scheme and authority portions of pkg-url and url are not equivalent, return failure.

    Note that a UA may apply heuristics as described in RFC 3986 to determine whether the scheme and authority are equivalent.

    For example, a UA may consider

      http://foo.com:80
    

    and

      http://FOO.com
    

    to be equivalent.

    A UA must not percent-decode URLs before testing them for equivalence, per RFC 3986.

    For example, the URLs

       http://foo.com/bar%20baz.html
    

    and

       http://foo.com/bar baz.html
    

    are not equivalent.

  2. Let pkg-path be the path portion of pkg-url, and let url-path be the path portion of url.

  3. Let pkg-dir be the prefix of pkg-path up to and including the last occurrence of "/" in pkg-path. If pkg-path does not contain the "/" character, pkg-dir is empty.

  4. If pkg-dir is not a prefix for url-path, return failure.

  5. Let suffix be the suffix of url-path which begins immediately after the end of pkg-dir within url-path.

  6. If suffix is non-empty, return suffix. Otherwise, return failure.
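
A minimal Python sketch of this algorithm follows, assuming urllib.parse.urlsplit for parsing. The function name path_within_package and the DEFAULT_PORTS table are hypothetical, the equivalence test is one possible heuristic (case-insensitive scheme and host, default ports filled in, userinfo ignored), and None stands for failure. Paths are compared without percent-decoding.

```python
from typing import Optional
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}  # assumed defaults for this sketch


def _origin(parts):
    # urlsplit lowercases the scheme, and .hostname is lowercased as well.
    return (parts.scheme, parts.hostname, parts.port or DEFAULT_PORTS.get(parts.scheme))


def path_within_package(url: str, pkg_url: str) -> Optional[str]:
    u, p = urlsplit(url), urlsplit(pkg_url)
    # Step 1: scheme and authority must be equivalent.
    if _origin(u) != _origin(p):
        return None
    # Steps 2-3: pkg-dir runs up to and including the last "/" of pkg-path;
    # rfind returns -1 when there is no "/", which makes pkg_dir empty.
    pkg_dir = p.path[: p.path.rfind("/") + 1]
    # Step 4: the package can only serve URLs beneath its own directory.
    if not u.path.startswith(pkg_dir):
        return None
    # Steps 5-6: an empty suffix counts as failure.
    suffix = u.path[len(pkg_dir):]
    return suffix or None
```

For the example document earlier in this specification, assuming it is served from http://example.com/, the URL http://example.com/static/img2.png has the path img2.png within http://example.com/static/pkg2.zip, while http://example.com/img3.png has no path within that package.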

Acknowledgments

Thanks to Aryeh Gregor, Alexander Limi, Caroline Schermer, Ilya Sherman, Jonas Sicking, Henri Sivonen, Maciej Stachowiak, Johnny Stenback, Philip Taylor, and Boris Zbarsky for their valuable feedback on this specification.