Recently I was parsing the ACM DL metadata in XML files. I thought parsing XML would be a very straightforward job, given that Python has long had sophisticated packages such as BeautifulSoup and lxml. But I still encountered some problems, and it took me quite a bit of time to figure out how to handle all of them. Here, I share some tricks I learned. They are not meant to be a complete list, but the solutions are general, so they can be used as starting points for future XML parsing jobs.
CDATA. CDATA is seen in the values of many XML fields. CDATA stands for Character Data. Strings inside a CDATA section are not parsed; in other words, they are kept exactly as they are, including markup. One example is <script> <![CDATA[ <message> Welcome to TutorialsPoint </message> ]]> </script>
Encoding. Encoding is a pain in text processing. The problem is that there is no way to know what encoding a text uses before opening and reading it (at least in Python). So we must sniff it: try to open and read the file with one encoding, and if that encoding is wrong, the program will usually throw an error, in which case we try another candidate encoding. The "file" command in Linux gives encoding information, so I know there are two encodings in the ACM DL XML files: ASCII and ISO-8859.
HTML entities, such as &auml; (ä). The only five built-in entities in XML are quot, amp, apos, lt, and gt. Any other entity should be defined in a DTD file to show what it means. For example, the DBLP.xml file comes with a DTD file. The ACM DL XML should have associated DTD files, proceedings.dtd and periodicals.dtd, but they are not in my dataset.
The following snippet of Python code solves all three problems above and gives me the correct parsing results.
import codecs
import logging

encodings = ['ISO-8859-1', 'ascii']
for e in encodings:
    try:
        fh = codecs.open(confc['xmlfile'], 'r', encoding=e)
        fh.read()  # reading (not just seeking) forces a decode attempt
    except UnicodeDecodeError:
        logging.debug('got unicode error with %s, trying a different encoding' % e)
    else:
        logging.debug('opening the file with encoding: %s' % e)
        break
f = codecs.open('xmlfile', encoding=e)
soup = BeautifulSoup(f.read(), 'html.parser')

Note that we use codecs.open() instead of the Python built-in open(). And we open the file twice: the first time only to check the encoding, and the second time the whole file is passed to a handle before it is parsed by BeautifulSoup. I found that BeautifulSoup handles XML parsing better than lxml, not just because it is easier to use but also because it lets you pick the parser. Note that I chose html.parser instead of the lxml parser, because the lxml parser is not able to parse all entries (for some unknown reason). This has been reported by other users on Stack Overflow.
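As a quick illustration of why html.parser also sidesteps the entity problem: unlike a strict XML parser, Python's html.parser knows the full HTML entity table, so entities that bare XML leaves undefined are resolved instead of causing errors. A minimal sketch (the tag name here is invented for illustration):

```python
from bs4 import BeautifulSoup

# html.parser resolves HTML entities such as &auml; that are not among
# the five built-in XML entities
soup = BeautifulSoup('<author>J&auml;rvelin</author>', 'html.parser')
text = soup.author.get_text()
print(text)  # Järvelin
```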
It is hard to believe that the time has come for me to write a wrap-up blog post about the adventure that was my Master's degree and the thesis that got me to this point.
If you follow this blog with any regularity you may remember two posts, written by myself, that were the genesis of
my thesis topic:
Bonus points if you can guess the general topic of the thesis from the titles of those two blog posts. However, it is OK if you cannot, as I will give an oh-so-brief TL;DR.
The replay problems with cnn.com were, sadly, your typical here-today-gone-tomorrow replay issues involving this little thing, that I have come to love, known as JavaScript. What we also found out, when replaying mementos of cnn.com from the major web archives, was that each web archive has its own unique and subtle variation of this thing called "replay".
The next post about the curious case of mendeley.com user pages (A State Of Replay) further confirmed that to us.
We found not only that there are variations in how web archives perform URL rewriting (URI-Rs → URI-Ms), but also that, depending on the replay scheme employed, web archives are modifying the JavaScript execution environment of the browser and the archived JavaScript code itself beyond URL rewriting!
As you can imagine, this left us asking a number of questions that led to the realization that web archiving lacks the terminology required to effectively describe the existing styles of replay and the modifications made to an archived web page and its embedded resources in order to facilitate replay.
Thus my thesis was born and is titled "To Relive The Web: A Framework For The Transformation And Archival Replay Of
Web Pages".
Since I am known around the WS-DL headquarters for my love of deep diving into the secrets of (securely) replaying JavaScript, I will keep the length of this blog post to a minimum.
The thesis can be broken down into three parts, namely Styles Of Replay, Memento Modifications, and Auto-Generating
Client-Side Rewriters.
For more detailed information about my thesis, I have embedded my defense slides below, and the full text of the thesis has been made available.
Styles Of Replay
The existing styles of replaying mementos from web archives are broken down into two distinct models, namely "Wayback" and "Non-Wayback", and each has its own distinct styles.
For the sake of simplicity and length of this blog post I will only (briefly) cover the replay styles
of the "Wayback" model.
Non-Sandboxing Replay
Non-sandboxing replay is the style of replay that does not separate the replayed memento from
the archive-controlled portion of replay, namely the banner.
This style of replay is considered the OG (original gangster) way of replaying mementos simply because it was, at the time, the only way to replay mementos, and it was introduced by the Internet Archive's Wayback Machine.
To both clarify and illustrate what we mean by "does not separate the replayed memento from the archive-controlled portion of replay", consider the image below displaying the HTML and frame tree for a memento of http://2016.makemepulse.com replayed from the Internet Archive on October 22, 2017.
As you can see from the image above, the archive's banner and the memento exist together on the same domain (web.archive.org). This implies that the replayed memento(s) can tamper with the banner (displayed during replay) and/or interfere with archive control over replay.
For non-malicious examples of mementos containing HTML tags that can both tamper with the banner and interfere with archive control over replay, skip to the Replay Preserving Modifications section of this post.
Now to address the recent claim that "memento(s) were hacked in the archive" and its correlation to non-sandboxing replay.
Additional discussion on this topic can be found in Dr. Michael
Nelson's
blog post covering the case of
blog.reidreport.com
and in his presentation for the National Forum on Ethics and Archiving the
Web
(slides,
trip report).
For a memento to be considered (actually) hacked, the web archive the memento is replayed (retrieved) from must have been compromised in a manner that requires the hack to be made within the data-stores of the archive and does not involve user-initiated preservation. However, user-initiated preservation can only tamper with a non-hacked memento when it is replayed from an archive.
The tampering occurs when an embedded resource, previously un-archived at the memento-datetime of the "hacked" memento, is archived from the future (a present datetime relative to the memento-datetime); it typically involves the usage of JavaScript.
Unlike non-sandboxing replay, the next style of Wayback replay, Sandboxed Replay,
directly addresses this issue and the issues of how to securely replay archived JavaScript.
PS. No signs of tampering, JavaScript based or otherwise, were present in the blog.reidreport.com mementos
from the Library of Congress. How do I know???
Read my thesis and/or look over my thesis defense slides; I cover in detail what is involved in mitigating JavaScript-based memento tampering and what that actually looks like.
Sandboxed Replay
Sandboxed replay is the style of replay that separates the replayed memento from the archive-controlled portion of the page through replay isolation. Replay isolation is the usage of an iframe to sandbox the replayed memento, served from a different domain, away from the archive-controlled portion of replay.
Because replay is split across two different domains (illustrated in the image seen below), one for the replay of the memento and one for the archive-controlled portion of replay (the banner), the memento cannot tamper with the archive's control over replay or the banner. This is due to the security restrictions, called the Same Origin Policy, that the browser places on web pages from different origins.
Web archives employing sandboxed replay typically also perform the memento modification style known as
Temporal Jailing.
This style of replay is currently employed by Webrecorder and
all web archives using Pywb
(open source, python implementation of the Wayback Machine).
For more information on the security issues involved in high-fidelity web archiving, see the talk entitled Thinking like a hacker: Security Considerations for High-Fidelity Web Archives given by Ilya Kreymer and Jack Cushman at WAC2017 (trip report), as well as Dr. David Rosenthal's commentary on the talk.
Memento Modifications
The modifications made by web archives to mementos in order to facilitate their replay can be broken down into three categories, the first of which is Archival Linkage.
Archival Linkage Modifications
Archival linkage modifications are made by the archive to a memento and
its embedded resources in order to serve (replay) them from the archive.
The archival linkage category of modifications is the most fundamental and necessary kind of modification made to mementos by web archives, simply because these modifications prevent the Zombie Apocalypse.
You are probably already familiar with this category of memento modifications, as it is more commonly referred to as URL rewriting (URI-R → URI-M).
<!-- pre rewritten -->
<link rel="stylesheet" href="/foreverTime.css">
<!-- post rewritten -->
<link rel="stylesheet" href="/20171007035807cs_/foreverTime.css">
URL rewriting (archival linkage modification) ensures that you can relive (replay) mementos, not from the live web, but from the archive. Hence the necessity and requirement for this kind of memento modification.
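As a sketch of what archival linkage rewriting amounts to, assuming a Wayback-style URI-M layout like the one in the example above (the function name and archive prefix are my own illustration, not any particular archive's implementation):

```python
# Hedged sketch: turn a URI-R into a URI-M by prefixing the archive's
# replay path and a 14-digit timestamp, with an optional modifier such
# as 'cs_' for stylesheets.
def rewrite_url(url, timestamp, modifier='',
                archive_prefix='https://web.archive.org/web/'):
    return '%s%s%s/%s' % (archive_prefix, timestamp, modifier, url)

print(rewrite_url('http://example.com/foreverTime.css', '20171007035807', 'cs_'))
# https://web.archive.org/web/20171007035807cs_/http://example.com/foreverTime.css
```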
However, it is becoming necessary to seemingly
damage
mementos in order to simply replay them.
Replay Preserving Modifications
Replay Preserving Modifications are modifications made by web archives to specific
HTML element and attribute pairs in order to negate their intended semantics.
To illustrate this, let us consider two examples, the first of which was introduced by our fearless leader
Dr. Michael Nelson
and is known as the zombie introducing
meta refresh tag shown
below.
As you may be familiar, the meta refresh tag will, after 35 seconds, refresh the page with "?zombie=666" appended to the original URL. When a page containing this dastardly tag is archived and replayed, the refresh plus the appending of "?zombie=666" to the URI-M causes the browser to navigate to a new URI-M that was never archived.
To overcome this, archives must arm themselves with the attribute-prefixing shotgun in order to negate the tag and attribute's effects.
A successful defense against the zombie invasion when using the attribute prefixing shotgun is shown
below.
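As an illustration only (the "_" prefix and the regex are my assumptions, not any specific archive's code), attribute prefixing can be sketched like this:

```python
import re

# Illustrative only: prefix the http-equiv attribute of meta refresh
# tags so the browser no longer recognizes the refresh semantics.
def prefix_meta_refresh(html):
    return re.sub(r'(<meta\s+)http-equiv=(["\']?)refresh\2',
                  r'\1_http-equiv=\2refresh\2',
                  html,
                  flags=re.IGNORECASE)

print(prefix_meta_refresh('<meta http-equiv="refresh" content="35; url=?zombie=666">'))
# <meta _http-equiv="refresh" content="35; url=?zombie=666">
```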
Now let me introduce to you a new, more insidious tag that introduces into replay not a zombie but rather a demon: the meta csp tag, shown below.
Naturally, web archives do not want web pages to be delivering their own Content-Security-Policies via
meta tag because the results are devastating, as shown by the YouTube video below.
Readers, have no fear, this issue is fixed!!!! I fixed the meta csp issue for Pywb and Webrecorder in pull request #274 submitted to Pywb. I also reported this to the Internet Archive, and they promptly got around to fixing it.
Temporal Jailing
The final category of modifications, known as Temporal Jailing, is the emulation of the JavaScript environment as it existed at the original memento-datetime through client-side rewriting. Temporal jailing ensures both the secure replay of JavaScript and that JavaScript cannot tamper with time (introduce zombies), by applying overrides to the JavaScript APIs provided by the browser in order to intercept un-rewritten URLs.
Yes there is more to it, a whole lot more, but because it involves replaying JavaScript and I am
attempting to keep this blog post reasonably short(ish),
I must force you to consult my thesis
or thesis defense slides
for more specific details.
However, for more information about the impact of JavaScript on archivability, and measuring the impact of missing resources, see Dr. Justin Brunelle's Ph.D. wrap-up blog post.
The technique for the secure replay of JavaScript known as temporal jailing is currently used by Webrecorder and
Pywb.
Auto-Generating Client-Side Rewriters
Have I mentioned yet just how much I love JavaScript??
If not, lemme give you a brief overview of how I auto-generated client-side rewriting libraries, created a new way to replay JavaScript (currently used in production by Webrecorder and Pywb), and increased the replay fidelity of the Internet Archive's Wayback Machine.
First up let me introduce to you Emu:
Easily
Maintained Client-Side
URL Rewriter (GitHub).
Emu allows any web archive to generate its own generic client-side rewriting library, one that conforms to the de facto standard implementation, Pywb's wombat.js, by supplying it the Web IDL definitions for the JavaScript APIs of the browser. Web IDL was created by the W3C to describe interfaces intended to be implemented in web browsers, to allow the behavior of common script objects in the web platform to be specified more readily, and to specify how interfaces described with Web IDL correspond to constructs within ECMAScript execution environments.
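As a toy illustration of how Web IDL definitions can drive a rewriter generator (real tools, Emu included, use a proper Web IDL parser rather than a regex, and this interface excerpt is abbreviated), the attributes needing overrides can be pulled from an IDL fragment:

```python
import re

# Abbreviated, illustrative Web IDL fragment for the Location interface
idl = """
interface Location {
  stringifier attribute USVString href;
  attribute USVString protocol;
  attribute USVString host;
};
"""

# Pull out each attribute name; these are the properties a client-side
# rewriter would need to override
attrs = re.findall(r'attribute\s+\w+\s+(\w+);', idl)
print(attrs)  # ['href', 'protocol', 'host']
```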
You may be wondering how I can guarantee that this tool will generate a client-side rewriter providing complete coverage of the JavaScript APIs of the browser, and that we can readily obtain these Web IDL definitions? My answer is simple: consider the following excerpt from the HTML specification:
This specification uses the term document to refer to any
use of HTML, ...,
as well as to fully-fledged interactive applications. The term is used
to refer both to Document objects and their descendant DOM trees, and
to serialized byte streams using the HTML syntax or the XML syntax,
depending on context ... User agents that support scripting must also be conforming implementations
of the IDL fragments in this specification, as described in the Web IDL
specification
Pretty cool, right? What is even cooler is that a good number of the major browsers/browser engines (Chromium, Firefox, and WebKit) generate and make publicly available Web IDL definitions representing the browser's/engine's conformity to the specification!
Next up a new way to replay JavaScript.
Remember the curious case of mendeley.com user pages (A State Of Replay) and how we found out that Archive-It, in addition to applying archival linkage modifications, was rewriting JavaScript code to substitute a new, foreign, archive-controlled version of the JavaScript APIs it was targeting. This is shown in the image below.
Archive-It rewriting embedded JavaScript from the memento for the curious case of mendeley.com user pages
Hmmmm, looks like Archive-It is rewriting only two of the four instances of the text string "location" in the example shown above. This JavaScript rewriting was targeting the Location interface, which controls the location of the browser.
Ok, so how well would Pywb/Webrecorder do in this situation?? From the image shown below, not as well, and maybe a tad bit worse...
Because the documentation site for React Router was bundling HTML inside of JavaScript containing the text string "location" (shown above), the rewrites were exposed in the documentation's HTML displayed to page viewers (second image above). In combination with how Archive-It is also rewriting archived JavaScript in a similar manner, I was like, this needs to be fixed.
And fix it I did. Let me introduce to you a brand new way of replaying archived JavaScript shown below.
The native JavaScript Proxy object allows an archive to perform runtime reflection on the proxied object. Simply put, it allows an archive to define custom or restricted behavior for the proxied object.
I have annotated the code snippet above with additional information about the particulars of how archives can use
the Proxy object.
By using the JavaScript Proxy object in combination with the setup shown below, web archives can guarantee the secure replay of archived JavaScript and do not have to perform the kind of rewriting shown above. Yay! Less archival modification of JavaScript!!
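JavaScript's Proxy has no exact Python counterpart, but the underlying idea, intercepting property access and substituting archive-controlled values, can be sketched in Python. This is a conceptual analogue only, not Pywb's implementation:

```python
# Conceptual analogue of a replay-time proxy: attribute lookups are
# intercepted, and archive-controlled overrides win over the real object.
class ArchiveProxy:
    def __init__(self, target, overrides):
        self._target = target
        self._overrides = overrides

    def __getattr__(self, name):
        if name in self._overrides:
            return self._overrides[name]    # archive-controlled behavior
        return getattr(self._target, name)  # fall through to the real object

class Location:
    href = 'http://example.com/'

live = Location()
proxied = ArchiveProxy(
    live, {'href': 'https://archive.example/20171007/http://example.com/'})
print(proxied.href)  # https://archive.example/20171007/http://example.com/
```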
This method of replaying archived JavaScript was merged into Pywb on August 4, 2017 (contributed by yours truly) and has been used in production by Webrecorder since August 21, 2017.
Now to tell you about how I increased the replay fidelity of the Internet Archive, and how you can too.
Ok, so I generated a client-side rewriter for the Internet Archive's Wayback Machine using the code that is now Emu, and crawled 577 Internet Archive mementos from the top 700 web pages found in the Alexa top 1 million web site list circa June 2017. The crawler I wrote for this can be found on GitHub.
By using the generated client-side rewriter, I was able to increase the cumulative number of requests made by the Internet Archive mementos by 32.8%, a 45,051-request increase (graph of this metric shown below).
Remember that each additional request corresponds to a resource that previously was unable to be
replayed from the Wayback Machine.
Hey look, I also decreased the number of requests blocked by the content security policy of the Wayback Machine by 87.5%, un-blocking 5,972 requests (graph of this metric shown below). Remember that each un-blocked request corresponds to a URI-R the Wayback Machine could not rewrite server-side, which requires the usage of client-side rewriting (Pywb and Webrecorder are using this technique already).
Now you must be thinking that this is impressive, to say the least, but how do I know these numbers were not faked or doctored in some way in order to give client-side rewriting the advantage???
Well, you know what they say: seeing is believing!!! The generated client-side rewriter used in the crawl that produced the numbers shown to you today is available as the Wayback++ Chrome and Firefox browser extension! Source code for it is on GitHub as well.
And oh look, a video demonstrating the increase in replay fidelity gained if the Internet Archive were to use client-side rewriting. Oh, I almost forgot to mention that at the 1:47 mark in the video I make mementos of cnn.com replayable again from the Internet Archive. Winning!!
Pretty good for just a master's thesis, wouldn't you agree? Now it's time for the obligatory list of all the things I have created in the process of this research and my time as a master's student:
Squidwarc: A high fidelity archival crawler that uses Chrome
or Chrome Headless
(blog post)
MS Thesis Crawler: The Chrome or Chrome Headless based
crawler
written for my thesis's evaluation to specifically crawl the Internet Archive
Wayback++: A Chrome
and
Firefox browser extension that
brings client-side rewriting to the Internet Archive's Wayback Machine
Emu: Easily
Maintained Client-Side
URL Rewriter. Generate a client-side rewriter from Web IDL
Memgator Bulk TimeMap Downloader:
Have you ever had a need to download 100 or 1 million TimeMaps using Memgator? With the caveat that it must be done
in a timely manner? If so then you are in luck because this project has you covered.
Grad School Python Utils:
A python 3 utility belt containing a collection of reusable python code to aid grad students in getting through
grad school
LaTeX Toolbox: Librarification of the setup and macros used in the LaTeX src of my master's thesis
User-Agent Lists: 12,296 User-Agent strings for
research
purposes only of course
nukeBloggerClickTrap.js: A
small,
embeddable, and vanilla JavaScript code that will remove the annoying blogger click trap element when previewing
your
blog post
What is next you may ask???
Well I am going to be taking a break before I start down the path known as a Ph.D.
Why???????
To become the senior backend developer for Webrecorder of course!
There is so, so much to be learned from actually getting my hands dirty facilitating high-fidelity web archiving that when I return, I will have a much better idea of what my research's focus should be.
If I have said this once, I have said this a million times.
When you use a web browser in the preservation process, there is no such thing as an un-archivable web page!
Long live high-fidelity web archiving!
The USA Gymnastics team shows significant growth during the years the Olympics are held.
Due to the design of Twitter's API, we have limited ability to collect historical data about a user's followers. The information for when one account started following another is unavailable. Tracking the popularity of an account and how it grows cannot be done without that information. Another pitfall is that when an account is deleted, Twitter does not provide data about the account after the deletion date. It is as if the account never existed. However, this information can be gathered from the Internet Archive. If the account is popular enough to have been archived, then a follower count for a specific date can be collected.
The previous method to determine followers over time was to plot the users, in the order the API returns them, against their join dates. This works on the assumption that the Twitter API returns followers in the order they started following the account being observed. The creation date of a follower is the lower bound for when they could have started following the account under observation. Its correctness depends on new accounts immediately following the account under observation to give an accurate lower bound. The order in which Twitter returns followers is subject to unannounced change, so it can't be depended on to work long-term. This method also will not show when an account starts losing followers, because the API only returns users still following the account. This tool instead accurately gathers and plots the follower count based on mementos, or archived web pages, collected from the Internet Archive to show growth rates, track deleted accounts, and help pinpoint when an account might have bought bots to increase follower numbers.
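The join-date method described above can be sketched as follows (a conceptual sketch; the names are mine, and it assumes the input is already in follow order):

```python
# Sketch of the older API-only method: treat each follower's account
# creation date as a lower bound for when they began following, and
# pair each date with the running follower count.
def follower_growth_lower_bound(join_dates_in_follow_order):
    # point i says: by join_dates[i], the account had at least i+1 followers
    return [(date, i + 1) for i, date in enumerate(join_dates_in_follow_order)]

points = follower_growth_lower_bound(['2015-03-01', '2015-06-12', '2016-01-20'])
print(points)  # [('2015-03-01', 1), ('2015-06-12', 2), ('2016-01-20', 3)]
```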
I improved on a Python script, created by Orkun Krand, that collects the followers for a specific Twitter username from the mementos found in the Internet Archive. The code can be found on GitHub. Through the historical pages kept in the Internet Archive, the number of followers can be observed for the specific date of each collected memento. The script collects the follower count by identifying the various CSS selectors associated with the follower count in most of the major layouts Twitter has implemented. If a Twitter page isn't popular enough to warrant being archived, or is too new, then no data can be collected for that user.
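Conceptually, the selector-based extraction works like the sketch below. The selector strings here are illustrative placeholders for different Twitter layouts, not the script's actual list:

```python
from bs4 import BeautifulSoup

# Hypothetical selectors for a few Twitter layouts; the real script
# supports more layouts and different selector strings.
FOLLOWER_SELECTORS = [
    'a[data-nav="followers"] span[data-count]',  # assumed newer layout
    '#follower_count',                           # assumed older layout
]

def extract_followers(html):
    soup = BeautifulSoup(html, 'html.parser')
    for sel in FOLLOWER_SELECTORS:
        el = soup.select_one(sel)
        if el is not None:
            count = el.get('data-count') or el.get_text()
            return int(count.replace(',', ''))
    return None  # layout not recognized; logged to the error .csv

html = '<a data-nav="followers"><span data-count="250741">250K</span></a>'
print(extract_followers(html))  # 250741
```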
This code is especially useful for investigating users that have been deleted from Twitter. The Russian troll account @Ten_GOP, impersonating the Tennessee GOP, was deleted once discovered. However, with the Internet Archive we can still study its growth rate while it was active and being archived.
In February 2018, there was an outcry as conservatives lost, mostly temporarily, thousands of followers due to Twitter suspending suspected bot accounts. This script enables investigating users who have lost followers, and for how long they lost them. It is important to note that the default of collecting one memento a month is not expected to have the granularity to view behaviors that typically happen over a small time frame. To correct that, the flag [-e], to collect all mementos for an account, should be used. The Republican political commentator @mitchellvii lost followers in two recorded incidents. In January 2017, from the 1st to the 4th, @mitchellvii lost 1,270 followers. In April 2017, from the 15th to the 17th, @mitchellvii lost 1,602 followers. Using only the Twitter API to collect follower growth would not reveal this phenomenon.
The program will create a folder named <twitter-username-without-@>. This folder will contain two .csv files. One, labeled <twitter-username-without-@>.csv, will contain the dates collected, the number of followers for that date, and the URL for that memento. The other, labeled <twitter-username-without-@>-Error.csv, will contain all the dates of mementos where the follower count was not collected and will list the reason why. All file and folder names are named after the Twitter username provided, after being cleaned to ensure system safety.
If the flag [-g] is used, then the script will create an image <twitter-username-without-@>-line.png of the data plotted on a line chart created by the follower_count_linechart.R script. An example of that graph is shown as the heading image for the user @USAGym, the official USA Olympic gymnastics team. The popularity of the page changes with the cycle of the Summer Olympics, evidenced by most of the follower growth occurring in 2012 and 2016.
Example Output:
./FollowerHist.py -g -p USAGym
USAGym
http://web.archive.org/web/timemap/link/http://twitter.com/USAGym
242 archive points found
20120509183245
24185
20120612190007
...
20171221040304
250242
20180111020613
250741
Not Pushing to Archive. Last Memento Within Current Month.
null device
1
cd usagym/; ls
usagym.csv usagym-Error.csv usagym-line.png
How it works:
$ ./FollowerHist.py --help
usage: FollowerHist.py [-h] [-g] [-p | -P] [-e] uname
Follower Count History. Given a Twitter username, collect follower counts from
the Internet Archive.
positional arguments:
uname Twitter username without @
optional arguments:
-h, --help show this help message and exit
-g Generate a graph with data points
-p Push to Internet Archive
-P Push to all archives available through ArchiveNow
-e Collect every memento, not just one per month
First, the timemap, the list of all mementos for that URI, is collected for http://twitter.com/username. Then, the script collects the dates from the timemap for each memento. Finally, it dereferences each memento and extracts the follower count if all the following apply:
A previously created .csv of the name the script would generate does not contain the date.
The memento is not in the same month as a previously collected memento, unless [-e] is used.
The page format can be interpreted to find the follower count.
The follower count number can be converted to an Arabic numeral.
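The datetime-extraction step can be sketched like this, assuming a link-format TimeMap whose URI-Ms follow the Wayback layout shown in the example output above:

```python
import re

# Pull each memento's 14-digit datetime out of its URI-M in the
# link-format TimeMap text.
def memento_datetimes(timemap_text):
    return re.findall(r'/web/(\d{14})/', timemap_text)

timemap = ('<http://web.archive.org/web/20120509183245/http://twitter.com/USAGym>; '
           'rel="memento"; datetime="Wed, 09 May 2012 18:32:45 GMT",\n'
           '<http://web.archive.org/web/20180111020613/http://twitter.com/USAGym>; '
           'rel="memento"; datetime="Thu, 11 Jan 2018 02:06:13 GMT"')
print(memento_datetimes(timemap))  # ['20120509183245', '20180111020613']
```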
A .csv is created, or appended to, to contain the date, number of followers, and memento URI for each collected data point.
An error .csv is created, or appended to, with the date, number of followers, and memento URI for each data point that was not collected. This file will contain repeats if the script is run repeatedly, because old entries are not deleted when new errors are written.
If the [-g] flag is used, a .png of the line chart will be created "<twitter-username-without-@>-line.png".
If the [-p] flag is used, the URI will be pushed to the Internet Archive to create a new memento if there is no current memento.
If the [-P] flag is used, the URI will be pushed to all archives available through archivenow to create new mementos if there is no current memento in Internet Archive.
If the [-e] flag is used, every memento will be collected instead of collecting just one per month.
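The default one-memento-per-month behavior can be sketched as a filter over the 14-digit memento datetimes (a sketch, not the script's actual code):

```python
# Keep a memento only if no earlier memento shares its YYYYMM prefix;
# the -e flag skips this filter and keeps every memento.
def filter_monthly(datetimes):
    seen_months, kept = set(), []
    for dt in datetimes:
        month = dt[:6]
        if month not in seen_months:
            seen_months.add(month)
            kept.append(dt)
    return kept

print(filter_monthly(['20170101120000', '20170104120000', '20170415000000']))
# ['20170101120000', '20170415000000']
```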
As a note for future use, if the Twitter layout undergoes another change, the code will need to be updated to continue successfully collecting data.
Special thanks to Orkun Krand, whose work I am continuing.