
Monday, April 17, 2017

2017-04-17: Personal Digital Archiving 2017

On March 29-30, 2017 I attended the Personal Digital Archiving Conference 2017 (#pda2017) held at Stanford University in sunny Palo Alto, California. Other members of the Web Science and Digital Libraries Research Group (WS-DL) had previously attended this conference (see their 2013, 2012, and 2011 trip reports), and from their rave reviews of previous years' conferences, I was looking forward to it. I also just happened to be presenting and demoing the Web Archiving Integration Layer (WAIL) there as an added bonus.

Day 1

Day one started off at 9am with Gary Wolf giving the first keynote on Quantified Self Archives. Quantified Self Archives comprise data generated from health monitoring tools such as the FitBit, or lifelogging data, which is used to gain insights into your own life through data visualization.
After the keynote was the first session, Research Horizons, moderated by WS-DL alumna Yasmina Anwar.
The first talk of this session was Whose Life Is It, Anyway? Photos, Algorithms, and Memory (Nancy Van House, UC Berkeley). In the talk, Van House spoke on the effects of "faceless" algorithms on images and how they can distort the memory of the images they are applied to in many personal archives. Van House also spoke about how machine learning techniques, when applied in aggregate to images without context, can have unintended consequences, especially when attempting to detect emotion. To demonstrate this, Van House showed a set of images tagged with the emotion Joy, one of which was a picture of an avatar from the online life simulator Second Life.

The second talk was Digital Workflow and Archiving in the Humanities and Social Sciences (Smiljana Antonijevic Ubois, Penn State University). Ubois spoke on the many ways scholars use non-traditional archives, such as Dropbox or photos taken with their smartphones, to preserve their work. One of the biggest points Ubois raised was that humanities and social sciences scholars still see the web as a resource rather than as home to a digital archive.

The third talk was Mementos Mori: Saving the Legacy of Older Performers (Joan Jeffri, Research Center for Arts & Culture/The Actors Fund). In the talk, Jeffri spoke on the efforts being made by the Performing Arts Legacy Project to document and preserve the works of artists. The project found that one in five living artists in New York had no documentation of their work, especially the older artists.
The final talk in the session was Exploring Personal Financial Information Management Among Young Adults (Robert Douglas Ferguson, McGill School of Information Studies). Ferguson spoke on the passive preservation practiced by young adults when managing their money, i.e., relying on the web portals and tools provided by financial services, and the need to consider long-term preservation of these materials.
Session two was Preserving & Serving PDA at Memory Institutions moderated by Glynn Edwards.
This session started off with Second-Generation Digital Archives: What We Learned from the Salman Rushdie Project (Dorothy Waugh and Elizabeth Russey Roke, Emory University). In 2010, Emory University announced the launch of the Salman Rushdie Digital Archives. This reading room kiosk offered researchers at the Manuscript, Archives, and Rare Book Library the opportunity to explore born-digital material from one of four of Rushdie’s personal computers through dual access systems. One of the biggest lessons Waugh noted was the need to document everything the software engineers do, as their work is just as ephemeral as the born-digital information they wished to preserve.
After Waugh was Composing an Archive: the personal digital archives of contemporary composers in New Zealand (Jessica Moran, National Library of New Zealand). In recent years the Library has acquired the digital archives of a number of prominent contemporary composers. Moran discussed the personal digital archiving practices of the composers, the composition of the archive, and the work of the digital archivists, in collaboration with curators, arrangement and description librarians, and audio-visual conservators, to collect, describe, and preserve this collection.
The final talk in session two was Learning from users of personal digital archives at the British Library (Rachel Foss, The British Library). Foss discussed the efforts made by the British Library to provide access to those of its digital collections that require emulation to be viewed. Foss also discussed how archiving professionals need to consider how to assist and educate researchers to make use of born-digital collections, which implies understanding more about how they want to interrogate these collections as a resource.

Lunch happened. Session 3, Teaching PDA, was moderated by Charles Ransom.
Journalism Archive Management (JAM): Preparing journalism students to manage their personal digital assets and diffuse JAM best practices into the media industry (Dorothy Carner & Edward McCain, University of Missouri). In collaboration with MU Libraries and the school’s Donald W. Reynolds Journalism Institute, a personal digital archive learning model was developed and deployed in order to prepare journalism-school students, faculty and staff for their ongoing information storage and access needs. The MU J-School has created a set of PDA best practices for journalists and branded it: Journalism Archive Management (JAM).
An archivist in the lab with a codebook: Using archival theory and “classic” detective skills to encourage reuse of personal data (Carly Dearborn, Purdue University Libraries). Dearborn designed a workshop, inspired by the Society of Georgia Archivists' personal digital archiving activities, to introduce attendees to archival concepts and techniques which can be applied to familiarize researchers with new data structures.
Session 4: Emergent Technologies & PDA 1 moderated by Nicholas Taylor
Cogifo Ergo Sum: GifCities & Personal Archives on the Web (Maria Praetzellis & Jefferson Bailey, Internet Archive). In the talk Praetzellis and Bailey spoke on GifCities, the GeoCities Animated GIF Search Engine created for the Internet Archive's 20th anniversary, which comprises over 4.6 million animated GIFs from the GeoCities web archive behind a search interface. Each GIF links back to the archived GeoCities web page on which it was originally embedded. The search engine offers a novel, flabbergasting window into what is likely one of the largest aggregations of publicly-accessible archival personal documentary collections. It also provokes a reassessment of how we conceptualize personal archives as being both from the web (as historical encapsulations) and of the web (as networked recontextualization).
Comparison of Aggregate Tools for Archiving Social Media (Melody Condron). In the talk Condron spoke about many tools that can make archiving social media easier: Frostbox, If This Then That, and digi.me. Of all the tools mentioned, If This Then That provided the easiest way for its users to push social media into archives such as the Internet Archive or Webrecorder.

Video games collectors and archivists: how might private archives influence archival practices (Adam Lefloic Lebel, University of Montreal)

Demonstrations:
There were two different demonstration sessions; the first was between sessions 4 and 5 and the second was at the end of the day after session 6.
The demo for the Web Archiving Integration Layer (WAIL) consisted of two videos and myself talking to those who stopped by about particular use cases of WAIL or answering any questions they had about it. The first video, viewable below, is a detailed feature walkthrough of WAIL, and the second showed WAIL in action.
Session 5: Emergent Technologies & PDA 2 moderated by Henry Lowood

CiteTool: Leveraging Software Collections for Historical Research (Eric Kaltman, UC Santa Cruz) Kaltman spoke about how the tool is currently being used in a historical exploration of the computer game DOOM as a way to compare conditions across versions and to save key locations for future historical work. Since the tool provides links to saved locations, it is also possible to share states amongst researchers in collaborative environments. The links also function as an executable citation in cases where an argument about a program’s functionality is under discussion and would benefit from first-hand execution.


Applying technology of Scientific Open Data to Personal Closed Data (Jean-Yves Le Meur, CERN) Le Meur explained how the methodology and technologies developed (partly at CERN) to preserve scientific data (like High Energy Physics data) could be re-used for personal restricted data. He first reviewed existing initiatives to collect and preserve personal data from individuals for the very long term, as well as a few examples of well-established collective memory portals. He then compared the solutions implemented for Open Data in HEP, looking at the guiding principles and underlying technologies. Finally, he drafted a proposal to foster a solid shared platform for closed Personal Data Archives on the model of Open Scientific Data Archives.


Personal Data and the Personal Archive (Chelsea Gunn, University of Pittsburgh) Gunn questioned whether data from quantified self and lifelogging applications forms part of our personal archives, or whether it constitutes a form of ephemera, useful for the purposes of tracking progress toward a goal but not of long-term interest.

Using Markdown for PDA Interoperability (Jay Datema, Stony Brook University). Datema noted that the only thing you can count on with born-digital projects is that you will have to migrate the content at some point. Having done digital library development for over a decade, he made the case for simple text, a problem with a proven solution. Markdown is an intermediate step between text and HTML, and if you're writing anything that requires an HTML link, its shortcuts are worth learning. Most web applications rely on the humble submit button: once text goes in, it becomes part of a database backend, and extracting it may require a set of database calls, parsing a SQL file, or hoping that someone wrote a module to let you download what you entered.
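As an aside, the appeal of Markdown here is that the same plain-text file stays readable on its own yet converts cleanly to HTML. A minimal sketch using the third-party Python markdown package (the note text and URL below are made up for illustration, not material from the talk):

```python
# Hedged illustration: rendering a Markdown note to HTML with the
# "markdown" package (pip install markdown). The note text and URL are
# hypothetical examples.
import markdown

note = """# Field notes, 2017-03-29

See the [PDA 2017 program](https://example.org/pda2017) for the schedule.
"""

html = markdown.markdown(note)
print(html)
# <h1>Field notes, 2017-03-29</h1>
# <p>See the <a href="https://example.org/pda2017">PDA 2017 program</a> for the schedule.</p>
```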

Session 6: PDA & The Arts moderated by Kate Tasker

From Virtual to Reality: Dissecting Jennifer Steinkamp’s Software-Based Installation (Shu-Wen Lin, New York University) Lin spoke about how time-based and digital art combines media and technology in ways that challenge traditional conservation practices and require dedicated care, drawing on her work with Steinkamp's animated installation Botanic, which was exhibited in Times Square Arts: Midnight Moment. Lin's talk focused on the internal structure of and relationships between the software used (Maya and After Effects), scripts, and final deliverables. Lin also spoke about providing a risk assessment that will enable museum professionals, as well as the artist herself, to identify the sustainability and compatibility of digital elements in order to build documentation that can collect and preserve the whole spectrum of digital objects related to the piece.

The PDAs of Others: Completeness, Confidentiality, and Creepiness in the Archives of Living Subjects (Glen Worthey, Stanford University) The title and inspiration for Worthey's presentation came from the 2006 German film Das Leben der Anderen, which dramatized the covert monitoring of East Germans. Although the biography was "authorized", Worthey spoke on how the process of gathering and documenting materials often reveals tensions between completeness and a respect for privacy; between on-the-record and off-the-record conversations; between the personal and the professional; and between the probing of important questions and voyeuristic-seeming observation of the subject's complex inner life.

RuschaView 2.0 (Stace Maples, Stanford University) In 1964, LA painter Ed Ruscha put a Nikon camera in the back of his truck, drove up and down the Sunset Strip, and shot what would become a continuous panorama, "Every Building on the Sunset Strip" (1966). Maples's talk highlighted both Ruscha's multi-decade project and Maples's own multi-month attempt to create the metadata required to reproduce something like Ruscha's "Every Building..." publication in a digital context.

(Pete Schreiner, NCSU) Between 2003 and 2013 an associated group of independent rock bands from Bloomington, Indiana shared a tour van. When the owner, a librarian, was preparing to move across the country in 2014, Pete Schreiner, band member and proto-librarian, decided to preserve this esoteric collection of local music-related history. Subsequently, as time allowed, he created an online collection of the photographs using Omeka. The case study presented a guerrilla archiving project, the issues encountered throughout the process, and attempts to find the balance between professional archiving principles and getting it done.

Day 2

At the request of one or more presenters who did not want their slide material recorded or shown to anyone beyond the attendees, no photos were taken.
Session 7: Documenting Cultures & Communities moderated by Michael Olson
(Anna Trammell, University of Illinois) Trammell's talk discussed the experience gained from forming relationships and building trust with the student organizations at the University of Illinois, capturing and processing their digital content, and utilizing these records in instruction and outreach.

Online grieving and intimate archives: a cyberethnographic approach (Jennifer Douglas, University of British Columbia) Douglas presented a short paper discussing the archiving practices of communities of parents grieving stillborn children. In the paper, Douglas demonstrated how these communities function as aspirational archives, not only preserving the past, but creating a space in the world for their deceased children. Regarding the ethics of online research and archiving, Douglas' paper introduced the methodology of cyberethnography and explored its potential connections to the work of digital archivists.

(Barbara Jenkins, University of Oregon) In the talk Jenkins spoke on the development of an Afghanistan personal archives project which was created in 2012 and was able to expand its scope through a short sabbatical supported by the University of Oregon in 2016. The Afghanistan collection Jenkins was able to build combines over 4,000 slides, prints, negatives, letters, maps, oral histories, and primary documents.
Session 8: Narratives, Biases, PDA & Social Justice moderated by Kim Christen

Andrea Pritchett, co-founder of Berkeley Copwatch, Robin Margolis, UCLA MLIS in Media Archives, and Ina Kelleher presented a proposed design for a digital archive aggregating different sources of documentation toward the goal of tracking individual officers. Copwatch chapters operate from a framework of citizen documentation of the police as a practice of community-driven accountability and de-escalation.

Stacy Wood, PhD candidate in Information Studies at UCLA, discussed the ways in which personal records and citizen documentation are embedded within techno-socio-political infrastructural arrangements and how society can reframe these technologies as mechanisms and narratives of resistance.
Session 9: PDA and Memory moderated by Wendy Hagenmaier

Interconnectedness: personal memory-making on YouTube (Leisa Gibbons, Kent State University) Gibbons spoke about the use of YouTube as a personal memory-making space and research questions concerning what conceptual, practical and ethical role institutions of memory have in online participatory spaces and how personal use of online technologies can be preserved as evidence.

(Sudheendra Hangal & Abhilasha Kumar, Ashoka University) This talk was about Cognitive Experiments with Life-Logs (CELL) and how it is a scalable new approach to measure recall of personally familiar names using computerized text-based analysis of email archives. Regression analyses revealed that accuracy in familiar name recall declined with the age of the email, but increased with greater frequency of interaction with the person. Based on those findings, Hangal and Kumar believe that CELL can be applied as an ecologically valid web-based measure to study name retrieval using existing digital life-logs among large populations.

(Frances Corry, University of Southern California) Corry spoke about the screenshot feature built into most smartphones, tablets, and computers today, which enables users to “photograph” what rests on the surface of their screens. These “photographs”, or rather screenshots, were presented as a valuable tool worthy of further attention in digital archival contexts.
Session 10: Engaging Communities in PDA 1 moderated by Martin Gengenbach
Introducing a Mobile App for Uploading Family Treasures to Public Library Collections (Natalie Milbrodt, Queens Public Library) The Queens Public Library in New York City has developed a free mobile application for uploading scanned items, digital photos, oral history interviews and “wild sound” recordings of Queens neighborhoods for permanent safekeeping in the library’s archival collections. It allows families to add their personal histories to the larger historical narrative of their city and their country. The tool is part of the programmatic and technological offerings of the library’s Queens Memory program, whose mission is to capture contemporary history in Queens.

The Memory Lab (Russell Martin, District of Columbia Public Library) The Memory Lab at the District of Columbia Public Library is a do-it-yourself personal archiving space where members of the public can digitize outdated forms of media, such as VHS, VHS-C, mini DVs, audio cassettes, photos, slides, negatives, and floppy disks. Martin's presentation covered how the Memory Lab was developed by a fellow from the Library of Congress' National Digital Stewardship Residency, the budget for the lab, the equipment used and how it is put together, training for staff and the public, as well as success stories and lessons learned.

(Wendy Hagenmaier, Georgia Tech) Hagenmaier's presentation outlined the user research process the retroTECH team used to inform the design of its carts, offered an overview of the carts’ features and use cases, and reflected on where retroTECH’s personal digital archiving services are headed. retroTECH aims to inspire a cultural mindset that emphasizes the importance of personal archives, open access to digital heritage, and long-term thinking.

The Great Migration (Jasmyn Castro, Smithsonian NMAAHC) Castro presented the ongoing film preservation efforts at the Smithsonian for the African American community and how the museum invites visitors to bring their home movies into the museum to have them inspected and digitally scanned by NMAAHC staff.
Session 11: Engaging Communities in PDA 2 moderated by Mary Kidd
Citizen archive and extended MyData principles (Mikko Lampi, Mikkeli University of Applied Sciences) Lampi spoke about how Digitalia – Research Center on Digital Information Management – is developing a professional-quality digital archiving solution available to ordinary people. The Citizen archive relies on an open-source platform allowing users to manage their personal data and ensure access to it on a long-term basis. The MyData paradigm connects with personal archiving by managing coherent descriptive metadata and access rights, while also ensuring privacy and usefulness.

Born Digital 2016: Collecting for the Future (Sarah Slade, State Library Victoria) Slade presented Born Digital 2016: Collecting for the Future, a week-long national media and communications campaign to raise public awareness of digital archiving and preservation and why it matters to individuals, communities, and organizations. The campaign successfully engaged traditional television and print media, and online news outlets, to increase public awareness of what digital archiving and preservation is and why it is important.

Whose History? (Katrina Vandeven, MLIS Candidate, University of Denver) Vandeven discussed the macro appraisal and documenting intersectionality within the Women's March on Washington Archives Project, where it went wrong, possible solutions to documenting intersectionality in activism, and introduced the Documenting Denver Activism Archives Project.
Bringing Personal Digital Archiving 2017 to a close was Session 12: PDA Retrospect and Prospect, a panel moderated by Cathy Marshall.

Howard Besser, Clifford Lynch and Jeff Ubois discussed how early observers and practitioners of personal digital archiving will look back on the last decade, and forward to the next, covering changing social norms about what is saved, why, who can view it, and how; legal structures, intellectual property rights, and digital executorships; institutional practices, particularly in library and academic settings, but also in the form of new services to the public; market offerings from both established and emerging companies; and technological developments that will allow (or limit) the practice of personal archiving.
- John

Saturday, May 9, 2015

2015-05-09: IIPC General Assembly 2015 Trip Report


The day before the International Internet Preservation Consortium (IIPC) General Assembly 2015, we landed in San Francisco and some delicious Egyptian dishes were waiting for us. Thank you Ahmed, Yasmin, Moustafa, Adrian, and Yusuf for hosting us. It was a great way to spend the evening before the IIPC GA and we were delighted to see you all after a long time.

Day 1

We (Sawood Alam, Michael L. Nelson, and Herbert Van de Sompel) entered the conference hall a few minutes after the session had started, and Michael Keller from Stanford University Libraries was about to leave the stage after the welcome speech. IIPC Chair Paul Wagner gave brief opening remarks and invited the keynote speaker, Vinton Cerf from Google, on stage. The title of the talk was "Digital Vellum: Interacting with Digital Objects Over Centuries" and it was an informative and delightful talk. He mentioned that high-density, low-cost storage media are evolving, but the devices to read them might not last long. While mentioning Internet-connected picture frames and surfboards, he added that we should not forget about security. To emphasize the security aspect he gave an example: grandparents would love to see their grandchildren in those picture frames, but will not be very happy if they see something they do not expect.
Moving on to software emulators, he invited Mahadev Satyanarayanan from Carnegie Mellon University to talk about their software archive and emulator called the Olive Archive. Satya gave various live demos including the Great American History Machine, ChemCollective (a copy of the website frozen at a certain time), PowerPoint 4.0 running in Windows 3.1, and the Oregon Trail, all powered by their virtual machines and running in a web browser. He also talked about the architecture of the Olive Archive and how, in the future, multiple instances could be launched and orchestrated to emulate a subset of the Internet for applications that rely on external services, with some instances running those services independently.
In the QA session someone asked Cerf how to ask big companies like Google to provide the data about their Crisis Response efforts for archiving after they are done with it. Cerf responded, "You just did," while acknowledging the importance of such data for archiving.
After the break, Niels Brügger and Ditte Laursen presented their case study of the Danish websphere under the title "Studying a nation's websphere over time: analytical and methodological considerations". Their study covered website content, file types, file sizes, backgrounds, fonts, layout and, more importantly, domain names. They also raised points like the size of the ".dk" domain, geolocation, inter- and intra-domain link networks, and whether Danish websites are actually in the Danish language. They talked about some crawling challenges. Their domain name analysis shows that only 10% of owners own 50% of all ".dk" domains. I suspected that this result might be due to private domain name registrations, so I talked to them later; they said they had not thought about private registrations, but would revisit their analysis.
Andy Jackson from the British Library took the stage with his presentation titled "Ten years of the UK web archive: what have we saved?". This case study covers three collections: the Open Archive, the Legal Deposit Archive, and the JISC Historical Archive. These collections store over eight billion resources in over 160 TB of compressed files and are now adding about two billion resources per year. With the help of a nice graph he illustrated that not all ".uk" domains are interlinked, so to maximize coverage the crawlers need to include other popular TLDs such as ".com". He also presented an analysis of reference rot and content drift utilizing the "ssdeep" fuzzy hash algorithm. Their analysis shows that 50% of resources are unrecognizable or gone after one year, 60% after two years, and 65% after three years.
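To make the measurement concrete, here is a minimal sketch of how such a fuzzy-hash comparison can be done with the ssdeep Python bindings; this is my own illustration (with made-up file names), not Jackson's actual analysis code:

```python
import ssdeep  # pip install ssdeep

# Hypothetical captures of the same page taken ten years apart.
with open("page-2005.html", "rb") as f:
    capture_old = f.read()
with open("page-2015.html", "rb") as f:
    capture_new = f.read()

# ssdeep produces context-triggered piecewise hashes; compare() returns a
# 0-100 similarity score, where low scores indicate heavy content drift.
similarity = ssdeep.compare(ssdeep.hash(capture_old), ssdeep.hash(capture_new))
print(f"fuzzy-hash similarity: {similarity}")
```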
I had lunch together with Scott Fisher from the California Digital Library. I told him about the various digital library and archiving related research projects we are working on at Old Dominion University, and he described the holdings of his library and the challenges they have in upgrading their Wayback to bring Memento support.
After lunch, the keynote speaker of the second session, Cathy Marshall from Texas A&M University, took the stage with a very interesting title, "Should we archive Facebook? Why the users are wrong and the NSA is right". She motivated her talk with some interview-style dialogues around the primary question, "Do you archive Facebook?", to which the answer was mostly "No!". She highlighted that people have developed a [wrong] sense that Facebook is taking care of their stuff, so they do not have to. She also noted that people usually do not value their Facebook content, or they think it has immediate value but no archival value. In a large survey she asked whether Facebook should be archived; three fourths objected and half of them said "No" unconditionally. In the later part of her talk, she built the story of the marriage of Hal Keeler and Joan Vollmer by stitching together various cuttings from local newspapers. I am not sure I could fully appreciate the story due to the cultural difference, but I laughed when everyone else did, and I did follow her effort and intention to highlight the need to archive social media for future historians. And if someone asks me whether the NSA is right, my answer would be, "Yes, if they do it correctly with all the context included."
Meghan Dougherty from Loyola University Chicago and Annette Markham from Aarhus University presented their talk "Generating granular evidence of lived experience with the Web: archiving everyday digitally lived life". They illustrated how, sometimes intentionally and sometimes unintentionally, people record moments of their lives with different media. Among the various visual illustrations, I particularly liked the video of a street artist playing with a ring that was posted on Facebook in a very different context than the one it appeared in on YouTube. They ended their talk with a hilarious video of Friendster.
Susan Aasman from the University of Groningen presented her talk "Everyday saving practices: "small data" and digital heritage strategies". The talk was full of motivation for why people should care about personal archives of their daily-life moments. She described how the service Kodak Gallery launched in 2001 with the tag-line "live forever" and closed in 2012 after transferring billions of images to Shutterfly, which was only available to US customers. As a result, people from other countries lost their photo memories. She also played the Bye Bye Super 8 video by Johan Kramer, which was amusing and motivating for personal archiving.
After a short break, Jane Winters from the Institute of Historical Research, Helen Hockx-Yu from the British Library, and Josh Cowls from the Oxford Internet Institute took the stage with their topic "Big UK domain data for Arts and Humanities", also known as the BUDDAH project. Jane highlighted the value of archives for research and described the development of a framework to help researchers leverage the archives. She illustrated the big data analysis interface of the BUDDAH project, described the planned outputs, and presented various case studies showing what can be done with that data.
Helen Hockx-Yu began her talk "Co-developing access to the UK Web Archive" with reference to the earlier talk by Andy. She noted that a scenario that fits everyone's needs is difficult. She described the high-level requirements, including query building, corpus formation, annotation and curation, and in-corpus and whole-dataset analysis. She illustrated the SHINE interface that provides features like full-text search, multi-facet filters, query history, and result export.
Finally, Josh Cowls presented his talk about the book "The Web as History: Using Web Archives to Understand the Past and the Present", to which he contributed a chapter. He talked about four second-level domains under the ".uk" TLD, including ".co.uk", ".org.uk", ".ac.uk", and ".gov.uk", and how they are interlinked. He described the growth of the web presence of the BBC and British universities.
IIPC Chair Paul Wagner concluded the day by emphasizing that we have only started scratching the surface. He also noted in his concluding remarks that the context matters.

Day 2

Herbert Van de Sompel from Los Alamos National Laboratory started the second day's sessions by talking about "Memento Time Travel". He started with a brief introduction to Memento, followed by a bag full of announcements. For ease of use in JavaScript clients, Memento now supports JSON responses along with the traditional Link format. The Memento aggregator now provides responses in two modes: DIY (Do It Yourself) and WDI (We Do It). The service now also allows exporting the Time Travel Archive Registry in a structured format. Due to the default Memento support in Open Wayback, various web archives now natively support Memento. There is also an extension available to enable Memento support in MediaWiki. Herbert described Robust Links (Hiberlink) and how it can be used to avoid reference rot. He said that their service usage is growing, hence they upgraded the infrastructure and are now using the Amazon cloud for hosting services. He noted that going forward everyone will be able to participate by running Memento service instances in a distributed manner to provision load-balancing. He also demonstrated Ilya's work on constructing composite mementos from various sources to minimize temporal inconsistencies while visualizing the sources of the mementos.
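For a sense of what the JSON support looks like from a client's point of view, here is a small sketch that queries the public Time Travel service for the memento closest to a requested datetime; the endpoint shape and field names are assumptions based on the service's public API documentation, not code shown in the talk:

```python
import json
import urllib.request

uri = "http://www.slac.stanford.edu/"     # page to look up
datetime14 = "19970101000000"             # desired datetime, YYYYMMDDhhmmss

# Endpoint shape assumed from the public Time Travel API documentation.
api = f"http://timetravel.mementoweb.org/api/json/{datetime14}/{uri}"
with urllib.request.urlopen(api) as resp:
    data = json.load(resp)

# "closest" describes the memento nearest to the requested datetime across
# the aggregated archives (field names assumed from the API docs).
print(data["mementos"]["closest"]["uri"])
```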
Daniel Gomes from the Portuguese Web Archive talked about "Web Archive Information Retrieval". He started by classifying web archive information needs into three categories: Navigational, Informational, and Transactional. He noted that the usual way of accessing an archive is URL search, which might not be known to the users. An alternate method is full-text search, which poses the challenge of relevance. Daniel described various relevance models in great detail and how to select features to maximize relevance. He announced that all the datasets and code are available for free under an open source license. The code is hosted on Google Code, but due to the announced sunsetting of that service, the code will be migrated to GitHub soon.
After this talk there was a short break, followed by the announcement that the remaining sessions of the day would have two parallel tracks. It was a hard decision to choose one track or the other, but I can watch the missed sessions later when the video recordings are made available. Later, the parallel sessions were interfering with each other, so the microphone was turned off.
After the break Ilya Kreymer gave a live demo of his recent work "Web Archiving for all: Building WebRecorder.io". He acknowledged the collaboration with Rhizome and announced the availability of an invite-only beta implementation of WebRecorder. He demonstrated how WebRecorder can be used to perform personal archiving in a What You See Is What You Archive (WYSIWYA) mode.
Zhiwu Xie from Virginia Tech presented "Archiving transactions towards an uninterruptible web service". He described an indirection layer between the web application server and the client that archives each successful response and, when the server returns a 4xx/5xx failure response, serves the most recent copy of the resource from the transactional archive. From the clients' perspective it is similar in functionality to services like CloudFlare, but it has the added advantage of building a transactional archive for website owners. Zhiwu demonstrated the implementation by reloading two web pages multiple times, one of which was utilizing the UWS while the other was directly connected to the web application server, which was returning the current timestamp with random failures. He mentioned that the system is not ready for prime time yet.
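To illustrate the indirection idea (and only the idea), here is a toy WSGI sketch that forwards requests to a hypothetical origin server, archives successful responses in memory, and falls back to the last good copy on failure; Xie's actual system is of course more elaborate:

```python
# A toy sketch of a transactional-archive indirection layer; the origin URL
# and storage strategy are hypothetical placeholders.
import urllib.request
from urllib.error import HTTPError, URLError
from wsgiref.simple_server import make_server

ORIGIN = "http://localhost:8080"   # hypothetical backend application server
archive = {}                       # path -> most recent successful body

def app(environ, start_response):
    path = environ["PATH_INFO"]
    try:
        with urllib.request.urlopen(ORIGIN + path) as resp:
            body = resp.read()
        archive[path] = body       # archive the good copy transactionally
        start_response("200 OK", [("Content-Type", "text/html")])
        return [body]
    except (HTTPError, URLError):
        if path in archive:        # origin failed: serve the last good copy
            start_response("200 OK", [("Content-Type", "text/html"),
                                      ("X-Served-From", "transactional-archive")])
            return [archive[path]]
        start_response("502 Bad Gateway", [("Content-Type", "text/plain")])
        return [b"origin unavailable and no archived copy"]

if __name__ == "__main__":
    make_server("", 8000, app).serve_forever()
```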
During the lunch break I was with Andy, Kristinn, and Roger where we had free style conversation on advanced crawlers, CDX indexer memory error issues, the possibility of implementing CDX indexer in Go, separating data and view layers in Wayback for easy customization, some YouTube videos such as "Is Your Red The Same as My Red?", hilarious "If Google was a Guy", Ted talks such as "Can we create new senses for humans?", "Evacuated Tube Transport Technologies (ET3)", and the possible weather of Iceland around the time IIPC GA 2016 is scheduled.
Jefferson Bailey presented his talk on "Web Archives as research datasets". With various examples and illustrations from Archive-It collections he established the point that web archives are great sources of data for a variety of research. He noted that WAT is a compact and easily parsable metadata file format that is about 18% of the size of the corresponding WARC data files.
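As a rough illustration of why WAT files are easy to work with, here is a short sketch that iterates a WAT file with the warcio library and prints the target URI of each metadata record; the file name is hypothetical and the envelope field names follow the commonly documented WAT layout:

```python
import json
from warcio.archiveiterator import ArchiveIterator  # pip install warcio

with open("example.warc.wat.gz", "rb") as stream:    # hypothetical WAT file
    for record in ArchiveIterator(stream):
        if record.rec_type != "metadata":
            continue
        envelope = json.loads(record.content_stream().read())
        # WAT wraps per-record metadata in a JSON "Envelope"; pull the URI
        # of the capture this metadata record describes.
        header = envelope.get("Envelope", {}).get("WARC-Header-Metadata", {})
        print(header.get("WARC-Target-URI"))
```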
Ian Milligan from the University of Waterloo presented his talk on "WARCs, WATs, and wgets: Opportunity and Challenge for a Historian Amongst Three Types of Web Archives". He described the importance of web archives and why historians should use them. His talk was primarily based on three case studies: the Wide Web Scrape, the GeoCities End-of-Life Torrent, and the Archive-It longitudinal collection Canadian Political Parties & Labour Organizations. I enjoyed his style of storytelling, some mesmerizing visualizations, and in particular the GeoCities case study. He noted that the GeoCities data was not in the form of WARC files; instead it was a regular Wget crawl.
After a short break, Ahmed AlSum from the Stanford University Library (and a WS-DL alumnus) presented his work on "Restoring the oldest U.S. website". He described how he turned yearly backup files of the SLAC website from 1992 to 1999 into WARC and CDX files with the help of Wget and some manual changes, to mimic the effect as if the site had been captured in those early days. These transforms were necessary to allow the modern Open Wayback system to correctly replay it. Ahmed briefly handed the microphone over to Joan Winters, who was responsible for taking backups of the website in the early days, and she described how they did it. Ahmed also mentioned that the Wayback codebase had 1996 hardcoded as the earliest year, which was fixed by making it configurable.
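As a rough sketch of the general idea (not AlSum's actual pipeline), one could wrap a static backup file into a back-dated WARC response record with the warcio library so that a Wayback-style replay tool treats it as an early capture; the file paths below are hypothetical:

```python
from warcio.warcwriter import WARCWriter                # pip install warcio
from warcio.statusandheaders import StatusAndHeaders

uri = "http://www.slac.stanford.edu/"                   # SLAC home page
# Hypothetical paths: a file from an old yearly backup and the output WARC.
with open("slac-1992/index.html", "rb") as payload, \
     open("slac-1992.warc.gz", "wb") as out:
    writer = WARCWriter(out, gzip=True)
    http_headers = StatusAndHeaders(
        "200 OK", [("Content-Type", "text/html")], protocol="HTTP/1.0")
    record = writer.create_warc_record(
        uri, "response", payload=payload, http_headers=http_headers,
        # Back-date the capture so replay tools index it under 1992.
        warc_headers_dict={"WARC-Date": "1992-12-01T00:00:00Z"})
    writer.write_record(record)
```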
As an afterthought, I would love to see this effort combined with Satya's Olive Archive so that everything from the server stack to the browser experience can be replicated as close to the original environment as possible.
Federico Nanni from the University of Bologna presented "Reconstructing a lost website". Looking at the schedule, my first impression was that it was going to be a talk about tools to restore any lost website and reconstruct all the pages and links with the help of archives. I was wondering if they were aware of Warrick, a tool developed at Old Dominion University with this very objective. But it turned out to be a case study of the world's oldest university, established around 1088. One of the many challenges in reconstructing the university website, he mentioned, was the exclusion of the site from the Wayback Machine for unknown reasons, which they tried to resolve together with the Internet Archive. Amusingly, one of the many sources of collected snapshots was a clone of the site prepared by student protesters.
The last speaker of the second day, Michael L. Nelson from Old Dominion University, presented the work of his student Scott G. Ainsworth, "Evaluating the temporal coherence of archived pages". With an example from the Weather Underground site he demonstrated how unrealistic pages can be constructed by archives due to temporal violations. He acknowledged that among the various categories of temporal violations, there are at least 5% of cases where there exists a provable temporal violation. He also noted that a temporal violation is not always a concern.

Day 3

The third day's sessions were in the Internet Archive building in San Francisco instead of the usual Li Ka Shing Center at Stanford University in Palo Alto. A couple of buses transported us to the IA, and we enjoyed the trip through the valley as the weather was very good. The IA staff were very humble and welcoming. The emulator of classic games installed in the lobby of the IA turned out to be the prime center of attraction. We came to know some interesting facts about the IA, such as that the building was a church acquired because of its similarity to the IA logo, and that the pillows in the hall were contributed by various websites, with their domain names and logos printed on them.
Sessions before lunch were mainly related to consortium management and logistics; these included Welcome to the Internet Archive by Brewster Kahle, a Chair address by Paul Wagner, a Communications report by Jason Webber, a Treasurer's report by Peter Stirling, and Consortium renewal by the chair, followed by break-out discussions to gather ideas and opinions from the IIPC members on various topics. Also, the date and venue for the next general assembly were announced: April 11, 2016 in Reykjavik, Iceland.
After the lunch break, your author, Sawood Alam from Old Dominion University, presented the progress report on the "Profiling web archives" project, funded by the IIPC. With the help of some examples and scenarios he established the point that the long tail of archives matters. He acknowledged the growing number of Memento-compliant archives and the growing use of the Memento aggregator service. In order for the Memento aggregator to perform efficiently, it needs query routing support apart from caching, which only helps when requests are repeated before the cache expires. He then acknowledged two earlier profiling efforts, one being a complete-knowledge profile by Sanderson and the other a minimalistic TLD-only profile by AlSum. He described the limitations of the two profiles and explored the middle ground for various other possibilities. He evaluated his findings and concluded that his work so far gained up to 22% routing precision at less than 5% of the cost relative to the complete-knowledge profile, without any false negatives. Sawood also announced the availability of the code to generate profiles and benchmark them in a GitHub repository. In a later wrap-up session the chair Paul Wagner referred to Sawood's motivation slide in his own words: "sometimes good enough is not good enough."
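To illustrate what profile-based query routing buys an aggregator, here is a toy sketch with made-up profiles; the real profiles explore much richer keys between the TLD-only and complete-knowledge extremes:

```python
# Toy illustration of routing a lookup only to archives whose (hypothetical)
# profiles suggest they might hold the URI, instead of broadcasting to all.
from urllib.parse import urlsplit

profiles = {
    "archive-a": {"tld": {"uk"}, "domains": {"bbc.co.uk", "parliament.uk"}},
    "archive-b": {"tld": {"com", "org"}, "domains": {"example.org"}},
}

def route(uri):
    host = urlsplit(uri).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    candidates = []
    for archive, profile in profiles.items():
        # Route if a known domain suffix or the TLD appears in the profile.
        if any(host.endswith(d) for d in profile["domains"]) or tld in profile["tld"]:
            candidates.append(archive)
    return candidates

print(route("http://news.bbc.co.uk/"))   # ['archive-a']
```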
In the break various IA staff members gave us a tour of the IA facility, including the book scanners, the television archive, an ATM, storage racks, and the music and video archive where they convert data from old recording media such as vinyl discs and cassettes.
After the break the historian and writer Abby Smith Rumsey talked about "The Future of Memory in the Digital Age". Her talk was full of insightful and quotable statements. I will quote one of my favorites and leave the rest in the form of tweets. She said, "ask not what we can afford to save; ask what we can afford to lose".
Finally the founder of the Internet Archive, Brewster Kahle, took the stage and talked about digital archiving and the role of the IA in the form of various initiatives, including the book archive, music archive, and TV archive to name a few. He described the zero-sum book lending model utilized by the Open Library for books that are not free for unlimited distribution. He invited all the archivists to create a common collective distributed library where people can share resources such as computing power, storage, man power, expertise, and connections. During the QA session I asked whether, when he thinks about collaboration, he envisions a model similar to inter-library loan, where peer libraries refer to other places in the form of external links if they don't have the resources but others do, or in contrast one where they copy each other's resources. He responded, "both."
The chair gave a wrap-up talk and formally ended the third day's session. The buses still had some time before they left, so people were engaged in conversation, games, and photographs while enjoying drinks and food. I particularly enjoyed a local ice cream named "It's-It" recommended by an IA staff member. Lori Donovan from the Internet Archive approached me and Mohamed Farag and initiated a good conversation about possible collaboration on archiving projects. We also talked about a project that the WS-DL group at Old Dominion University was working on a few years ago to identify disaster-related news and archive it. Our conversation ended with a group selfie of the three of us.

Day 4

On the fourth day Sara Aubry presented her talk on "Harvesting Digital Newspapers Behind Paywalls" in Berge Hall A, where the Harvesting Working Group was gathered, while IIPC's communication strategy session was going on in Hall B. She discussed her experience of working with news publishers to make their content more crawler friendly. Some of the crawling and replay challenges included paywalls requiring authentication to grant access to the content and the inclusion of a daily changing date string in the seed URIs. They modified the Wayback to fulfill their needs, but the modifications have not been committed back to the upstream repository. She said that if they are useful to the community, the changes can be pushed out to the main repository.
Roger Coram presented his talk on "Supplementing Crawls with PhantomJS". I found his talk quite relevant to my colleague Justin Brunelle's work. This is a necessary step to improve the quality of crawls, especially as sites become more interactive with extensive use of JavaScript. For some pages, he uses CSS selectors and takes screenshots to later complement the rendering.
Kristinn Sigurðsson engaged everyone in a discussion about the "Future of Heritrix". He started with the question, "is Heritrix dead?", and I said to myself, "can we afford this?". This ignited the talk about what can be done to increase the activity on its development. I asked what is slowing down the development of Heritrix: is it out of ideas and new feature requests, or are there not enough contributors to continue the development? There was no clear answer to this question, but it helped continue the discussion. I also suggested that if new developers are afraid of making changes that would break the system and discourage upgrades, then we could introduce a plug-in architecture where new features can be added as optional add-ons.
Helen Hockx-Yu took the microphone and talked about Open Wayback development. She gave a brief introduction to the development workflow and periodic telecons. She also talked about the short- and long-term development goals, including better customization and internationalization support, displaying more metadata, ways to minimize live leaks, and acknowledging/visualizing temporal coherence.
After a short break Tom Cramer gave his talk on "APIs and Collaborative Software Development for Digital Libraries". He formally categorized the software development models into five categories. He suggested that the IIPC take the position of unifying a high-level API for each category of archiving tools so that they can interoperate interchangeably. This was very appealing to me because I was thinking along the same lines and had done some architectural design of an orchestration system that achieves the same goal via a layer of indirection.
Daniel Vargas from LOCKSS presented his talk on "Streamlining deployment of web archiving tools" and demonstrated the use of Docker containers for deployment. He also demonstrated the use of plain WARC files on a regular file system and in HDFS with Hadoop clusters. I was glad to see someone else deploying the Wayback Machine in containers, as I was pushing some changes to the Open Wayback repository that will make containerization of Wayback easier.
During the lunch break Hunter Stern from the IA approached me and told me about the Umbra project to supplement the crawling of JS-rich pages. Kristinn, I, and a few more people talked about the precision of time in HTTP/2.0, but no one was sure whether it had been changed from one-second granularity to anything smaller, such as a millisecond or microsecond. Later I asked this question on the IETF HTTP WG mailing list and the responses suggest that no change was made to it. After lunch there was a short open mic session where every speaker got four minutes to introduce exciting stuff they are working on. Unfortunately, due to the shortage of time, I could not participate in it.
After the lunch break the Access Working Group gathered to talk about "Data mining and WAT files: format, tools and use cases". Peter Stirling, Sara Aubry, Vinay Goel, and Andy Jackson gave talks on "Using WAT at the BnF to map the First World War", "The WAT format and tools for creating WAT files", and "Use cases at Internet Archive and the British Library". Vinay had some really neat and interactive visualizations based on WAT files. I talked to Vinay during the break and we had some interesting ideas to work on, such as building a content store indexed by hashes while using WAT files in conjunction for replay, and a WebSocket-based BOINC implementation in JavaScript to perform Hadoop-style distributed research operations on IA data on users' machines.
After a short break the Access Working Group talked about "Full-text search for web archives and Solr". Anshum Gupta, Andy Jackson, and Alex Thurman presented "Apache Solr: 5.0 and beyond", "Full-text search for web archives at the British Library", and "Solr-based full-text search in Columbia's Human Rights Web Archive" respectively. Anshum's talk was on the technical aspects of Solr while the other two talks were more toward case studies.

Day 5

On the last day of the conference the Collection Development and Preservation Working Groups discussed their current state and plans in separate parallel tracks. Before the break I attended the Collection Development Working Group. They demonstrated Archive-It account functionality. I expressed the need for a web-based API to interact with the Archive-It service. I gave the example of a project I was working on a few years ago, in which a feed reader periodically read news feeds and sent articles to a disaster classifier that Yasmin AlNoamany and I (Sawood Alam) built. If the classifier classified a news article in the disaster category, we wanted to archive that page immediately. Unfortunately, Archive-It did not provide a way to do that programmatically (unless we used page scraping or a headless browser), so we ended up using the WebCite service for that.
After the break I moved to the Preservation Working Group track where I had a talk scheduled. David S. H. Rosenthal presented his talk on "LOCKSS: Collaborative Distributed Web Archiving For Libraries". He described how LOCKSS works and how it has benefited the publishing industry. He described how Crawljax is used in LOCKSS to capture content that is loaded via Ajax. He also noted that most publishing sites try not to rely on Ajax, and if they do, they provide some other means to crawl their content in order to maintain their search engine ranking.
Sawood Alam (me) happened to be the last presenter of the conference, presenting a talk on "Archive Profile Serialization". This talk was a continuation of his earlier talk at the IA. He described what should be kept in profiles and how they should be organized. He also talked briefly about the implications of each data organization strategy. Finally he talked about the file format to be used and how it can affect the usefulness of the profiles. He noted that single-root file formats like XML, JSON, and YAML are not suitable for profiles, and he proposed an alternative format that is a fusion of the CDX and JSON formats. Kristinn provided his feedback that it seems like the right approach to serializing such data, but he strongly suggested naming the file format something other than CDXJSON.
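To give a flavor of the proposed fusion (my own illustration of the idea, not the exact format from the talk), each line could pair a sort-friendly CDX-style key with a JSON blob, so the file remains binary-searchable while carrying structured values; the keys and field names below are made up:

```python
import json

# Hypothetical profile lines: a SURT-style URI prefix followed by JSON stats.
profile_lines = """\
com,cnn)/ {"frequency": 4017, "spread": 3}
uk,co,bbc)/news {"frequency": 9620, "spread": 5}
"""

def parse_line(line):
    # The key contains no spaces, so split at the first space only.
    key, _, blob = line.partition(" ")
    return key, json.loads(blob)

for line in profile_lines.splitlines():
    key, stats = parse_line(line)
    print(key, stats["frequency"])
```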
While we were having lunch, the chair took the opportunity to wrap up the day and the conference. And now I would like to thank all the organizing team members, especially Jason Webber, Sabine Hartmann, Nicholas Taylor, and Ahmed AlSum, for organizing and making the event possible.
In the afternoon Ahmed AlSum took me to the Computer History Museum where Marc Weber gave us a tour. It was a great place to visit after such an intense week.

Missed Talks

Due to the parallel tracks I missed some sessions that I wanted to attend, such as "SoLoGlo - an archiving and analysis service" by Martin Klein, "Web archive content analysis" by Mohammed Farag, "Identifying national parts of the internet" by Eld Zierau, "Warcbase: Building a scalable platform on HBase and Hadoop" by Jimmy Lin, "WARCrefs for deduplicating web archives" by Youssef Eldakar, and "WARC Standard Revision Workshop" by Clément Oury, to name a few. I hope the video recordings will be available soon. Meanwhile I was following the related tweets.

Conclusions

IIPC GA 2015 was a fantastic event. I had a great time, met a lot of new people (and some of those whom I knew only on the Web), shared my ideas, and learned from others. It was one of the most amazing complete weeks I have ever had. I appreciate the efforts of everyone who made this possible, including organizers, presenters, and attendees.

Resources

Please let us know about links to various resources related to IIPC GA 2015 to include below.

Official

Aggregations

Blog Posts

Tools

Update (May 12, 2015): Added reference to HTTP/2.0 time resolution and some more blog posts.
Update (May 22, 2015): Added more blogs and tool references.
Update (June 1, 2015): Added link to the video recording playlist.
--
Sawood Alam