
Wednesday, July 5, 2017

2017-07-04: Web Archiving and Digital Libraries (WADL) Workshop Trip Report From JCDL2017


The Web Archiving and Digital Libraries (WADL) Workshop was held after JCDL 2017, from June 22 to June 23, 2017. I live-tweeted both days, and you can follow along with this blog post on Twitter using the hashtag #wadl2017 or via the notes/minutes of WADL2017. I also created a Twitter list of the speakers' and presenters' handles; give them a follow to keep up to date with their exciting work.

Day 1 (June 22)

WADL2017 kicked off at 2 pm with Martin Klein and Edward Fox welcoming us to the event by giving an overview and introduction to the presenters and panelists.

Keynote

The opening keynote of WADL2017 was National Digital Platform (NDP), Funding Opportunities, and Examples Of Currently Funded Projects by Ashley Sands (IMLS).
In the keynote, Sands spoke about the desired values for the national digital platform, the various grant categories and funding opportunities IMLS offers for archiving projects, and the grant submission procedure, as well as tips for writing IMLS grant proposals. Sands also shared what a successful (funded) proposal looks like and how to apply to become a reviewer of proposals!

Lightning Talks

First up in the lightning talks was Ross Spencer from the New Zealand Web Archive on "HTTPreserve: Auditing Document-Based Hyperlinks" (poster).

Spencer has created a tool, httpreserve, that checks the status of a URL on the live web and whether it has been archived by the Internet Archive; it is part of a larger suite of tools under the same name. You can try it out via httpreserve.info, and the project is open to contributions from the community as well!
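The core idea - check a URL on the live web, then ask the Internet Archive whether it holds a capture - can be sketched in Python against the Archive's Availability API. This is a minimal illustration of the concept, not httpreserve's actual implementation, and the helper names are my own:

```python
import json
import urllib.request
from urllib.parse import urlencode

API = "https://archive.org/wayback/available"

def availability_query(url):
    # Build the request URL for the Internet Archive Availability API
    return API + "?" + urlencode({"url": url})

def closest_snapshot(response):
    # `response` is the parsed JSON body returned by the API;
    # return the URL of the closest archived capture, or None
    snap = response.get("archived_snapshots", {}).get("closest")
    if snap and snap.get("available"):
        return snap["url"]
    return None

def check(url):
    # Fetch and parse the API response for a live-web URL
    with urllib.request.urlopen(availability_query(url)) as resp:
        return closest_snapshot(json.load(resp))
```

A fuller audit, like httpreserve's, would also record the live-web HTTP status of the URL itself alongside the archival status.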
The second talk was by Muhammad Umar Qasim on "WARC-Portal: A Tool for Exploring the Past". WARC-Portal seeks to provide researchers access to browse and search custom collections, and provides tools for analyzing those collections via Warcbase.
The third talk was by Sawood Alam on "The Impact of URI Canonicalization on Memento Count". Alam spoke about the ratio of representations vs. redirects obtained from dereferencing each archived capture. For a more detailed explanation, you can read our blog post or the full technical report.

The final talk was by Edward Fox on "Web Archiving Through In-Memory Page Cache". Fox spoke about Nearline vs. Transactional Web Archiving and the advantages of using a Redis cache.

Paper Sessions

First up in the paper sessions were Ian Milligan, Nick Ruest, and Ryan Deschamps on "Building a National Web Archiving Collaborative Platform: Web Archives for Longitudinal Knowledge Project".
The WALK project seeks to address the issue of "To use Canadian web archives you have to really want to use them, that is you need to be an expert" by "Bringing Canadian web archives into a centralised portal with access to derivative datasets".
Enter WALK: 61 collections, 16 TB of WARC files, and a newly developed Solr front end based on Project Blacklight (250 million records indexed so far). The WALK workflow consists of using Warcbase and a handful of other command-line tools to retrieve data from the Internet Archive, generate scholarly derivatives (visualizations, etc.) automatically, upload those derivatives to Dataverse, and ensure the derivatives are available to the research team.
To ensure that WALK can scale, the project will build on top of Blacklight and contribute the work back to the community as WARCLight.
The second paper presentation of WADL2017 was by Sawood Alam on "Avoiding Zombies in Archival Replay Using ServiceWorker". Alam spoke about how, through the use of ServiceWorkers, URIs that were missed during rewriting, or not rewritten at all due to the dynamic nature of the web, can be rerouted dynamically to hit the archive rather than the live web.
Avoiding Zombies in Archival Replay Using ServiceWorker from Sawood Alam
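ServiceWorkers are written in JavaScript, but the routing decision Alam described can be sketched in Python: requests already pointed at the archive pass through, while "zombie" requests that leaked to the live web are rerouted onto the replay prefix. The archive hostname and replay prefix below are hypothetical:

```python
from urllib.parse import urlsplit

# Hypothetical replay root of an archival replay system
ARCHIVE_ORIGIN = "archive.example.org"
REPLAY_PREFIX = "https://archive.example.org/web/20170623000000/"

def reroute(request_url):
    """Decide where a fetch should go, as a ServiceWorker's fetch
    handler would: pass through requests already aimed at the archive,
    and rewrite live-web ("zombie") requests onto the replay prefix."""
    if urlsplit(request_url).netloc == ARCHIVE_ORIGIN:
        return request_url              # already rewritten; serve from archive
    return REPLAY_PREFIX + request_url  # leaked to live web; reroute
```

In the real system this logic runs inside the browser's ServiceWorker, so even requests generated dynamically by JavaScript after replay begins get intercepted before they reach the live web.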

Ian Milligan was up next presenting "Topic Shifts Between Two US Presidential Administrations". One of the biggest questions Milligan noted during his talk was how to proceed with training a classifier when there is no annotated data to train it with. To address this issue, Milligan used bootstrapping to start off, via bag of words and keyword matching. He noted that this method works with noisy but reasonable data. The classifiers were trained to look for biases between administrations; Trump vs. Obama seems to work with dramatic differences, and the TL;DR is that the classifiers do learn the biases. For more detailed information about the paper, see Milligan's blog post about it.
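A bootstrap of this kind - provisional labels from bag-of-words keyword matching, later used to seed a classifier - might look roughly like the sketch below; the seed keywords are invented for illustration and are not from Milligan's study:

```python
def bootstrap_labels(texts, seed_keywords):
    """Assign a provisional topic label to each unlabeled text by
    counting overlaps with per-label seed keyword sets; texts that
    match no seeds stay unlabeled (None)."""
    labels = []
    for text in texts:
        tokens = set(text.lower().split())
        best_label, best_hits = None, 0
        for label, keywords in seed_keywords.items():
            hits = len(tokens & keywords)
            if hits > best_hits:
                best_label, best_hits = label, hits
        labels.append(best_label)
    return labels

# Hypothetical seed lists for two topics
seeds = {
    "immigration": {"border", "visa"},
    "economy": {"jobs", "trade"},
}
```

The noisy-but-reasonable labels this produces would then serve as training data for the real classifier.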
Closing the first day of WADL2017 was Brenda Reyes Ayala with the final paper presentation, "Web Archives: A Preliminary Exploration of User Expectations vs. Reality". Ayala analyzed Archive-It support tickets (exported as XML, then cleaned and anonymized) using qualitative coding and grounded theory, and presented four user expectations, along with the reality that confronts users' mental models when working with archives:
  • Expectation: The original website had X number of documents, so it would follow that the archived website also has X number of documents.
    Reality: An archived website was often much larger or smaller than the user had expected.
  • Expectation: A web archive only includes content that is closely related to the topic.
    Reality: Due to crawler settings, scoping rules, and the nature of the web, web archives often include content that is not topic-specific. This was especially the case with social media sites. Users saw the presence of this content as superfluous and of little relevance.
  • Expectation: Content that looks irrelevant is actually irrelevant.
    Reality: A website contains pages or elements that are not obviously important but help "behind the scenes" to make other elements or pages render correctly or function properly. This knowledge is known to the partner specialist but usually unknown or invisible to the user or creator of an archive; partner specialists often had to explain the true nature of this seemingly irrelevant content.
  • Expectation: Domains and sub-domains are the same thing, and they do not affect the capture of a website.
    Reality: These differences usually affect how a website is captured.
2017-08-25 edit: Slides accompanying Ayala's talk have been made available: Web archives: A preliminary exploration of user expectations vs. reality, hosted by The Portal to Texas History.

Day 2 (June 23)

Day two started off with a panel featuring Emily Maemura, Dawn Walker, Matt Price, and Maya Anjur-Dietrich on "Challenges for Grassroots Web Archiving of Environmental Data". The first event took place in December in Toronto, to preserve EPA data from the Obama administration during the Trump transition. The event drew roughly two hundred participants and produced hundreds of press articles, tens of thousands of URLs seeded to the Internet Archive, dozens of coders building tools, and a sustainable local community of activists interested in continuing the work. Since then, seven events have been hosted or co-hosted in Philly, NYC, Ann Arbor, Cambridge MA, Austin TX, and Berkeley, with thirty-one more planned in cities across the country.
After the panel was Tom J. Smyth on "Legal Deposit, Collection Development, Preservation, and Web Archiving at Library and Archives Canada". Smyth spoke about how to start building a collection for a budding web archive that does not yet have the scale of an established one, and what such an archive needs:
Web Archival Scoping Documents
  • What priority
  • What type
  • What are we trying to document
  • What degree are we trying to document
Controlled Collection Metadata, Controlled vocabulary
  • Evolves over time with the collection topic
Quality Control Framework
  • Essential for setting a cut-off point for quality control
Selected Web Resources must pass four checkpoints
  • Is the resource in-scope of the collection and theme
    (when in doubt consult the Scoping Document)
  • Heritage Value: is the content unique or available in other formats?
    (in what contexts can it be used)
  • Technology / Preservation
  • Quality Control

The next paper presenters up were Muhammad Umar Qasim and Sam-Chin Li for "Working Together Toward a Shared Vision: Canadian Government Information Digital Preservation Network (CGI-DPN)". The Canadian Government Information Digital Preservation Network (CGI-DPN) is a project that seeks to preserve digital collections of government information and ensure the long-term viability of digital materials through geographically dispersed servers, protective measures against data loss, and forward format migration. The project will also act as a backup server in cases where the main server is unavailable, as well as a means of restoring lost data. To achieve these goals, the project uses Archive-It for web crawls and collection building, then uses LOCKSS to disseminate the collections to additional peers (LOCKSS nodes).
Nick Ruest was up next, speaking on "Strategies for Collecting, Processing, and Analyzing Tweets from Large Newsworthy Events". Ruest spoke about how Twitter is big data and handling it can be difficult, and about how to handle big Twitter data in a sane manner using tools such as Hydrator or Twarc from the DocNow project.


The final paper presentation of the day was by Saurabh Chakravarty, Eric Williamson, and Edward Fox on "Classification of Tweets using Augmented Training". Chakravarty discussed using the cosine similarity measure on Word2Vec-based vector representations of tweets to label unlabeled examples. He showed that training a classifier using augmented training provides improvements in classification efficacy, and that a Word2Vec representation generated from a richer corpus like Google News provides better improvements with augmented training.
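A toy version of that labeling step: each tweet is reduced to a vector (in the paper, built from Word2Vec embeddings), and an unlabeled example takes the label of its most cosine-similar labeled neighbor. The vectors and labels below are made up for illustration:

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def label_by_similarity(unlabeled_vec, labeled):
    """labeled: list of (vector, label) pairs; return the label of
    the most similar labeled example under cosine similarity."""
    return max(labeled, key=lambda pair: cosine(unlabeled_vec, pair[0]))[1]
```

With real tweets, each vector would typically be the average of the Word2Vec embeddings of the tweet's tokens, so a richer embedding corpus (like Google News) directly improves the similarity judgments.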

Closing Round Table On WADL

The final order of business for WADL 2017 was a round table discussion with the participants and attendees concerning next year's WADL and how to make WADL even better. A lot of great ideas and suggestions were made as the round table progressed, with the participants becoming most excited about the following:
  1. WADL 2018 (naturally, of course)
  2. Seeking out additional collaboration and information sharing with those who are actively engaged in web archiving but are unaware of, or did not make it to, WADL
  3. Looking into bringing proceedings to WADL, perhaps even a journal
  4. Extending the length of WADL to a full two- or three-day event
  5. Integrating remote participation for those who wish to attend but cannot due to geographical location or travel expenses
Till the Joint Conference on Digital Libraries 2018, June 3-7, in Fort Worth, Texas, USA!
- John Berlin

Thursday, June 29, 2017

2017-06-29: Joint Conference on Digital Libraries (JCDL) 2017 Trip Report

The 2017 Joint Conference on Digital Libraries (JCDL) took place at the University of Toronto, Canada. From June 19-23, we (WS-DL) attended workshops, tutorials, panels, and a doctoral consortium. The theme of this year's conference was #TOscale, #TOanalyze, and #TOdiscover. The conference provided researchers from disciplines such as digital library research and information science with the opportunity to communicate the findings of their respective research areas.
Day 1 (June 19)
The first day (pre-conference) of the conference kicked off with a Doctoral Consortium and a tutorial, "Introduction to Digital Libraries". These events took place in parallel with a workshop, the 6th International Workshop on Mining Scientific Publications (WOSP 2017). The final event of the day was a tutorial titled "Scholarly Data Mining: Making Sense of the Scientific Literature".

Day 2 (June 20)
The conference officially started on the second day with opening remarks from Ian Milligan, shortly followed by a keynote from Liz Lyon in which she presented a retrospective on data management, highlighting the successes and achievements of the last decade, assessing the current state of data, and providing insight into the research, policies, and practices needed to sustain progress.
Following Liz Lyon's keynote, Dr. Justin Brunelle opened the Web archives paper session with a presentation for a full paper titled, "Archival Crawlers and JavaScript: Discover More Stuff but Crawl More Slowly." In this presentation, he discussed the challenges Web archives face in crawling pages with deferred representations due to JavaScript, and proposed a method for discovering and archiving deferred representations and their respective descendants which are only visible from the client.
Next, Faryaneh Poursardar presented a short paper - "What is Part of that Resource? User Expectations for Personal Archiving," where she talked about the difficulty users face in deciding: what is part of, and what is not part of, an Internet resource? She also explored various user perceptions of this question and its implications for personal archiving.
Next, Dr. Weijia Xu presented a short paper - "A Portable Strategy for Preserving Web Applications and Data Functionality". Dr. Xu proposed a preservation strategy for decoupling web applications from data and the hosting environment in order to improve reproducibility and portability of the applications across different platforms over time.
Sawood Alam was scheduled to present his short paper titled: "Client-side Reconstruction of Composite Mementos Using ServiceWorker," but his flight was cancelled the previous day, delaying his arrival until after the paper session. 
Dr. Nelson presented the paper on his behalf, and discussed the use of ServiceWorker (SW) web API to help archival replay systems avoid the problem of incorrect URI references due to URL rewriting, by strategically rerouting HTTP requests from embedded resources instead of rewriting URLs.
The conference continued with the second paper session (Semantics and Linking) after a break. This session consisted of a pair of full paper presentations followed by a pair of short paper presentations.
First, Pavlos Fafalios presented - "Building and Querying Semantic Layers for Web Archives," which was also a Vannevar Bush Best Paper Nominee. Pavlos Fafalios proposed a means to improve the use of web archives. He highlighted the lack of efficient and meaningful methods for exploring web archives, and proposed an RDF/S model and distributed framework that describes semantic information about the content of web archives.
Second, Abhik Jana presented "WikiM: Metapaths based Wikification of Scientific Abstracts" - a method of wikifying scientific publication abstracts - in order to effectively help readers decide whether to read the full articles. 
Third, Dr. Jian Wu presented "HESDK: A Hybrid Approach to Extracting Scientific Domain Knowledge Entities." Dr. Jian Wu presented a variant of automatic keyphrase extraction called Scientific Domain Knowledge Entity (SDKE) extraction. Unlike keyphrases (important noun phrases of a document), SDKEs refer to spans of text representing a concept that can be classified as a process, material, task, dataset, etc.
Fourth, Xiao Yang presented "Smart Library: Identifying Books in a Library using Richly Supervised Deep Scene Text" - a library inventory building/retrieval system based on scene text reading methods, which has the potential of reducing the manual labor required to manage book inventories.
The third paper session (Collection Access and Indexing) began with Martin Toepfer's presentation of his full paper (Vannevar Bush Best Paper Nominee) titled: "Descriptor-invariant Fusion Architectures for Automatic Subject Indexing: Analysis and Empirical Results on Short Texts." He discussed the need for digital libraries to index documents automatically and accurately, especially considering concept drift and amid a rapid increase in content such as scientific publications. Martin Toepfer also discussed approaches for automatic indexing as a means to help researchers and practitioners in digital libraries decide on the appropriate methods. Next, Guillaume Chiron presented his short paper titled: "Impact of OCR errors on the use of digital libraries. Towards a better access to information." He discussed his research to estimate the impact of OCR errors on the use of the Gallica Digital Library from the French National Library, and proposed a means for predicting the relative mismatch between queried terms and the target resources due to OCR errors.
Next, Dr. Kevin Page presented a short paper titled: "Information-Seeking in Large-Scale Digital Libraries: Strategies for Scholarly Workset Creation." He discussed his research examining the information-seeking models ('worksets') proposed by the HathiTrust Research Center for research into the 15 million volumes of HathiTrust content. This research also involved assessing whether the information-seeking models effectively capture emergent user activities of scholarly investigation.
Next, Dr. Peter Darch presented a short paper titled: "Uncertainty About the Long-Term: Digital Libraries, Astronomy Data, and Open Source Software." Dr. Darch talked about the uncertainty digital library developers experience when designing and implementing digital libraries, presenting the case study of building the Large Synoptic Survey Telescope (LSST) digital library.
The third paper session was concluded with a short paper presentation from Jaimie Murdock titled: "Towards Publishing Secure Capsule-based Analysis," in which he discussed recent advancements in aiding HTDL (HathiTrust Digital Library) researchers who intend to publish their results from Big Data analysis of the HTDL. The advancements include provenance, workflows, worksets, and non-consumptive exports.
After the Day 2 paper sessions, Dr. Nelson conducted the JCDL plenary community meeting, in which attendees were given the opportunity to give feedback to improve the conference. The plenary community meeting was followed by Minute Madness - a session in which authors of posters had one minute to convince the audience to visit their poster stands.
The Minute Madness gave way to the poster session and a reception followed. 
Day 3 (June 21)
Day 3 started with a keynote from Dr. Raymond Siemens, in which he discussed the ways social scholarship framing of the production, accumulation, organization, retrieval, and navigation of knowledge, encourages building knowledge to scale in a Humanistic context.
Following the keynote, the fourth paper session (Citation Analysis) began with a prerecorded full paper (Vannevar Bush Best Paper Nominee) presentation from Dr. Saeed-Ul Hassan titled: "Identifying Important Citations using Contextual Information from Full Text," in which he addressed the problem of classifying cited work into important and non-important classes with respect to the developments presented in a research publication, an important step for algorithms designed to track emerging research topics. Next, Luca Weihs presented a full paper titled: "Learning to Predict Citation-Based Impact Measures." He presented non-linear probabilistic techniques for predicting the future scientific impact of a research paper, unlike linear probabilistic methods, which focus on understanding the past and present impact of a paper. The final full paper presentation from this session was titled: "Understanding the Impact of Early Citers on Long-Term Scientific Impact" and was presented by Mayank Singh. Mayank Singh presented his investigation into whether the set of authors who cite a paper early (within 1-2 years) affects the paper's Long-Term Scientific Impact (LTSI). In his research he discovered that influential early citers negatively affect LTSI, probably due to "attention stealing."
The conference continued with the fifth paper session (Exploring and Analyzing Collections), consisting of three full paper presentations. The first (Best Student Paper Award Nominee), titled: "Matrix-based News Aggregation: Exploring Different News Perspectives," was presented by Norman Meuschke. He presented NewsBird, a matrix-based news analysis (MNA) system that helps users see news from various perspectives, as a means to help avoid biased news consumption.
The second paper (Vannevar Bush Best Paper Nominee), titled: "Quill: A Framework for Constructing Negotiated Texts - with a Case Study on the US Constitutional Convention of 1787," was presented by Dr. Nicholas Cole, who presented the Quill framework. Quill is a new approach to present and study formal negotiation records such as creation of constitutions, treaties, and legislation. Quill currently hosts the records of the Constitutional Convention of 1787 that wrote the Constitution of the United States.
The final presentation for this session was from Dr. Kevin Page, titled: "Realising a Layered Digital Library: Exploration and Analysis of the Live Music Archive through Linked Data," in which he discussed his research which followed a Linked Data approach to build a layered Digital Library, utilizing content form the Internet Archive's Live Music Archive.
The sixth paper session (Text Extraction and Analysis) consisted of three full paper presentations. The first, titled: "A Benchmark and Evaluation for Text Extraction," was presented by Dr. Hannah Bast. Dr. Bast highlighted the difficulty of extracting text from PDF documents due to the fact that PDF is a layout-based format which specifies position information of characters rather than semantic information (e.g., body text or footnote). She also presented her evaluation of 13 state-of-the-art tools for extracting text from PDF. She showed that her method, Icecite, outperformed the other tools but is not perfect, and outlined the steps necessary to make text extraction from PDF a solved problem. Next, Kresimir Duretec presented "A text extraction software benchmark based on a synthesized dataset." To help text data processing workflows in digital libraries, he described a dataset generation method based on model-driven engineering principles and used it to synthesize a dataset and its ground truth directly from a model. He also presented a benchmark for text extraction tools. This paper session was concluded with a presentation by Tokinori Suzuki titled: "Mathematical Document Categorization with Structure of Mathematical Expressions." He presented his research in Mathematical Document Categorization (MDC) - the task of classifying mathematical documents into mathematical categories such as probability theory and set theory - and proposed a classification method that uses text together with the structures of mathematical expressions.
The seventh paper session (Collection Building) consisted of three full paper presentations and began with Dr. Federico Nanni's presentation (Best Student Paper Award Nominee) titled: "Building Entity-Centric Event Collections." Federico Nanni presented an approach that utilizes large web archives to build event-centric sub-collections consisting of core documents related to the events, as well as documents associated with the premise and consequences of those events.
Next, Jan R. Benetka presented a paper titled: "Towards Building a Knowledge Base of Monetary Transactions from a News Collection," where he addressed the problem of extracting structured representations of economic events (e.g., large company buyouts) from a large corpus of news articles. He presented a method which combines natural language processing and machine learning techniques to address this task.
I concluded the seventh paper session with a presentation titled: "Local Memory Project: providing tools to build collections of stories for local events from local sources". In this presentation, I discussed the need to expose local media sources, and introduced two tools under the umbrella of the Local Memory Project. The first tool, Geo, helps users discover nearby local news media sources such as newspapers, TV, and radio stations. The second, a collection-building tool, helps users build, save, share, and archive collections of local events from local sources, for US and non-US media sources.
Here are the slides I presented:
The eighth paper session (Classification and Clustering) occurred in parallel with the sixth paper session. It consisted of a pair of full papers and a pair of short papers. The first paper, titled: "Classifying Short Unstructured Data using the Apache Spark Platform," was presented by Saurabh Chakravarty. Saurabh Chakravarty highlighted the difficulty traditional classifiers have in classifying tweets, due in part to the shortness of tweets and the presence of abbreviations, hashtags, emojis, and non-standard usage of written language. Consequently, he proposed the use of the Spark platform to implement two short-text classification strategies. He also showed that these strategies are able to effectively classify millions of texts composed of thousands of distinct features and classes. Next, Abel Elekes presented his full paper (Best Student Paper Award Nominee) titled: "On the Various Semantics of Similarity in Word Embedding Models," in which he discussed the results of two experiments run to determine when exactly similarity scores of word embedding models are meaningful. He proposed that his method could provide a better understanding of the notion of similarity in embedding models and improve the evaluation of such models. Next, Mirco Kocher presented his short paper titled: "Author Clustering Using Spatium." Mirco Kocher proposed a model for clustering authors, after presenting the author clustering problem as it relates to authorship attribution questions. The model uses a distance measure called Spatium, derived from a weighted version of the L1 norm (the Canberra measure). He showed that this model produced high precision and F1 values when evaluated on 20 test collections. Finally, Shaobin Xu presented a short paper titled: "Retrieving and Combining Repeated Passages to Improve OCR." He presented a new method to improve the output of Optical Character Recognition (OCR) systems.
The method begins with detecting duplicate passages, then it performs a consensus decoding which is combined with a language model.
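Spatium is described above as deriving from a weighted version of the L1 norm (the Canberra measure); the plain Canberra distance it builds on can be sketched as follows. This is only the starting point, not Kocher's exact Spatium measure:

```python
def canberra(u, v):
    """Canberra distance: the sum over coordinates of
    |u_i - v_i| / (|u_i| + |v_i|), skipping coordinates where both
    entries are zero. For author clustering it would be applied to,
    e.g., relative word-frequency vectors of two authors' texts."""
    total = 0.0
    for a, b in zip(u, v):
        denom = abs(a) + abs(b)
        if denom:
            total += abs(a - b) / denom
    return total
```

Because each term is normalized by the magnitudes of the two entries, rare words contribute as strongly as common ones, which is why variants like Spatium typically restrict and weight the vocabulary.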
The ninth paper session (Content Provenance and Reuse) began with Dr. David Bamman's full paper presentation titled: "Estimating the Date of First Publication in a Large-Scale Digital Library." Dr. David Bamman discussed his findings from evaluating methods for approximating the date of first publication. The methods considered (and used in practice) include: using the date of publication from available metadata, multiple deduplication methods, and automatically predicting the date of composition from the text of the book. He found that a simple heuristic of metadata-based deduplication performs best in practice.
Dr. George Buchanan presented his full paper titled: "The Lowest form of Flattery: Characterising Text Re-use and Plagiarism Patterns in a Digital Library Corpus," in which he discussed a first assessment of text re-use (plagiarism) for the digital libraries domain, and suggested measures for more rigorous plagiarism detection and management.
Next, Corinna Breitinger presented her short paper titled: "CryptSubmit: Introducing Securely Timestamped Manuscript Submission and Peer Review Feedback using the Blockchain." She introduced CryptSubmit as a means to address the fear researchers have that their work may be leaked or plagiarized by a program committee or anonymous peer reviewers. CryptSubmit utilizes the decentralized Bitcoin blockchain to establish trust and verifiability by creating a publicly verifiable and tamper-proof timestamp for manuscripts.
Next, Mayank Singh presented a short paper titled: "Citation sentence reuse behavior of scientists: A case study on massive bibliographic text dataset of computer science." He proposed a new model of conceptualizing plagiarism in scholarly research based on the reuse of explicit citation sentences in scientific research articles, unlike traditional plagiarism detection, which uses text similarity. He provided examples of plagiarism and revealed that this practice is widespread, even among well-known researchers.
A conference banquet at Sassafraz Restaurant followed the last paper session of the day.
During the banquet, awards for best poster, best student paper, and the Vannevar Bush best paper were given. Sawood Alam received the most votes for his poster - Impact of URI Canonicalization on Memento Count - and thus received the award for best poster. Felix Hamborg, Norman Meuschke, and Dr. Bela Gipp received the best student paper award for "Matrix-based News Aggregation: Exploring Different News Perspectives." Finally, Dr. Nicholas Cole, Alfie Abdul-Rahman, and Grace Mallon received the Vannevar Bush best paper award for "Quill: A Framework for Constructing Negotiated Texts - with a Case Study on the US Constitutional Convention of 1787."
Day 4 (June 22)
Day four of the conference began with a panel session titled: "Can We Really Show This?: Ethics, Representation and Social Justice in Sensitive Digital Space," in which ethical issues experienced by curators who work with sensitive and contentious content from marginalized populations were addressed. The panel consisted of Deborah Maron (moderator) and the following speakers: Dorothy Berry, Raegan Swanson, and Erin White.
The tenth and last paper session (Scientific Collections and Libraries) followed and consisted of three full paper presentations. First, Dr. Abdussalam Alawini presented a paper titled: "Automating data citation: the eagle-i experience," in which he highlighted the growing concern of giving credit to contributors and curators of datasets. He presented his research in automating citation generation for an RDF dataset called eagle-i, and discussed a means to generalize this citation framework across a variety of different types of databases. Next, Sandipan Sikdar presented "Influence of Reviewer Interaction Network on Long-term Citations: A Case Study of the Scientific Peer-Review System of the Journal of High Energy Physics" (Best Student Paper Award Nominee). He presented his research, which sought to answer the question "Could the peer review system be improved?" amid a consensus from the research community that it is indispensable but flawed. His research attempted to answer this question by introducing a new reviewer-reviewer interaction network, showing that structural properties of this network surprisingly serve as strong predictors of the long-term citations of a submitted paper. Finally, Dr. Martin Klein presented: "Discovering Scholarly Orphans Using ORCID". Dr. Martin Klein proposed a new paradigm for archiving scholarly orphans - web-native scholarly objects that are largely neglected by current archival practices. He presented his research investigating the feasibility of using Open Researcher and Contributor ID (ORCID) as a means of discovering the web identities and scholarly orphans of active researchers.
Here are the slides he presented:
Dr. Salvatore Mele gave the keynote of the day. He discussed the significant impact preprints have had on research, such as in the high-energy physics domain, which has benefited from a rich preprint culture for more than half a century. He also reported on the results of two studies that aimed to assess the coexistence and complementarity between preprints and academic journals that are less open.
The 2017 JCDL conference officially concluded with Dr. Ed Fox's announcement of the 2018 JCDL conference to be held at the University of North Texas. 
--Nwala