Interview with Andrew Berger: What is the role of an archive in the digital age? Vol II

Andrew Berger is an archivist in the Manuscripts and Archives Department of the Yale University Library. He has a background in history and completed the dual masters program in archives and library and information studies at the University of British Columbia in 2012. These answers are his personal opinions and not those of his employer.

Photo: Lauren Manning

Who determines the worth of something to be digitised – “Quis custodiet ipsos custodes?”, who watches the watchmen. What is the role of the archivist in choosing what to digitise – and how does the cultural/social context of that archivist influence what is kept? (e.g. what was stored and kept from Ancient Greece influences our idea of what the classics are. What about all the works that weren’t retained? Someone decided what was kept, or circumstance e.g. war dictated it, and that has shaped our Western canon)

If you take “digitisation” to mean any sort of digital reproduction – and that’s the loose interpretation of the term that I’m going to be using – then my answer to the first part of the question is that lots of people do, and most of them aren’t archivists. A family scanning old photograph albums or digitising family videos; someone digitising a personal music collection; a community organization putting their materials online; a government agency scanning paper materials for an online exhibit – I think all of this counts. This is all happening in addition to the digitisation efforts that go on in archives (and in related institutions like libraries and museums). So from this perspective, archivists are only one of many groups making decisions about what gets digitised.

But just because I’m taking a broad interpretation of what it means to digitise, that doesn’t mean that I think all digitisation is more or less the same. How you digitise something – the technology you use, the metadata you create, the file formats you choose – has an impact on how well you’ll be able to maintain that digital object over time. So even though there are a lot of people creating digital stuff all the time, I still think that archives (and, again, related institutions) will have a fair amount of influence on what gets kept over the long run, provided of course that they actually commit the time and resources required for digital preservation. Not everyone is going to be able to make that kind of commitment.

So from that longer term perspective, I think that all of the social factors that have long influenced archivists’ work – their backgrounds, training, professional practices, institutional environments, and so on – will continue to operate. I guess I’m kind of side-stepping the second part of the question here, but I don’t think it’s really that much different with digitisation and digital materials than it has been with other kinds of material. It will continue to be the case that those with more power and more resources, both as individuals and as organizations, will be more likely to have their stuff kept over time. Nevertheless, I do think the potential is there for us to preserve a much broader range of materials from our current era than we have for earlier times, simply because so many people and groups now have the ability to create and keep their own stuff. But we will have to make an effort to do that; I don’t think it will simply happen.


Which is the better approach: the “quick and dirty” method of gathering the data (e.g. a member of the public flicking through the pages and scanning books at good resolution, but not to archive standard), or the high-resolution approach, where you could feasibly use something like Microsoft Silverlight to zoom in on every tiny bit of the page? At what point does the worth of something justify the latter, and will that mostly be older works (since we typically value these more highly)?

I think this really depends on what you want to do with it. If your goal is to produce a high-quality digital surrogate that can take the place of the original for most purposes – or if the original is in a format or state where it’s likely to decay within a fairly short period of time – then ideally you would take the highest level approach that you could afford. This means not just using archival standards for file formats and such, but also having a digital preservation system that can handle the digital objects that you’re creating.

If, on the other hand, you’re trying to create copies that are just “good enough” for some purpose – maybe you’re a researcher who just needs to be able to read the text of documents, or you’re doing an art project and you only need a certain resolution, or you just want to be able to listen to an mp3 of some song or speech on your not so great headphones, or you want examples of sources to use in a class you’re teaching but they just have to look ok on a slide – then maybe you don’t need to invest in making high quality copies. Plenty of people nowadays are taking phones and tablets and inexpensive cameras into archives and getting pictures that are good enough for their purposes but that are not really up to archival preservation standards.

Of course, in practice it’s not always easy to tell whether your goals will change. Say you start out digitising with a quick and dirty approach and then later you decide you need to adopt a higher standard. What do you do with what you’ve already digitised? Do you go back and redo it? This is a difficult question and I don’t really have any answer beyond “it depends.” It’s easy enough to say that you should always take the high-quality approach, but that’s not always possible at the outset.


Should you digitise systematically (by chronology, for example) or digitise on demand, which is more important?

I think ideally you would take a systematic approach and organize your digitisation projects around coherent wholes: whole collections, or logical groupings within collections, or groups of related collections. I do think digitisation on demand programs are still a good idea simply because they can increase access, but you have to be careful about how you represent what does not get digitised. As a researcher, I’ve ordered paper photocopies or taken digital photographs of thousands of documents, but it’s been pretty rare for me to copy an entire series or even an entire folder. That is to say, I’ve been highly selective and I wouldn’t want someone looking over the materials I’ve collected from any one archival collection to make the mistake of thinking that what I’ve digitised for my personal use is fully representative of what’s in the collection. Aggregating all the requests made by all the people who’ve used a collection would probably be more representative, but it would almost certainly still leave significant gaps, especially in larger or less-frequently used collections.

Still, even partial digitisation can be usefully suggestive to later researchers. Maybe a request that got digitised turned up something that wasn’t in the finding aid for that collection, and then another researcher comes along and uses that as an entry point for their own research, which then results in more of the collection being digitised, and so on. I think that kind of outcome would be great, but it still might never lead to the whole collection getting digitised without some kind of additional, systematic effort.

Are we digitising to save space – or to preserve – can works be destroyed to save space – or is this unethical?

I firmly believe that if you’ve already made a commitment to preserve the original materials, you need to stand by that as best you can. Digitise for access and preservation, but if you can keep the original – if it’s not irretrievably damaged or decayed – then you shouldn’t destroy it just to make space. This calculation might look different in a situation like in a library where you’re consolidating a collection and you find you have multiple copies of some widely-available book. But even then I don’t think you should destroy the “last” copy just because you’ve digitised the content.

Are paywalls a positive thing for Digital Humanities – should these materials be open source (is this feasible, how would it finance itself)?

I don’t really think I know enough about how projects are financed to be able to answer this question in detail. Certainly I’m in favor of open access/open source models where feasible. Paywalls generally restrict both who can engage in digital humanities work and how far that work reaches outside of the academic community. So to the extent that people working in the digital humanities aspire to make their work more widely available than traditional (academic) humanities work, paywalls can work against that. But I don’t doubt that there are also arguments to be made about how paywalls have made it possible to do some work that wouldn’t have been done at all under a different model.

What is your opinion on the role of large corporations in Digital Humanities? Is the Google approach to digital archiving (e.g. digitisation of books) a positive force – or should large corporations be kept out of the digital arts world?

As with paywalls, I don’t have personal experience working with corporations, but I don’t see why they can’t have some role, provided that corporate interests aren’t what’s driving scholarly work. Just taking Google as an example, you can see a wide range of outcomes. I think Google’s book scanning has been a net positive so far, especially with respect to public domain books, although what’s really made it valuable to me has been all the work that libraries have done through HathiTrust [http://www.hathitrust.org/] to provide another way of accessing digitised books from Google and other sources. I really prefer their catalog and interface to Google’s, and I’m glad that the agreements with Google didn’t prevent that from being developed.

On the other extreme, Google had a newspaper digitisation project for a while but abandoned it before it was done. That’s exactly the kind of outcome we should be trying to avoid.

What are your thoughts about a concept in “Rainbows End” where a machine digitises complete libraries and then pulps them afterwards?

If you’re running rare materials through the machine and you’re supposed to have made a commitment to preserve them, it sounds awful. But I can think of situations where a machine like that might come in handy. Say you don’t have a preservation mandate, you’re sure that what you want to digitise isn’t rare or unique, and you’re fine with having ebooks, then maybe a machine like that doesn’t sound like such a nightmare. I will admit to digitising a few copies of my own books where the pages were falling out and then disposing of the paper book afterward. But I wouldn’t have done that if I hadn’t known that dozens or even hundreds of libraries owned copies of those books.

What are your thoughts on databases for the humanities and your ideas re expanding text and art culture through digitisation?

Overall, I think the development and expansion of research databases has been a great benefit to researchers. Certainly this has been my own experience in doing historical research. When I was an undergraduate, I wrote my final major research paper based on pamphlets I found in the English Short Title Catalogue and which I had to find and read via microfilm. I ended up printing out hundreds of pages because there’s really only so long you can sit in front of a microfilm machine. One year later, all of these pamphlets were online. I had similar experiences as a graduate student.

At the same time, databases aren’t without their costs. Many of the ones that I’ve found most useful are ones that I’ve only been able to access because I was fortunate enough to be affiliated with universities that paid for subscriptions. There are inequalities across institutions with respect to access to these databases and I think that’s a real problem.

Many databases also come with restrictions that can have a real effect on the kind of research that can be done using them. Can you download whole documents or are you limited to a certain number of pages? Are you allowed to do things like text mining? A few places are now providing bulk access for research purposes, but I think that’s still the exception.

Finally, I think database providers need to be transparent about both how their database has been created and how it actually works. How were/are the materials chosen? If the database is based on an existing print or microfilm collection, how was that collection created and was anything left out in the microfilming or digitisation process? If it’s black and white, was there anything originally in color? Was oversize material included?

Also important: How does the search function work? A lot of things could be happening in the background: there’s almost certainly a stop list of words not included in the search (like the word “the”), there’s probably also some kind of system in place for identifying roots and stems (how does the database handle plurals?), and there may even be some support for identifying synonyms. Whenever I see people citing search result counts, I wonder: do you know how those numbers were generated?
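To make that concrete, here is a minimal sketch of how a stop list and plural handling can change a result count. It is a toy illustration, not how any particular database works: the stop list, the `naive_stem` rule, the function names, and the three sample titles are all made up for the example.

```python
import re

# Hypothetical stop list; real databases use their own, often undocumented, lists.
STOP_WORDS = {"the", "a", "an", "of", "and"}


def naive_stem(word):
    """Toy plural handling: strip one trailing 's' from longer words."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def tokenize(text, use_stop_list=True, use_stemming=True):
    """Lowercase the text, split it into words, then apply the optional steps."""
    words = re.findall(r"[a-z]+", text.lower())
    if use_stop_list:
        words = [w for w in words if w not in STOP_WORDS]
    if use_stemming:
        words = [naive_stem(w) for w in words]
    return words


def count_hits(query, documents, **options):
    """Count documents that contain every query term after processing."""
    query_terms = set(tokenize(query, **options))
    if not query_terms:
        # The whole query was made of stop words, so nothing is searched.
        return 0
    return sum(
        1 for doc in documents
        if query_terms <= set(tokenize(doc, **options))
    )


# Made-up sample titles, for illustration only.
documents = [
    "A pamphlet on the rights of the people",
    "Pamphlets printed in London",
    "The history of a printer",
]

with_stemming = count_hits("pamphlets", documents)
without_stemming = count_hits("pamphlets", documents, use_stemming=False)
print(with_stemming, without_stemming)
```

In this sketch the same query returns a different count depending on whether plurals are folded together, and a search for “the” returns nothing at all when the stop list is on, even though two of the three titles contain the word. That is the kind of background behaviour that a bare result count hides.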

I’ve gone on at length here, but I recommend reading Benjamin Schmidt’s “What historians don’t know about database design…” [http://sappingattention.blogspot.com/2011/03/what-historians-dont-know-about.html] for an extended take on this issue.

What is your definition of Digital Humanities (DH)?

I try not to have one, to be honest. I don’t mean to be glib. I have been following the ongoing debates about how to define the field, but what I’m primarily interested in is what this all means for research, preservation, and access. For example, as more people do or want to do types of text mining across collections, archives need to be thinking about how they can facilitate that. The same goes for access to born digital materials. And farther along in the research process, there’s the question of how you preserve the work that’s being produced: What forms will it take? Databases? Complex websites? Will there be accompanying data sets? Those are more the kinds of questions that I’m interested in right now. So while I think that the definitional debate is an important one, as it will shape the kind of work that gets done, I also feel like I’m at enough remove from it that I don’t need to stake out my own position right now. I could very well be wrong about that, though.
