Tidy Data For Humanists


Course Description:
Many tools and tutorials promise to help you clean up your messy data, an essential step before doing any kind of network, text, spatial, or quantitative analysis or visualization. But how do we even figure out what “clean” means when it comes to complex humanities knowledge, especially when we may not yet know what kind of analysis we eventually want to do? Participants will come out of this class understanding how to create a data plan that captures the parts of their sources that matter for their research questions, how to handle complex relationships and uncertainty, and how to format that information into tidy data that can then be reshaped as needed to drive databases, websites, analyses, and visualizations. We will also cover the practical side of using software such as Google Sheets, OpenRefine, and Palladio to collect, tidy, and make exploratory visualizations of humanistic data. Additionally, we’ll learn about using Linked Open Data to interconnect our research databases with objects, documents, and authority lists maintained by institutions such as archives, libraries, and museums, focusing on pragmatic steps that real-life researchers can take to get the most out of connecting their newly created knowledge (and the data that come with it) back into the larger ecosystem on which we all depend. This course assumes no prior knowledge of databases or coding, and will use freely available tools. We will work with some sample data sets over the course of the week, but participants are encouraged to bring their own data, or sources that they hope to transform into data, for group “data therapy” sessions in which they can apply the lessons learned each day to their own work and research.

Dr. Matthew Lincoln is a research software engineer at Carnegie Mellon University Libraries, where he focuses on computational approaches to the study of history and culture, and on making library and archives collections tractable for data-driven research. His current book project with Getty Publications, co-authored with Dr. Sandra van Ginhoven, uses data-driven modeling, network analysis, and textual analysis to mine the Getty Provenance Index Databases for insights into the history of collecting and the art market. He earned his PhD in Art History at the University of Maryland, College Park, and has held positions at the Getty Research Institute and the National Gallery of Art. He is an editorial board member of The Programming Historian.