Monique Leahey Sugimoto

UCLA Department of Information Studies

IS 277: Information Retrieval Systems

June 15, 2006

 

Genealogy and the Semantic Web:

 A Guide for Family Historians and Amateur Genealogists

 

Genealogy Research

It can happen to anyone: somehow, somewhere, and for whatever reason (religion, a pastime, or just plain curiosity), you get interested in your family history. It happened to me when I was between jobs in the late 1990s. Trying to make sense of how I had arrived at my state of unemployment, I was naturally led to finding out how I physically got here. I did what any budding genealogist with an interest in technology would do: I turned to the Internet and started looking for family members. After scouring listservs, posting messages to electronic bulletin boards, and searching through countless online databases, I discovered one of my father’s distant cousins. Once our information was verified, I was added to her family tree and she was added to mine.

 

This pattern of serendipitous discovery, verification of relations, and ultimate sharing of information has been part and parcel of genealogical research since at least the turn of the twentieth century. A hundred years ago, genealogical research drew on indexes, already established family histories, and public records (when they were kept), nearly all of them in printed form. The reasons for doing genealogical research and the types of documents used have not changed dramatically over the years. What have changed, however, are the methodologies and tools available to perform research. Where research once meant going to a library or traveling to a city or town to consult records, advances in the World Wide Web mean that a great deal of one’s research can now be done (or at least started) with a personal computer and a connection to the Internet. The Web has become not just a useful tool for conducting genealogy research but an indispensable one. What about the next generation of the Web, the so-called Semantic Web?

 

Since Tim Berners-Lee’s article on the Semantic Web was published in Scientific American, much attention has been given to this next generation of the Web. What exactly is the “Semantic Web”? One way to think about it is to consider current web pages. Web pages today are meant to be accessed, read, and understood by humans. The vision of the Semantic Web is to create web pages (or Web resources) in such a way that a computer can also access, understand, and interpret them. With such an understanding, a computer can then perform certain actions and help people with their work.

 

To realize the Semantic Web vision, however, new tools and computer languages must be developed, since current tools were not designed for such purposes. This article covers some of the basic concepts of the Semantic Web and illustrates how these tools are being used (or could be used) to help genealogists with their research. Whether the vision of the Semantic Web will be realized is yet to be determined. It does, however, offer an interesting direction for the future use of the Web and gives food for thought about potential changes to the process of genealogy research.

 

This article is intended as an introduction to the Semantic Web for family historians and amateur genealogists. Though a certain amount of familiarity with basic technical terms is assumed, explanations of technical aspects have been written (it is hoped) in a way that is easy to understand. Links to more technical explanations for further study are given wherever possible. The article starts with an overview of standardization in genealogy, since standardization is a key concept in realizing the Semantic Web vision. It then explains the various computer languages on which the Semantic Web is built and shows examples of how each has been used for genealogy.

 

Genealogy and Data Standards: What is in a standard?

As noted above, standardization is integral to the Semantic Web, especially for data representations. It is through standardized representations that computers can make use of the information in various ways. Before data can be standardized, however, there must be agreement on the fundamental concepts of an area, or at least on those aspects for which computer understanding or processing is desired. Genealogy is full of complexity. Its core concepts include, among others, the process of genealogical research, the primary and secondary source documents used in research, and of course an understanding of family relationships. The aspect of genealogy this article focuses on is family relationships.

 

Standardizing family relationships is not as easy as it may seem. How, for example, is the notion of a family defined? Is a family defined by a married man and woman who have at least one child? Is a couple with no children considered a family? Does an unmarried woman with one child whose paternity is not clear constitute a family? Do all cultures have the same definition of “family”? Questions such as these have come up repeatedly, and the different answers given to them can be seen in the various standards developed for genealogy. They also illustrate the difficulty of defining the concepts of an area of study, since those concepts can be seen from many different perspectives.

 

History of Standardization Efforts in Genealogy

Attempts at standardizing genealogical data go back at least to the 1980s. In the 1985 proceedings of the American Library Association’s RASD History Section, Genealogy Committee Program, two standards for genealogy data were introduced: MARGEDOS (Machine-Readable Genealogy with DOcumentation of Sources) and GEDCOM (GEnealogical Data COMmunications). MARGEDOS was based on the MARC (Machine-Readable Cataloging) format developed by the Library of Congress for bibliographic cataloging. Coming out of the library tradition, it was essentially a communication format that allowed genealogists to evaluate sources of information. GEDCOM was developed by the Family History Department of The Church of Jesus Christ of Latter-day Saints (LDS). (For an excellent article explaining GEDCOM, see GEDCOM Explained.) Its focus is on providing a format for exchanging computerized family history data with other researchers. The development of GEDCOM grew out of the LDS Church’s efforts to help its members conduct genealogy research as part of their religious beliefs and practices. Since GEDCOM was presented at the 1985 Genealogy Committee Program, it has been published in several versions. The latest version is GEDCOM 5.5, which was published in 1996. The two standards differed in their approaches and in the problems they were trying to resolve, and they met different fates: MARGEDOS was never adopted, while GEDCOM has become the de facto standard for exchanging genealogy data, both in LDS Church-created software and in other genealogy software.

 

In the late 1990s, an alternate approach was taken to create a model for genealogy. This model was developed by the GENTECH Data Modeling Project of the National Genealogical Society and focused on the process of genealogy research. It addresses one of the most important parts of that process: documenting sources and drawing conclusions about an ancestor. (For an easy-to-understand explanation of the process model, click here.) It sought to outline exactly how genealogists conduct their research and not simply the conclusions they come to. GENTECH did not intend to create a standard with defined fields, as the GEDCOM standard does. Several attempts, however, were made to create such a standard based on this model. (See GedXML, for example.) Together, the GEDCOM and GENTECH models illustrate differences in worldviews, approaches, and intentions, all of which matter when considering the Semantic Web and standardization.

 

The Semantic Web and Genealogy

The objective of the Semantic Web is to encode data on the Web in a standardized format such that it can be processed by computers. A common understanding of a field, and agreement about it, then becomes extremely important when considering how a computer can make use of the data. By encoding data in more standardized ways, data on the Web can be shared and reused for various purposes. Currently, much of the data posted on the Web, including genealogical data, is written in HTML (HyperText Markup Language), a format which cannot be used to perform such processing. Rather, the purpose of HTML is to “mark up” or add “tags” to text so that it will be presented or displayed in a certain way. For example, HTML markup tells your Internet browser what text should be in italics or bold, or even what color the text should be. HTML does not describe how the content of a document is structured as data. For computers to make use of the documents on the Web, a different type of markup language is necessary.

 

XML

To create more sophisticated formats for data representation, the World Wide Web Consortium (W3C) has been developing other markup languages. The first of these languages is XML (eXtensible Markup Language). Each language that has followed builds upon the features or properties of the one that precedes it. As the languages progress, more and more capabilities are added to enable the type of processing envisioned for the Semantic Web. It is in XML that much of the effort to create standardized representations in genealogy has occurred.

 

XML was developed as a way to describe the data contained in a web document and not, like HTML, to describe the way the data looks when it is displayed. (See XML Tutorial for more details.) XML still allows for control of the presentation or display of the data on a web page, but this is done through separate stylesheet definitions. An XML document type that defines how obituary notices should be structured, for example, could specify that an obituary notice is a document comprising the deceased’s name, place of death, burial place, survivors, source of information, etc. (To see example XML tags, click on this obituary notice and then select “Page Source” from the View menu.) The tags and structure of an XML document are defined by a user in a DTD (Document Type Definition) file or in an XML Schema document. The DTD formally defines what elements (place of death, burial place, etc.) a document can or should have, as well as how those elements should be structured (i.e., ordered) in the document. XML Schema serves the same purpose but can also constrain the data types and the particular values that an element may take.
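To make the obituary idea concrete, here is a minimal sketch in Python using the standard library’s XML parser. The tag names (deceased, placeOfDeath, and so on) and the sample values are invented for illustration and are not taken from any published genealogy standard; the list of required elements is only a stand-in for what a real DTD or XML Schema would declare.

```python
import xml.etree.ElementTree as ET

# A hypothetical obituary notice. Tag names and values are illustrative only.
OBITUARY_XML = """<obituary>
  <deceased>Henry Example</deceased>
  <placeOfDeath>Example Town</placeOfDeath>
  <burialPlace>Example Cemetery</burialPlace>
  <survivors>
    <survivor relation="son">Jack Example</survivor>
  </survivors>
  <source>Example Daily News</source>
</obituary>"""

# Stand-in for the structure a DTD or XML Schema would formally define.
REQUIRED = ["deceased", "placeOfDeath", "burialPlace", "survivors", "source"]


def missing_elements(xml_text):
    """Return the required child elements that the notice does not contain."""
    root = ET.fromstring(xml_text)
    return [name for name in REQUIRED if root.find(name) is None]


# Because the data is tagged by meaning, a program can ask for it by name.
record = ET.fromstring(OBITUARY_XML)
place_of_death = record.findtext("placeOfDeath")
survivor_names = [s.text for s in record.iter("survivor")]

print(missing_elements(OBITUARY_XML))  # nothing is missing from this notice
print(place_of_death, survivor_names)
```

The point of the sketch is that the program never has to guess which piece of text is the burial place; the markup says so. An HTML page showing the same obituary would only say which words are bold or italic.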

 

After XML began to be widely adopted, the LDS Church produced an XML version of GEDCOM called GEDCOM XML. It was distributed in 2002 for comment among the genealogy software development community. GEDCOM XML uses a DTD instead of XML Schema to define its structures. The GEDCOM DTD specifies what information can be recorded for a family, individual, event, etc. (An example of the GEDCOM XML DTD to describe a family is given here.) GEDCOM XML met with a lukewarm reception from the development community because of the shortcomings inherent in the traditional GEDCOM version. Reflecting the Church’s beliefs about family and marriage, GEDCOM is built upon an underlying model which does not recognize non-traditional family constructs or families where a parent of a child is not known. These constraints were carried over into the GEDCOM XML version. (See More on LDS Church’s Adoption of the XML Standard.) Eventually GEDCOM XML was dropped by the Church altogether. (See Clarification of the Use of GEDCOM/XML. This article was published before the standard was dropped but offers insight into the Church’s position.)

 

While the GEDCOM XML version was not adopted, it spawned a number of other XML-based projects. Several of these projects, including GedML, GedXML and a few others that did not gain any traction, were based on traditional GEDCOM. (See the CoverPages list of XML projects for other XML-based projects.) GedXML is an example of a standard which uses XML Schema to define what values can be used but also incorporates aspects of the GENTECH genealogical process model mentioned earlier to expand its representational capacity. The values allowed for “eventType,” for example, include annulment, baptism, birth, burial, confirmation, death, immigration, internment, marriage, etc. (To see the schema, click here and then select “Page Source” from the View menu.) The event types noted in the GEDCOM DTD, by contrast, include only birth, marriage and death. Though XML can define the structure (or syntax) of a document and also specify what may or may not be included in that document, it does not “understand” what the content of the document is. For more understanding, we need RDF.

 

RDF

The next language layer built on top of XML is RDF (Resource Description Framework). While XML specifies what structure and elements a document can have, RDF provides a more general way of describing the document (which RDF calls a “resource”) and paves the way for doing more processing with the data. In RDF you are not limited to describing web documents; at its highest level, RDF allows you to describe anything on the Web (a document, an application, etc.) by making formal statements about that item. These statements are made in RDF using “triples.” A triple is made up of a “resource,” its “property,” and the “value” of that property (in RDF terms, the subject, predicate, and object). For genealogy we can describe a resource that captures a “relationship” between two people, as expressed in the statement ‘Jack’s father is Henry.’ The triple used to represent this is ‘Jack hasFather Henry.’ In this case, ‘Jack’ is the resource, ‘hasFather’ is a property of the resource Jack, and ‘Henry’ is the value of the property. With RDF we can describe abstract concepts and not just document structures as with XML. An example genealogy for George Washington written in RDF is given here. This file shows George Washington’s date of birth (11 Feb 1731), the date of his death (14 Dec 1799) and the place of his death (Mt. Vernon, Fairfax, VA). Though the syntax is cumbersome (and not very intuitive), the line that begins with “<family rdf:resource …” points to a resource for the family to which George Washington belongs. Another example of a genealogy written in RDF is one for European royalty. In this example, the properties ‘childIn,’ ‘spouseIn,’ ‘birth’ and ‘death’ are given for each individual.
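The idea of a triple can be sketched in a few lines of Python. This is only an illustration of the resource/property/value pattern, not real RDF syntax: the property names other than ‘hasFather’ (dateOfBirth, dateOfDeath, placeOfDeath) are assumed names chosen for the example, and the values are the ones quoted above from the George Washington file.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One statement: a resource, one of its properties, and the value."""
    resource: str
    prop: str
    value: str


# Statements taken from the examples in the text. Property names other than
# hasFather are hypothetical and are not copied from the actual RDF file.
triples = [
    Triple("Jack", "hasFather", "Henry"),
    Triple("George Washington", "dateOfBirth", "11 Feb 1731"),
    Triple("George Washington", "dateOfDeath", "14 Dec 1799"),
    Triple("George Washington", "placeOfDeath", "Mt. Vernon, Fairfax, VA"),
]


def values_of(resource, prop, store):
    """Return every value recorded for the given property of a resource."""
    return [t.value for t in store if t.resource == resource and t.prop == prop]


print(values_of("Jack", "hasFather", triples))
print(values_of("George Washington", "placeOfDeath", triples))
```

Because every statement has the same three-part shape, statements from different files can simply be pooled into one collection and queried together, which is part of what makes RDF attractive for sharing family data.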

 

RDF Schema, a companion to RDF, provides a way of defining specific vocabularies in a hierarchy using the same “triple” statements mentioned above. (Here are some links to sample vocabularies for describing relationships between people, and for biographical information.) By allowing different groups to specify different vocabularies that refer to the same thing, RDF Schema makes it possible for different communities to talk about the same thing using the language specific to their community. In addition, with vocabularies arranged in a hierarchy, RDF Schema makes it possible to make inferences about the data. For example, we can define ‘father’ as a subgroup of ‘parent’ in a family relation hierarchy. Given the statement ‘Madeline hasFather Mark,’ software that understands RDF Schema can infer that ‘Mark’ is the ‘parent’ of Madeline even though no statement says so explicitly. Together, RDF’s way of describing resources and RDF Schema’s capacity for inference provide a tool for encoding and interpreting information that enables much more understanding and use of it. The way RDF expresses statements is well suited to expressing family relationships.
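The ‘father is a kind of parent’ inference can be sketched as follows. This is a toy illustration of the idea and not an RDF Schema processor; the hierarchy table and the property name ‘hasMother’ are assumptions added for the example.

```python
# Hypothetical property hierarchy: each property points to the broader
# property it is a kind of.
SUBPROPERTY_OF = {"hasFather": "hasParent", "hasMother": "hasParent"}


def infer_superproperties(statements, hierarchy):
    """Add every statement implied by the property hierarchy."""
    inferred = set(statements)
    pending = list(statements)
    while pending:
        subject, prop, value = pending.pop()
        broader = hierarchy.get(prop)
        if broader is not None:
            implied = (subject, broader, value)
            if implied not in inferred:
                inferred.add(implied)
                pending.append(implied)  # the broader property may itself have a parent
    return inferred


stated = {("Madeline", "hasFather", "Mark")}
closed = infer_superproperties(stated, SUBPROPERTY_OF)

# The 'hasParent' statement was never written down, but it is now available.
print(("Madeline", "hasParent", "Mark") in closed)
```

A search for Madeline’s parents would now find Mark, even though the person who entered the data only ever recorded him as her father.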

 

OWL

The last (so far) of the W3C tools to realize the Semantic Web is OWL (Web Ontology Language). As RDF was built on top of XML, OWL is layered on top of RDF. Building on the features and capabilities of its predecessors, OWL is an ontology language (as is RDF Schema) which describes the meanings of terms used in web documents or, in keeping with the RDF terminology, “resources.” An ontology is a description of the concepts of a domain, such as genealogy, and the relationships that exist between and among those concepts. (For an example of different types of general ontologies that have already been developed, click here.) Ontologies supply the meaning of data and information sources in a form that software can process and act upon. OWL allows more detailed specifications and restrictions on properties (like ‘hasFather’ above) and on relationships between concepts than RDF does. For example, it is possible in OWL to write a statement which says that one property is the ‘inverse of’ another. The property ‘hasChild,’ for example, can be expressed as the inverse of ‘hasParent.’ The powerful thing about OWL is that, with such a method of defining or restricting properties, it is possible for an OWL inference engine to draw conclusions from one or more statements. By defining ‘hasChild’ as the inverse of ‘hasParent,’ it is possible in OWL to infer from the statement ‘Martine hasChild Madeline’ that ‘Madeline hasParent Martine’ without the second statement being formally written.
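The ‘inverse of’ inference can be sketched in the same toy style. This is not an OWL inference engine, only an illustration of the single rule described above; the table name INVERSE_OF and the function name are invented for the example.

```python
# Hypothetical declaration that hasChild is the inverse of hasParent.
INVERSE_OF = {"hasChild": "hasParent"}


def with_inverses(statements, inverse_of):
    """Return the statements plus every statement implied by an inverse pair."""
    # An inverse relationship works in both directions.
    both_ways = dict(inverse_of)
    both_ways.update({v: k for k, v in inverse_of.items()})

    result = set(statements)
    for subject, prop, value in statements:
        inverse = both_ways.get(prop)
        if inverse is not None:
            result.add((value, inverse, subject))  # swap subject and value
    return result


stated = {("Martine", "hasChild", "Madeline")}
inferred = with_inverses(stated, INVERSE_OF)

print(("Madeline", "hasParent", "Martine") in inferred)
```

In practice this means a family tree only needs each relationship entered once, in whichever direction the researcher happened to find it.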

 

A portion of the GEDCOM standard that has been encoded in OWL is located here. (This example is actually written in DAML+OIL, a precursor to OWL.) An example of a family ontology also written in OWL that is not based on GEDCOM is the SWRL family ontology. These two examples show different approaches to defining family relationships. Important in the GEDCOM ontology is the concept of “events.” It specifies, for example, “family” events (marriage and divorce) and events for an “individual” (birth and death). Events are further specified with the maximum number of dates and places (one of each) that can be recorded for an event. The SWRL ontology takes a more granular approach that does not focus on events. Instead, it defines people and family members by their properties or attributes. In this example, a ‘woman’ is defined as a ‘person’ whose ‘gender’ is ‘female.’ The concept of a ‘son’ is defined as a ‘man’ (male) who is also a ‘child.’ In OWL it is also possible to state that two concepts are not the same. ‘Son,’ for example, is further defined as ‘disjointWith’ (OWL’s way of saying “is not of the same”) something defined as ‘daughter.’ The SWRL representation is more expressive than the GEDCOM one.
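The attribute-based definitions described for the SWRL family ontology can be imitated in a short Python sketch. It is a loose illustration only: the field names (gender, is_child) are assumptions, the definition of ‘daughter’ as a ‘woman’ who is also a ‘child’ is my own mirror of the ‘son’ definition rather than something quoted from the ontology, and real OWL reasoning is considerably richer than this.

```python
def classes_of(person):
    """Work out which concepts a person belongs to from their attributes."""
    classes = {"person"}
    if person.get("gender") == "female":
        classes.add("woman")
    if person.get("gender") == "male":
        classes.add("man")
    if person.get("is_child"):
        classes.add("child")
        if "man" in classes:
            classes.add("son")       # a 'man' who is also a 'child'
        if "woman" in classes:
            classes.add("daughter")  # assumed mirror of the 'son' definition
    return classes


# Pairs of concepts declared never to overlap, as with 'disjointWith'.
DISJOINT = [("son", "daughter")]


def violates_disjointness(asserted_classes):
    """Return the disjoint pairs that a set of asserted concepts breaks."""
    return [pair for pair in DISJOINT if set(pair) <= set(asserted_classes)]


print(classes_of({"gender": "male", "is_child": True}))
print(violates_disjointness({"son", "daughter"}))
```

The second function shows why a ‘disjointWith’ statement is useful: if two merged family trees record the same individual as both a son and a daughter, software can flag the conflict for a human to resolve.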

 

As the Semantic Web languages have progressed from XML to RDF and finally to OWL, it has become possible not only to define the structure, elements, and vocabularies a document can have, but also to make inferences from the content of the document (if it is specified in RDF Schema or OWL). With the advances in these languages, computers now have a way of “understanding” representations and, further, of making use of them.

 

Putting it All Together:  Semantic Web Projects

Much of the Semantic Web assumes (and requires) clear definitions and standardized ways of defining concepts. For genealogy these concepts include, at the very least, family relationships, source documents, and genealogy research processes. While this article has focused on standards and examples using family relationships, the other two areas should also be taken into consideration. The issues that need to be considered to realize the Semantic Web vision for genealogy are not trivial. Who will create these standards and definitions? Since genealogy research depends so heavily on vital records, how can vital records and other source documents be standardized and expressed so that they too may be accessed and used by computers? On a more practical level, how will current web pages, which are geared to displaying content and not to defining it, be translated into these definitions?

 

Since the Semantic Web is at a very experimental stage, little is known about whether and how the tools currently in development will actually be used. Despite the complexities, however, given the expressiveness of the current OWL language, it is possible to imagine scenarios in which these technologies may facilitate genealogy and family history research. Imagine, for example, two data files: one for a family genealogy in which the name and date of death of an individual are known but the place of death isn’t, and a second describing war battles, tagged with the date of each battle, the names of those who perished in it, and the place in which it took place. Provided both data files are written in a standardized and machine-readable way (as with the OWL language), it may be possible with other Semantic Web technologies (such as web services) to automatically discover the place of death of the family member and, further, to add other missing information to the record. Just as the Internet changed genealogy research by enabling the posting and sharing of genealogy data on the Web, so too may the Semantic Web change genealogy research by “understanding” our genealogy data and then helping us to compile it.
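The battle scenario can be sketched in Python to show what such automatic discovery might look like once both files share an agreed structure. Everything here is hypothetical: the names, dates, places, and field names are invented, and a real system would work on RDF or OWL data rather than Python dictionaries.

```python
# File 1: a family genealogy with a missing place of death (invented data).
family_records = [
    {"name": "John Example", "date_of_death": "1863-07-02", "place_of_death": None},
    {"name": "Mary Example", "date_of_death": "1890-01-15", "place_of_death": None},
]

# File 2: battle records tagged with date, place, and the names of the dead.
battle_records = [
    {
        "battle": "Hypothetical Battle",
        "date": "1863-07-02",
        "place": "Example Field",
        "perished": ["John Example", "Tom Sample"],
    },
]


def fill_place_of_death(people, battles):
    """Return copies of the family records with places of death filled in
    wherever a battle record matches on both name and date."""
    filled = []
    for person in people:
        updated = dict(person)  # leave the original record untouched
        if updated["place_of_death"] is None:
            for battle in battles:
                same_day = battle["date"] == updated["date_of_death"]
                if same_day and updated["name"] in battle["perished"]:
                    updated["place_of_death"] = battle["place"]
                    updated["place_source"] = battle["battle"]  # keep the evidence
                    break
        filled.append(updated)
    return filled


for record in fill_place_of_death(family_records, battle_records):
    print(record["name"], record["place_of_death"])
```

Matching on a name and a date alone is naive, since two people can share both. The sketch therefore records where the place came from, so that the genealogist can still verify the match, just as the cousin’s information was verified in the story that opens this article.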

 

The remainder of this article provides annotations of recent research projects related to genealogy and family history research which illustrate applications of Semantic Web technologies or, more broadly, efforts towards realizing the Semantic Web vision. Genealogy is a particularly interesting area in which to study the Semantic Web because of its significant overlap with other fields such as history, anthropology, and sociology. This overlap can be seen in the variety of fields from which these articles come.

 

Research Articles:

Towards a Genealogical Ontology for the Semantic Web

By Ivo Zandhuis

Ivo Zandhuis Research & Consultancy

Published in the 2005 Proceedings of the XVI International Conference of the Association for History and Computing, this article introduces two ontologies that are different from GEDCOM: “genont,” an ontology to model personal information, and “srcont,” an ontology to model original source information. To see the results of the research project, see the AHC site for this paper, and select either sources.owl.rdf or conclusions.owl.rdf. (To see the files in OWL format, select “View Source” (or “Page Source”) from the View menu.)

 

Automating the Extraction of Genealogical Information from the Web

By Troy Walker and David W. Embley

Department of Computer Science, Brigham Young University, 2004

This article from the Data Extraction Research Group at Brigham Young University, a group that does quite a bit of research in the field, reports on a tool that extracts information from arbitrary genealogy web pages and places it in a database. The cases worked through in the paper give concrete illustrations of the difficulty raised by variation in data representations and of the use of an ontology for genealogy.

 

Integrating Knowledge, Semantics and Digital Media into the Multimedia Generation Process

By Lyndon J.B. Nixon

Networked Information Systems, Free University of Berlin

This article goes beyond genealogy research but gives an idea of an information service that could be used for genealogy. (See Example 6.) The example takes a person’s name and a genealogy ontology and creates a visual representation of the family tree, drawing on annotated photographs and genealogical information.

 

Architecting a Search Engine for the Semantic Web

By David E. Goldschmidt and Mukkai Krishnamoorthy

Rensselaer Polytechnic Institute

This article illustrates the use of RDF representation as a way of improving search results.  It describes a prototype of a new type of search engine based on inferences.

 

Retrieving Danish Genealogical Records on the Semantic Web, Technical Report

By Charla Woodbury

Department of Computer Science, Brigham Young University, 2004

This article also proposes a new type of search engine for genealogy research, one which returns information relevant to the query (and eliminates irrelevant material). It illustrates the idea with an interesting example based on Danish vital records.

 

Family History Research on the Semantic Web: Building a Semantic Prototype for Danish Research*

By Charla Woodbury and David W. Embley

Department of Computer Science, Brigham Young University, 2005

This article provides a brief overview of the Danish genealogy project mentioned in the previous article. 

* Contact BYU Family History Technology Workshop for article availability.

 

GEDCOM CGI Protocol and Web Services*

By John Finlay

Family History and Technology Workshop, 2005

This article gives an introduction to a communication protocol for use by different genealogy systems (applications) to communicate with each other.

* Contact BYU Family History Technology Workshop for article availability.

 

High-Level View of a Source-Centric Genealogical Model: “The Model with Four Boxes”*

By Randy Wilson of the LDS Church, David Ouimette and Dan Lawyer

Family History and Technology Workshop, 2006

This article proposes another model for genealogy. The four portions of the model include areas to store source information, artifact information such as scanned images, a structured data archive and an area to store family tree information.

* Contact BYU Family History Technology Workshop for article availability.

 

Translating Between Different Ontologies

By Dejing Dou, Drew McDermott and Peishen Qi

Computer Science Department, Yale University

In creating ontologies, the question arises of what to do with different ontologies that cover the same field.  This article discusses the merging and translation of ontologies.  It uses genealogy as an example. It shows how different attributes and properties from one ontology may or may not be reflected in another ontology. 

 

Text Encoding Initiative’s Report on XML mark-up of biographical and prosopographical data

By Text Encoding Initiative (TEI)

This report compares and evaluates different schemes that are used to mark up biographical information. It is included here because it also shows the overlap between the markup languages used for genealogy and those used in other fields such as history.

 

Kinship, Computing, and Anthropology*

By Stephen M. Lyon and Simeon S. Magliveras

Durham University

Contained in Social Science Computer Review, Volume 24, Number 1, 2006

This article reviews different genealogy software packages and details how each addresses kinship. Though it was written for anthropologists, the issues it raises about defining relationships within families are interesting.

* Available by subscription only. Contact Sage Publications, http://online.sagepub.com.

 

The Semantic Web for Family History

By Jay Askren

This article provides a concise overview of the Semantic Web and genealogy.  The example above for George Washington is from this page. This page contains a table from which you can see example genealogies written in different languages including HTML, XML, GEDCOM XML and RDF. 

 

Genealogy in the New Times

By Gary B. Hoffman

Written in 1999, this article issues a call to genealogists to change their view of genealogy research. Though not related to the Semantic Web, it presents a view of genealogical research in which genealogists seek to answer questions about how people lived, what their world views were, and so on, instead of simply discovering births, deaths, marriages and vital statistics. Could Semantic Web technologies be used to discover such information?

 

Other References:

Genealogy in the Library. Otis G. Hammond. Address read before the New Hampshire Library Association in 1905. Published in 1906.

 

MARGEDOS: a MARC-like format for genealogy.  Contained in Genealogy & Computers. Edited by Charles Clement. From the Proceedings of the RASD History Section, Genealogy Committee Program, Reference and Adult Services Division, American Library Association. Published in 1985.

 

GEDCOM: A Format for Genealogical Communications.  Contained in Genealogy & Computers. Edited by Charles Clement. From the Proceedings of the RASD History Section, Genealogy Committee Program, Reference and Adult Services Division, American Library Association. Published in 1985.

 

Records and Record Searching: a guide to the genealogist and topographer. Walter Rye. Published in 1897.

 

Your Family History: how to use oral history, personal family archives, and public documents to discover your heritage. Allan J. Lichtman. Published in 1978.

 

Genealogical Research on the Web. Diane K. Kovacs. Published in 2002.

 

Genealogical Devotional Address. Addresses contained in the Fourth Annual Priesthood Genealogical Research Seminar. Brigham Young University. 1969.

 

XML – A Replacement for GEDCOM? Martin Vlietstra. Contained in Computers in Genealogy. 9/2001.

 

Retrospect and Prospect – Five Years On. Peter Christian. Contained in Computers in Genealogy. 9/2001.