MSDN Magazine 
The XML Files
XML Data Migration Case Study: GEDCOM


Code download available at: XMLFiles0405.exe (143KB)





XML's ubiquity and continually improving tool support have created a magnetism that attracts organizations everywhere. As organizations move to XML, they must also provide a coherent data migration strategy that allows their users to bring old files forward. This is a nontrivial problem that typically requires tedious code to implement the transformation process. The System.Xml namespace in the Microsoft® .NET Framework, however, can greatly simplify these data migration challenges through its extensible APIs and support for XSLT.

An example of data migration is what's currently happening around genealogy data formats. Genealogists have long relied on the Genealogical Data Communications (GEDCOM) 5.5 format for sharing genealogical information. GEDCOM 5.5 is text based but not XML based. A beta version of GEDCOM 6.0 is available and is completely based on XML. But what about the gigabytes of genealogical information that can still be found in GEDCOM 5.5 format? This presents an interesting data migration challenge that really should not be ignored.

In this column I'll walk you through solving this data migration problem using System.Xml. This process can serve as a blueprint for other data migration problems you may face.


GEDCOM 101

GEDCOM was developed to facilitate exchanging genealogical data across different genealogy programs and systems. A common format like GEDCOM allows users to share their work with others regardless of the program they're using.

Version 5.5 is the most widely used version of the various GEDCOM specifications (see The GEDCOM Standard Release 5.5). GEDCOM 5.5 relies on a simple text-based grammar that leverages line delimiters and level numbers to structure family tree information. Figure 1 provides a sample GEDCOM 5.5 file.

Each GEDCOM line contains a level number, a tag, and an optional value. Multiple lines constitute a GEDCOM record. A level of 0 marks the beginning of a new GEDCOM record. Every line that follows is part of the record until you reach another line with a level of 0. The tag name conveys meaning about the information on the line as defined in the specification.

In the example shown in Figure 1, HEAD is the first record and it contains six children (SOUR, DATE, GEDC, CHAR, SUBM, SUBN). SOUR contains two children (VERS and NAME) while DATE contains one child (TIME). The increasing level numbers indicate parent-to-child relationships. A line may also contain a unique identifier, as shown in the following snippet:

0 @SUB01@ SUBM
1 NAME Aaron Skonnard
1 ADDR 456 Main
2 CONT Salt Lake City, Utah 84150

In this example, the SUBM record has the identifier SUB01, which must be unique within the scope of the document. Identifiers make it possible to establish links between records. For example, the HEAD record references this SUBM record with the following line:

1 SUBM @SUB01@

These are the basics of the GEDCOM 5.5 grammar. GEDCOM 5.5 is capable of representing sophisticated tree structures through its use of level numbers, identifiers, and cross-referencing capabilities. I don't have enough space in this column to cover GEDCOM semantics in more detail. Suffice it to say that GEDCOM 5.5 makes it possible to express a wide range of genealogy-related information.
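
To make the line grammar concrete, here is a minimal C# sketch that splits one GEDCOM 5.5 line into its level, optional identifier, tag, and optional value. The names ParsedLine and GedcomLineParser are hypothetical and do not come from the sample download, and the sketch assumes single-space delimiters and does no validation beyond the basic shape of the line.

```csharp
using System;

// Hypothetical holder for the parts of one GEDCOM 5.5 line (not from the article's sample).
public sealed class ParsedLine
{
    public int Level;
    public string Id;     // null when no @ID@ precedes the tag
    public string Tag;
    public string Value;  // null when the line carries no value
}

public static class GedcomLineParser
{
    // True for tokens of the form @SOMETHING@.
    public static bool IsPointer(string s)
    {
        return s.Length > 2 && s[0] == '@' && s[s.Length - 1] == '@';
    }

    public static ParsedLine Parse(string line)
    {
        // Split into at most three parts: level, (identifier or tag), remainder.
        string[] parts = line.Trim().Split(new[] { ' ' }, 3);
        if (parts.Length < 2)
            throw new FormatException("Expected 'level tag [value]': " + line);

        ParsedLine result = new ParsedLine();
        result.Level = int.Parse(parts[0]);
        string rest = parts.Length == 3 ? parts[2] : null;

        if (IsPointer(parts[1]))
        {
            // Shape "0 @SUB01@ SUBM": the second token is the identifier, the tag follows.
            if (rest == null)
                throw new FormatException("Identifier without a tag: " + line);
            result.Id = parts[1].Trim('@');
            string[] tagAndValue = rest.Split(new[] { ' ' }, 2);
            result.Tag = tagAndValue[0];
            result.Value = tagAndValue.Length == 2 ? tagAndValue[1] : null;
        }
        else
        {
            // Shape "1 NAME Aaron Skonnard": the second token is the tag.
            result.Tag = parts[1];
            result.Value = rest;
        }
        return result;
    }

    public static void Main()
    {
        ParsedLine p = Parse("0 @SUB01@ SUBM");
        Console.WriteLine("{0} {1} {2}", p.Level, p.Id, p.Tag); // prints "0 SUB01 SUBM"
    }
}
```

Note that a cross-reference such as "1 SUBM @SUB01@" comes back with the pointer still in Value; deciding whether a value is a reference is left to the caller in this sketch.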

Since GEDCOM was primarily designed to represent tree structures (family trees) in an interoperable manner, moving GEDCOM to an XML format seems like a perfect fit.


Mapping GEDCOM to XML

The GEDCOM 5.5 grammar maps nicely to XML. One way to define a simple XML mapping is to convert each GEDCOM line into an XML element with the same name. The line's optional data, if any, can be placed in an attribute named "value". Identifiers can be placed in an attribute named "id" and references to other records can be placed in an attribute named "idref". Figure 2 may help you visualize the mapping. Applying this simple mapping to the GEDCOM 5.5 file shown in Figure 1 produces the XML document shown in Figure 3.
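
As a rough illustration of this mapping, the following C# sketch builds a single XML element from the already separated parts of one GEDCOM line. GedcomMapping and ToElement are hypothetical names, and the rule used here to recognize a reference (a value wrapped in @ characters) is an assumption, so the result may differ in detail from Figure 3.

```csharp
using System;
using System.Xml;

public static class GedcomMapping
{
    // Builds one element for one GEDCOM line:
    //   tag        -> element name
    //   identifier -> "id" attribute
    //   @REF@      -> "idref" attribute
    //   other data -> "value" attribute
    public static XmlElement ToElement(XmlDocument doc, string id, string tag, string value)
    {
        XmlElement element = doc.CreateElement(tag);
        if (id != null)
            element.SetAttribute("id", id);
        if (value != null)
        {
            // Assumption: a value of the form @X@ is a reference to another record.
            bool isReference = value.Length > 2
                && value[0] == '@' && value[value.Length - 1] == '@';
            if (isReference)
                element.SetAttribute("idref", value.Trim('@'));
            else
                element.SetAttribute("value", value);
        }
        return element;
    }

    public static void Main()
    {
        XmlDocument doc = new XmlDocument();
        XmlElement root = doc.CreateElement("GEDCOM");
        doc.AppendChild(root);

        // The SUBM record from the snippet above, nested by hand for this demo.
        XmlElement subm = ToElement(doc, "SUB01", "SUBM", null);
        subm.AppendChild(ToElement(doc, null, "NAME", "Aaron Skonnard"));
        root.AppendChild(subm);

        Console.WriteLine(doc.OuterXml);
    }
}
```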

Figure 2 Mapping GEDCOM 5.5 to GEDCOM XML

The XML version conveys the same information, but now it can be processed by a much wider range of tools and technologies. For example, you could process the XML file with your favorite XML API (such as DOM, SAX, XmlTextReader, or XPathNavigator), a query language like XPath or XQuery, or a transformation language like XSLT. Ultimately, once you have GEDCOM data in XML format, you can do just about anything with it.
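
For instance, once identifiers and references are attributes, a single XPath expression can follow a cross-reference. The sketch below is only an illustration: the inline XML is my assumption of what the mapping produces for the SUBM snippet shown earlier, and GedcomQuery is a hypothetical name.

```csharp
using System;
using System.Xml;

public static class GedcomQuery
{
    // Assumed shape of the intermediate XML for the HEAD and SUBM lines shown earlier.
    public const string SampleXml =
        "<GEDCOM>" +
        "<HEAD><SUBM idref='SUB01'/></HEAD>" +
        "<SUBM id='SUB01'>" +
        "<NAME value='Aaron Skonnard'/>" +
        "<ADDR value='456 Main'><CONT value='Salt Lake City, Utah 84150'/></ADDR>" +
        "</SUBM>" +
        "</GEDCOM>";

    // Selects the name of the SUBM record that the HEAD record points to.
    public const string SubmitterNameQuery =
        "/GEDCOM/SUBM[@id = /GEDCOM/HEAD/SUBM/@idref]/NAME/@value";

    public static string FindSubmitterName(XmlDocument doc)
    {
        XmlNode name = doc.SelectSingleNode(SubmitterNameQuery);
        return name == null ? null : name.Value;
    }

    public static void Main()
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(SampleXml);
        Console.WriteLine(FindSubmitterName(doc));
    }
}
```

Resolving the same link against the raw GEDCOM 5.5 text would require scanning the file for the matching @SUB01@ record by hand.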


Designing the Transformation

So how do you transform the GEDCOM 5.5 file into XML format? You can't use XSLT because the input format, GEDCOM 5.5, is not XML based. Instead, you have to write code that manually parses the GEDCOM 5.5 format and produces the XML format shown in Figure 3. Using the .NET Framework, the easiest way to accomplish both of these tasks at the same time is to write a custom XmlReader class.

I've provided a complete sample that illustrates how this is done (you can download the sample from the link at the top of this article). The sample contains several classes that work together in translating the GEDCOM 5.5 information into a logical XML document. To start, I defined an interface called IGedcomNode that several other classes implement. This interface models a logical XML node that contains information from a GEDCOM source. It allows the consumer to access the node's XML name, value, and type, as well as the GEDCOM 5.5 level number that appears on each GEDCOM line. I'll need the level number later on in order to figure out when a node ends and when a new node begins.

Then I defined several concrete classes to model the different types of XML nodes that will make up the resulting XML document (see Figure 4). These include GedcomElement, GedcomAttribute, and GedcomText. Each of these classes implements the IGedcomNode interface. I also defined a class called GedcomLine whose purpose is to manage a sequence of IGedcomNode objects that came from a particular GEDCOM line.
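
The real definitions are in Figure 4 and the sample download. Purely to give a feel for the design, here is one possible shape for these types. Every member name below is an assumption on my part, GedcomText is omitted for brevity, and the queue-based GedcomLine is just one way to hold the nodes of a line.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// A logical XML node produced from a GEDCOM source (member names are assumptions).
public interface IGedcomNode
{
    string Name { get; }
    string Value { get; }
    XmlNodeType NodeType { get; }
    int Level { get; }  // level number of the GEDCOM line the node came from
}

public sealed class GedcomElement : IGedcomNode
{
    public GedcomElement(string name, int level) { Name = name; Level = level; }
    public string Name { get; private set; }
    public string Value { get { return string.Empty; } }
    public XmlNodeType NodeType { get { return XmlNodeType.Element; } }
    public int Level { get; private set; }
}

public sealed class GedcomAttribute : IGedcomNode
{
    public GedcomAttribute(string name, string value, int level)
    {
        Name = name; Value = value; Level = level;
    }
    public string Name { get; private set; }
    public string Value { get; private set; }
    public XmlNodeType NodeType { get { return XmlNodeType.Attribute; } }
    public int Level { get; private set; }
}

// Holds the nodes produced from one GEDCOM line, in document order.
public sealed class GedcomLine
{
    private readonly Queue<IGedcomNode> nodes = new Queue<IGedcomNode>();
    public void Add(IGedcomNode node) { nodes.Enqueue(node); }
    public bool HasMore { get { return nodes.Count > 0; } }
    public IGedcomNode Next() { return nodes.Dequeue(); }
}

public static class NodeModelDemo
{
    public static void Main()
    {
        // Nodes for the line "0 @SUB01@ SUBM".
        GedcomLine line = new GedcomLine();
        line.Add(new GedcomElement("SUBM", 0));
        line.Add(new GedcomAttribute("id", "SUB01", 0));
        while (line.HasMore)
        {
            IGedcomNode node = line.Next();
            Console.WriteLine("{0} {1} (level {2})", node.NodeType, node.Name, node.Level);
        }
    }
}
```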


GedcomReader Implementation

The meat of the implementation is found in the GedcomReader class. GedcomReader derives from System.Xml.XmlReader and overrides its abstract members. The fundamental purpose of GedcomReader is to take care of parsing an underlying GEDCOM 5.5 file and turn it into an XML document. Hence, the consumer doesn't need to know anything about GEDCOM 5.5. The consumer only needs to be familiar with the XML format as shown in the code in Figure 3.

My implementation of GedcomReader has a single constructor that takes a GEDCOM 5.5 file name. The constructor opens the supplied GEDCOM 5.5 file with a System.IO.StreamReader object and prepares to begin parsing. It also provides a Close method that closes the underlying file stream when the user is finished.

The Read method (see Figure 5) does the bulk of the work because it advances the logical cursor through the document. Therefore, the implementation of Read must deal with the translation between the GEDCOM 5.5 format and the target XML format that I'm trying to simulate (see Figure 3).

The implementation uses a stack to keep track of the current XML node and its ancestors. Read will periodically push nodes onto the stack and eventually pop them off as the cursor moves through the underlying document. The node at the top of the stack is considered the current node.

The implementation of Read first checks the ReadState to see if this is the first call to Read. If so, it creates an XML element named "GEDCOM" (using a GedcomElement object) and pushes it onto the stack. This is the root element of the target format (see Figure 3). Then it sets the ReadState to ReadState.Interactive.

On future calls to Read, the implementation will read a line from the underlying GEDCOM 5.5 file and call ParseGedcomLine, which then produces a GedcomLine object containing several IGedcomNode objects. The GedcomLine object serves as a cache. This is necessary because each call to Read only processes a single XML node. So after a line has been parsed, future calls will process the remaining nodes in the GedcomLine object until the end of the cached collection is reached. At this point, a new line is parsed and the process repeats itself.

The other tricky thing that Read has to deal with is inserting XmlNodeType.EndElement nodes in the correct sequence.
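
The level-number rule behind those end elements can be sketched on its own. The code below is not the Read implementation from Figure 5: it swaps the pull-model XmlReader for a simpler push-style XmlWriter so the stack logic fits in one method. GedcomToXml and Translate are hypothetical names, and the sketch assumes well-formed, single-space-delimited input.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;

public static class GedcomToXml
{
    static bool IsPointer(string s)
    {
        return s.Length > 2 && s[0] == '@' && s[s.Length - 1] == '@';
    }

    public static string Translate(TextReader input)
    {
        StringBuilder output = new StringBuilder();
        XmlWriterSettings settings = new XmlWriterSettings();
        settings.OmitXmlDeclaration = true;
        settings.Indent = true;

        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            // Level numbers of the elements that are currently open.
            Stack<int> openLevels = new Stack<int>();
            writer.WriteStartElement("GEDCOM");

            string line;
            while ((line = input.ReadLine()) != null)
            {
                line = line.Trim();
                if (line.Length == 0) continue;

                string[] parts = line.Split(new[] { ' ' }, 3);
                if (parts.Length < 2) throw new FormatException("Bad GEDCOM line: " + line);
                int level = int.Parse(parts[0]);
                string id = null;
                string tag = parts[1];
                string value = parts.Length == 3 ? parts[2] : null;
                if (IsPointer(tag) && value != null)
                {
                    // Shape "0 @SUB01@ SUBM": identifier first, then the tag.
                    id = tag.Trim('@');
                    string[] tagAndValue = value.Split(new[] { ' ' }, 2);
                    tag = tagAndValue[0];
                    value = tagAndValue.Length == 2 ? tagAndValue[1] : null;
                }

                // A line at level N ends every open element whose level is N or deeper.
                while (openLevels.Count > 0 && openLevels.Peek() >= level)
                {
                    writer.WriteEndElement();
                    openLevels.Pop();
                }

                writer.WriteStartElement(tag);
                if (id != null) writer.WriteAttributeString("id", id);
                if (value != null)
                    writer.WriteAttributeString(IsPointer(value) ? "idref" : "value",
                        value.Trim('@'));
                openLevels.Push(level);
            }

            // End of input: close whatever is still open, then the root.
            while (openLevels.Count > 0) { writer.WriteEndElement(); openLevels.Pop(); }
            writer.WriteEndElement();
        }
        return output.ToString();
    }

    public static void Main()
    {
        string gedcom = "0 @SUB01@ SUBM\n1 NAME Aaron Skonnard\n" +
                        "1 ADDR 456 Main\n2 CONT Salt Lake City, Utah 84150\n";
        Console.WriteLine(Translate(new StringReader(gedcom)));
    }
}
```

A reader-based implementation has to make the same decision, but it must surface each end element as a separate node on a later call to Read instead of writing it immediately.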


Parsing GEDCOM 5.5

The parsing code (see ParseGedcomLine) takes care of the GEDCOM 5.5 parsing details and generates the proper XML format using the various classes that implement IGedcomNode. You can change the code in this method to customize the target XML format. The code shown in Figure 5 produces the XML format shown in Figure 3.

With this code in place, overriding the remaining XmlReader members becomes quite simple (see Figure 6). For example, the implementations of LocalName and Name pull the current node from the top of the stack and ask for its name through the IGedcomNode interface. The override for NodeType does the same thing. As you can see, the core transformation functionality takes place in the implementation of Read. You can download the complete sample for more details on the remaining implementations.


Using GedcomReader

With the completed GedcomReader implementation, you can now process GEDCOM 5.5 documents as if they were really XML documents structured like the one shown in Figure 3. For example, you can begin writing XmlReader-based code that streams through the file and inspects each XML node one at a time:

GedcomReader gr = new GedcomReader("skonnord.ged");
while (gr.Read())
   Console.WriteLine("{0}: {1}", gr.NodeType, gr.Name);
gr.Close(); // done using GedcomReader

Or you could load the document into an XmlDocument and process it using XPath. For example, the following code snippet selects the first INDI element whose NAME child has a value containing 'Bernt':

GedcomReader gr = new GedcomReader("skonnord.ged");
XmlDocument doc = new XmlDocument();
doc.Load(gr);
gr.Close(); // done using GedcomReader
XmlNode indi = doc.SelectSingleNode(
  "/GEDCOM/INDI[contains(NAME/@value, 'Bernt')]");

After loading the GEDCOM 5.5 source into an XmlDocument object, you can serialize it to XML 1.0 using the Save method:

GedcomReader gr = new GedcomReader("skonnord.ged");
XmlDocument doc = new XmlDocument();
doc.Load(gr);
gr.Close(); // done using GedcomReader
doc.Save("skonnord.xml");

The generated skonnord.xml file will look like the file shown in Figure 3. Loading the GEDCOM 5.5 file and saving it achieves the complete file transformation. You can even take the transformation one step further by using XSLT to translate the intermediate XML format into something else.


XSLT for GEDCOM 6.0

The latest GEDCOM specification, version 6.0, defines a full-fledged XML format for GEDCOM information. The specification even provides a Document Type Definition (DTD) that defines the elements and attributes that make up the complete GEDCOM 6.0 vocabulary.

The GEDCOM 6.0 format is quite different from, and much more complex than, the simple mapping my GedcomReader simulates. In order to migrate to GEDCOM 6.0, you have to either modify the GedcomReader implementation to simulate the new format or write an XSLT that performs the transformation in a subsequent step. Simulating GEDCOM 6.0 directly in the GedcomReader code would be extremely difficult and error prone, so using an XSLT transformation is the more tractable approach.

The sample project includes an XSLT (shown in Figure 7) that illustrates how to generate a GEDCOM 6.0 file from the intermediate XML format shown in Figure 3. The XSLT covers the most common GEDCOM 6.0 use cases, and it produces files that pass validation against the GEDCOM 6.0 DTD.

You can use this XSLT by taking advantage of the System.Xml.Xsl.XslTransform class, as shown here:

GedcomReader gr = new GedcomReader(gedcomFileName);
XmlDocument doc = new XmlDocument();
doc.Load(gr);
gr.Close(); // done using GedcomReader

XslTransform tx = new XslTransform();
tx.Load("gedcom6.xsl");
// Dispose the stream so the output file is flushed and closed.
using (FileStream fs =
   new FileStream("skonnord6.xml", FileMode.Create))
{
   tx.Transform(doc, null, fs, null);
}

The output file, skonnord6.xml, will now be GEDCOM 6.0-compliant. At this point it's also possible to write other XSLT transformations that change the intermediate XML format into another format of your choosing. For example, you could write an XSLT transformation that produces human-readable HTML pages for viewing and navigating the GEDCOM family tree information.
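
As a small illustration of the HTML idea, the sketch below renders the names of INDI records as an HTML list. The stylesheet and the sample data are made up for this example and are not part of the sample download. The sketch also uses XslCompiledTransform, which replaced XslTransform when the latter was marked obsolete in version 2.0 of the .NET Framework; on version 1.x you would use XslTransform as shown above.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class GedcomHtml
{
    // Hypothetical stylesheet: one list item per INDI record, showing its NAME value.
    const string Stylesheet = @"<xsl:stylesheet version='1.0'
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:output method='html' indent='no'/>
  <xsl:template match='/GEDCOM'>
    <ul>
      <xsl:for-each select='INDI'>
        <li><xsl:value-of select='NAME/@value'/></li>
      </xsl:for-each>
    </ul>
  </xsl:template>
</xsl:stylesheet>";

    // Transforms intermediate GEDCOM XML (as a string) into an HTML fragment.
    public static string Render(string intermediateXml)
    {
        XslCompiledTransform xslt = new XslCompiledTransform();
        using (XmlReader styleReader = XmlReader.Create(new StringReader(Stylesheet)))
        {
            xslt.Load(styleReader);
        }

        using (XmlReader input = XmlReader.Create(new StringReader(intermediateXml)))
        using (StringWriter output = new StringWriter())
        {
            xslt.Transform(input, null, output);
            return output.ToString();
        }
    }

    public static void Main()
    {
        // Made-up records in the assumed intermediate format.
        string xml =
            "<GEDCOM>" +
            "<INDI id='I1'><NAME value='Bernt /Example/'/></INDI>" +
            "<INDI id='I2'><NAME value='Anna /Example/'/></INDI>" +
            "</GEDCOM>";
        Console.WriteLine(Render(xml));
    }
}
```

The same Render method would also accept the XML saved by XmlDocument.Save, since the stylesheet depends only on the element and attribute names of the intermediate format.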

As you can see, it didn't take much code to implement a complete GEDCOM migration path. The ability to programmatically move from GEDCOM 5.5 to GEDCOM 6.0 greatly simplifies the migration scenarios that are involved in building a complete genealogy system around the new GEDCOM 6.0 data model.


Where Are We?

I've walked through a real-world data migration scenario using the System.Xml classes in .NET. I started with the GEDCOM 5.5 format and moved to an intermediate XML format that can be processed in a variety of ways. This is made possible by a custom XmlReader implementation called GedcomReader.

The intermediate XML format can also be transformed into a variety of other formats. I was able to migrate to GEDCOM 6.0 by using an XSLT transformation (see Figure 8).

Figure 8 Migrating to GEDCOM 6.0

The extensibility points provided by System.Xml facilitate dealing with a variety of data migration scenarios like this one. If you have old data formats lying around that would benefit from migrating to XML, follow the approach discussed in this column to implement your own migration path.


Send your questions and comments for Aaron to xmlfiles@microsoft.com.

Aaron Skonnard teaches at Northface University in Salt Lake City. Aaron coauthored Essential XML Quick Reference (Addison-Wesley, 2001) and Essential XML (Addison-Wesley, 2000), and frequently speaks at conferences. Reach him at http://www.skonnard.com.

From the May 2004 issue of MSDN Magazine.

© 2007 Microsoft Corporation and CMP Media, LLC. All rights reserved; reproduction in part or in whole without permission is prohibited.
