Handling Mixed Content in a Strongly-Typed World

If you've been working with XML as a format for representing data, its usefulness for that purpose is probably obvious to you. Likewise, if you've been working with it as a format for representing narrative text (XHTML is an example), the structure XML provides has probably been handy. But if you've ever taken a programming model that's designed for one purpose and tried to bend it to the other, there's a good chance you've cursed a bit over it. That line between data and text can feel like a wall. In this article, I'm going to introduce a way to keep the strongly-typed access that's best suited for data while also handling the mixed content that comes with narrative text.

An API that uses strong types to represent the XML is great for getting tidy pieces of data out and putting them in. If you’re using XMLBeans, a technology currently incubating at Apache, you’ve probably been getting at your XML using the types you can generate from schema. Imagine having the following XML and a schema that describes it.

<item id="123456">
    <name>flangie</name>
    <description>Brand new technology to accompany our callioscim for those really 
        tough jobs.</description>
</item>

Compiling the schema would generate types that provide methods such as getItem(), setName(String), and so on. In your code, you’d ease into a getting and setting rhythm that would make handling the XML pretty straightforward – until you needed to handle something like this:

<item id="123456">
    <name>flangie</name>
    <description>Brand new technology to accompany our 
        <item-link id="654321">callioscim</item-link> 
        for those really tough jobs.</description>
</item>

In this case, the <description> element contains what’s known as “mixed content”. Mixed content is element content that contains a mixture of child elements and character data. Sometimes mixed content takes the form of frustratingly intermingled text and embedded elements, as in the preceding example. With XMLBeans’ generated types, a getItemLinkArray() method would give you all of the embedded <item-link> elements, but you’d get them as, well, an array. The array wouldn’t be able to tell you what text came before and after each <item-link> – in other words, it would lack the context that gives each <item-link> element’s value part of its meaning. Setting the <description> element’s mixed content, which needs a text-element-text sequence, would be just as frustrating.
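To make that lost context concrete, here is a minimal sketch that reads the same <description> element in document order. It deliberately uses the JDK's built-in DOM parser rather than XMLBeans, only so that it runs without extra libraries; the class name MixedContentWalk and the method describeChildren are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class MixedContentWalk {
    /** Labels each child of the root element, in document order. */
    static List<String> describeChildren(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
            List<String> parts = new ArrayList<>();
            NodeList children = root.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child.getNodeType() == Node.TEXT_NODE) {
                    // Character data between (or around) child elements.
                    parts.add("text: " + child.getNodeValue());
                } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                    Element element = (Element) child;
                    parts.add("element: " + element.getTagName()
                        + " id=" + element.getAttribute("id"));
                }
            }
            return parts;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<description>Brand new technology to accompany our "
            + "<item-link id=\"654321\">callioscim</item-link>"
            + " for those really tough jobs.</description>";
        describeChildren(xml).forEach(System.out::println);
    }
}
```

The walk yields three entries: text, then the <item-link> element, then text again. An array of <item-link> elements alone would keep only the middle entry.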

Minding your place with a cursor

As it turns out, if you’re willing to venture over that data/text line, XMLBeans provides a way to get at mixed content. With its XmlCursor interface you can step outside of the getting-and-setting, data-oriented world and move into text (and “tokens”, as we’ll see). Let's start with this very simple XML snippet:

<item id='123456'/>

The XmlCursor interface provides a way for you to “walk” the XML, moving past elements, attributes, and text, and even moving character by character through text. An XmlCursor instance sees your XML as a series of tokens, each of which belongs to a category, called a “token type”. The following example shows a cursor traversing the XML above, printing the token type and corresponding XML along the way.

// Put a snippet of XML into an XMLBeans type.
XmlObject itemXml = XmlObject.Factory.parse("<item id='123456'/>");

// Insert a new cursor at the very top of the document containing the XML. 
XmlCursor cursor = itemXml.newCursor();

/* 
 * Before moving the cursor anywhere, print out the XML and token type
 * where it is now, at the outset.
 */
System.out.println(cursor.currentTokenType().toString());
System.out.println(cursor.toString());

/*
 * Start moving the cursor forward. The toNextToken method returns the
 * type of the token it moved to, and that type's isNone method returns
 * true when there are no more tokens to move to. At each token, print
 * out the token type and corresponding XML.
 */
while (!cursor.toNextToken().isNone())
{
    System.out.println(cursor.currentTokenType().toString());
    System.out.println(cursor.toString());
}

// Signal that the cursor may be garbage collected.
cursor.dispose();

The code prints this:

STARTDOC
<item id="123456"/>

START
<item id="123456"/>
ATTR
<xml-fragment id="123456"/>
END
<xml-fragment/>
ENDDOC
<xml-fragment/>

The STARTDOC and ENDDOC tokens are bookends for the XML document as a whole, but they don’t represent any piece of the XML’s content. The START and END tokens, on the other hand, represent the start and end of the <item> element. The ATTR token represents, of course, the id attribute. As you can see, XMLBeans attempts to print valid XML for each token along the way, even when the token corresponds to something that’s not valid on its own – such as the id attribute, the end of the <item> element, or the end of the document.

In other words, through the cursor’s eyes, this little snippet of XML is a sequence of five tokens: STARTDOC, START, ATTR, END, and ENDDOC.

But what, you might be asking, does this have to do with mixed content? What I’m getting to is that a cursor does its work outside the boundary of schema types. To see what I mean, take a look at the following example. It shows how to create, cursor-style, the <item> element we started with:

// Create a new, empty XmlObject. This is a blank slate.
XmlObject itemXml = XmlObject.Factory.newInstance();

/*
 * Insert a new cursor. After this method executes, the cursor's
 * position is at STARTDOC; we're not allowed to put XML here,
 * so we have to move the cursor forward a token.
 */
XmlCursor cursor = itemXml.newCursor();

/*
 * Move the cursor to the first place where XML is allowed. After
 * this move, the cursor is between STARTDOC and ENDDOC. The
 * "current" token type is ENDDOC because that's what the cursor
 * immediately precedes.
 */
cursor.toFirstContentToken();

/*
 * Insert the new element, its attribute and the attribute's value,
 * then dispose of the cursor.
 */
cursor.beginElement("item");
cursor.insertAttributeWithValue("id", "123456");
cursor.dispose();

As you can see, there are no set methods, no schema types. It looks a little like DOM-oriented code. The XmlCursor interface provides a variety of ways to insert new XML. The beginElement method I used here creates the new <item> element and puts the cursor between its START and END tokens, so the cursor’s view is STARTDOC, START, (cursor), END, ENDDOC.

After calling the insertAttributeWithValue method, the cursor’s view is STARTDOC, START, ATTR, (cursor), END, ENDDOC. Notice how the cursor shifts to the right with each inserted item.

If the code had gone on to insert text using the cursor's insertChars method, the cursor's view of the XML would have been STARTDOC, START, ATTR, TEXT, (cursor), END, ENDDOC.

The TEXT token differs from the others in that the cursor can dive into the text corresponding to a TEXT token, moving over the text as chars. That's how we're going to use the cursor to insert intermingled character data and elements.
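Since the cursor code resembles DOM-oriented code, it may help to see the target text-element-text shape built with the JDK's own DOM API first. This is a comparison sketch only, not XMLBeans code; the class name MixedContentBuild and the method buildDescription are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class MixedContentBuild {
    /** Builds a <description> whose children are text, an element, then text. */
    static Element buildDescription() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();
            Element description = doc.createElement("description");
            doc.appendChild(description);

            // Leading text.
            description.appendChild(
                doc.createTextNode("Brand new technology to accompany our "));

            // The embedded element, with its attribute and its own text.
            Element link = doc.createElement("item-link");
            link.setAttribute("id", "654321");
            link.appendChild(doc.createTextNode("callioscim"));
            description.appendChild(link);

            // Trailing text.
            description.appendChild(
                doc.createTextNode(" for those really tough jobs."));
            return description;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DOM implementation available", e);
        }
    }

    public static void main(String[] args) {
        Element description = buildDescription();
        System.out.println(description.getChildNodes().getLength());
        System.out.println(description.getTextContent());
    }
}
```

The order of the three appendChild calls on the <description> element is what carries the meaning. A cursor produces the same ordering by inserting at its current position and shifting right after each insert.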

Mixing it up

To show the cursor doing some real mixed content work, I’ll write code that assembles XML like the mixed content snippet I started out with. Remember that it had a <description> child element that contained character data and an element intermingled, like this:

<item id="123456">
    <name>flangie</name>
    <description>Brand new technology to accompany our 
        <item-link id="654321">callioscim</item-link> 
        for those really tough jobs.</description>
</item>

In the application I’m thinking of, this is a snippet from a much larger document full of <item> elements – maybe a list for a catalog. The list is cross referenced, meaning that the callioscim item mentioned in the preceding description actually corresponds to another <item> in the list. The <item-link> element is a way to do the cross referencing, which could be useful for making hyperlinks when the XML is transformed to HTML.

Part of the application’s process for adding a new item to the catalog is to search the new item’s description for the names of other items in the catalog, then convert those names to <item-link> elements with the correct id. The purpose of the following code would be to take the description text for an item about to be added and use it, along with a list of items already in the catalog, to create cross references. Because the application uses schema to handle and validate the XML, this code returns an XMLBean generated from schema, a DescriptionType instance, that can be inserted into the new item XML.

    
/**
 * Processes item description text, searching for the names of items mentioned
 * in the description and replacing them with cross references to the listed item.
 * 
 * @param descriptionText The description to process.
 * @param itemArray A list of the items that might be mentioned in the description. 
 * @return A <description> element containing the processed description.
 */
private DescriptionType processItemDescription(String descriptionText,
    ItemType[] itemArray)
{
    /*
     * Create a new DescriptionType right off the bat to hold the description.
     */
    DescriptionType itemDescription = DescriptionType.Factory.newInstance();

    /*
     * Create a cursor at the edge of the description XML, then move it into position.
     */
    XmlCursor descriptionCursor = itemDescription.newCursor();
    descriptionCursor.toFirstContentToken(); 

    /*
     * Insert the description text as we received it, whether or not it contains
     * item names we want to cross reference. If there aren't any items named
     * we'll just return this as is.
     */
    descriptionCursor.insertChars(descriptionText);

    /* 
     * As before, inserting something shifted the cursor to the right, to the
     * <description> element's END token. Calling toPrevToken moves the cursor 
     * back to the beginning of the text (a TEXT token, not surprisingly). From    
     * here we can begin searching for item names.
     */ 
    descriptionCursor.toPrevToken();

    /*
     * Here's where the real work is done. This code loops through the list of known
     * items, using the name of each to search the description text for a match.
     * If a match is found the matching text is replaced with an <item-link>
     * element.
     */
    for (int i = 0; i < itemArray.length; i++)
    {
        // Isolate the current item and take its name.
        ItemType linkedItem = itemArray[i]; 
        String linkedItemName = linkedItem.getName();

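        /* 
         * If the item name is in the description text, set about replacing it.
         * If not, then just move on to the next item. The indexOf method returns
         * -1 when there's no match, so an index of 0 (a name at the very start
         * of the text) still counts as a match.
         */
        if (descriptionText.indexOf(linkedItemName) >= 0)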
        {
            /*
             * We have a match, so store the cursor's location before
             * moving it. The push method stores the cursor's current location,
             * allowing us to come back to this location (the beginning of the
             * description text) quickly with a call to the pop method.
             */
            descriptionCursor.push();

            /*
             * Move the cursor to a location immediately preceding the
             * matching text. The toNextChar method takes an int that
             * specifies the number of chars to move.
             */
            int charsMoved = descriptionCursor.toNextChar
                        (descriptionText.indexOf(linkedItemName));

            /* 
             * Remove the matching name. The removeChars method takes
             * an int that specifies how many chars to remove.
             */
            descriptionCursor.removeChars(linkedItemName.length());

            /* 
             * Move the cursor back to the beginning of the text, then store
             * the location again.
             */
            descriptionCursor.pop();
            descriptionCursor.push();

            /*
             * Move the cursor back to where we removed the item name,
             * then use beginElement to insert its replacement. Here, we use
             * the version of the method that takes a namespace URI – 
             * remember that the <description> and <item-link> 
             * elements are defined in a schema. We want to ensure that edits  
             * we make are accompanied by the schema's namespace URI.
             */
            charsMoved = descriptionCursor.toNextChar
                        (descriptionText.indexOf(linkedItemName));
            descriptionCursor.beginElement("item-link", 
                "http://widgetsco.com/inventory/Inventory.xsd");

            /*
             * Insert an id attribute with a value retrieved from the matching 
             * item in the list.
             */
            descriptionCursor.insertAttributeWithValue("id", 
                String.valueOf(linkedItem.getId()));

            /* 
             * Insert the name of the matching item as the <item-link> 
             * element's content.
             */
            descriptionCursor.insertChars(linkedItemName);

            // Move the cursor back to the stored location to start over again.
            descriptionCursor.pop();
        }
    }

    /*
     * Ensure that the work we've done has created valid XML. If it 
     * hasn't, then return null so that the caller can handle it. There's a 
     * version of the validate method that allows you to store information 
     * about what went wrong if the XML is invalid, but that would make this  
     * example a bit long.
     */
    if (!itemDescription.validate())
    {
        itemDescription = null;
    }

    // Signal that the cursor may be garbage collected.
    descriptionCursor.dispose();
    return itemDescription; 
}
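The method above appears to assume that each item name matches at most once, and that offsets taken from descriptionText stay usable after an <item-link> has been inserted. Once an element sits in the middle of the text, toNextChar stops at it, so a second match further along would need different bookkeeping. The sketch below isolates just the matching step in plain Java, with no XMLBeans types, to show one way of handling repeated and multiple names in a single left-to-right pass. The class name ItemLinker, the method linkItems, and the name-to-id map are hypothetical, and the result is a string of mixed content rather than a DescriptionType.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ItemLinker {
    /**
     * Returns the description as escaped mixed content in which every
     * occurrence of a known item name is wrapped in an <item-link> element.
     */
    static String linkItems(String description, Map<String, Integer> idsByName) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (pos < description.length()) {
            // Find the earliest match at or after pos; on a tie, prefer the longer name.
            int bestStart = -1;
            String bestName = null;
            for (String name : idsByName.keySet()) {
                if (name.isEmpty()) {
                    continue;
                }
                int start = description.indexOf(name, pos);
                boolean earlier = bestStart < 0 || start < bestStart
                    || (start == bestStart && name.length() > bestName.length());
                if (start >= 0 && earlier) {
                    bestStart = start;
                    bestName = name;
                }
            }
            if (bestStart < 0) {
                // No more names: the rest is plain text.
                out.append(escape(description.substring(pos)));
                break;
            }
            // Text before the match, then the link element, then carry on after it.
            out.append(escape(description.substring(pos, bestStart)));
            out.append("<item-link id=\"").append(idsByName.get(bestName)).append("\">")
               .append(escape(bestName)).append("</item-link>");
            pos = bestStart + bestName.length();
        }
        return out.toString();
    }

    /** Escapes the characters that are not allowed in XML character data. */
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        Map<String, Integer> ids = new LinkedHashMap<>();
        ids.put("callioscim", 654321);
        ids.put("flangie", 123456);
        System.out.println(linkItems("callioscim and flangie, plus callioscim", ids));
    }
}
```

Because the pass never revisits text it has already emitted, the offsets it works with are always offsets into the original string. A cursor-based version could follow the same left-to-right order, inserting each text run and element as it goes.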

Finally, you may be realizing another aspect of the type-agnostic work of cursors: it’s startlingly easy to create XML that doesn’t conform to your schema, or to edit your XML into an invalid state. While it’s generally true that XMLBeans permits you to create invalid XML even when you’re using schema types, it’s even easier to do so with cursors. In other words, when schema matters, be sure to call the validate method on any XML you build or edit with a cursor.
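As a rough illustration of what that check guards against, the sketch below validates two <description> fragments against a guessed schema. The article's actual schema isn't shown, so the schema here is an assumption: it declares <description> as mixed content allowing only <item-link> children, and it omits the namespace for brevity. It also uses the JDK's javax.xml.validation package in place of the XMLBeans validate method, purely so that it runs without extra libraries. The class name MixedContentValidation and the method isValid are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

class MixedContentValidation {
    // Assumed schema: <description> is mixed content that may hold <item-link> children.
    static final String SCHEMA =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='description'>"
        + "<xs:complexType mixed='true'><xs:sequence>"
        + "<xs:element name='item-link' minOccurs='0' maxOccurs='unbounded'>"
        + "<xs:complexType><xs:simpleContent>"
        + "<xs:extension base='xs:string'>"
        + "<xs:attribute name='id' type='xs:int' use='required'/>"
        + "</xs:extension>"
        + "</xs:simpleContent></xs:complexType>"
        + "</xs:element>"
        + "</xs:sequence></xs:complexType>"
        + "</xs:element>"
        + "</xs:schema>";

    /** Returns true if the XML conforms to the assumed schema. */
    static boolean isValid(String xml) {
        try {
            Schema schema = SchemaFactory
                .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(SCHEMA)));
            Validator validator = schema.newValidator();
            try {
                validator.validate(new StreamSource(new StringReader(xml)));
                return true;
            } catch (SAXException invalid) {
                // With no error handler set, the first validation error is thrown.
                return false;
            }
        } catch (SAXException | IOException e) {
            throw new IllegalStateException("Could not set up validation", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<description>Brand new technology to accompany our "
            + "<item-link id='654321'>callioscim</item-link>"
            + " for those really tough jobs.</description>"));
        System.out.println(isValid(
            "<description>Brand new <b>technology</b></description>"));
    }
}
```

The first fragment matches the assumed schema. The second contains an element the schema doesn't allow, which is the kind of slip that type-agnostic editing makes easy.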