Parsing, indexing, and searching XML with Digester and Lucene

These open source projects can ease your XML-handling tasks

Otis Gospodnetic (otis@apache.org), Software Engineer, Wireless Generation, Inc.
Otis Gospodnetic is an active Apache Jakarta member, a developer of Lucene, and maintainer of the jGuru's Lucene FAQ. His professional interests include Web crawlers, information gathering and retrieval, and distributed computing. Otis currently lives in New York City and can be reached at otis@apache.org.

Summary:  Java developers can use the SAX interface to parse XML documents, but this process is rather complex. Digester and Lucene, two open source projects from the Apache Foundation, cut down your development time for projects in which you manipulate XML. Lucene developer Otis Gospodnetic shows you how it's done, with example code that you can compile and run.

Date:  03 Jun 2003
Level:  Intermediate


If you've ever wanted to parse XML documents but have found SAX just a little difficult, this article is for you. In this article, we examine how to use two open source tools from the Apache Jakarta project, Commons Digester and Lucene, to handle the parsing, indexing, and searching of XML documents. Digester parses the XML data, and Lucene handles indexing and searching. You'll first see how to use each tool on its own and then how to use them together, with sample code that you can compile and run.

About Digester and Lucene

Commons Digester is a subproject of the Commons project, which is one of the initiatives developed by the community of developers who create open source software under the Apache Jakarta umbrella. Digester offers a simple and high-level interface for the mapping of XML documents to Java objects. When Digester finds developer-defined patterns in XML, it will take developer-specified actions. Digester requires a few additional Java libraries, including an XML parser compatible with either SAX 2.0 or JAXP 1.1. Digester's home page, listed in the Resources section at the end of this article, provides a short list of the libraries that Digester needs.

Lucene is another Apache Jakarta project. Like Digester, it is a Java library and not a stand-alone application. Behind its simple indexing and search interface hides an elegant piece of software capable of handling many documents.

In the rest of this article, we use Digester to parse a simple XML file, then illustrate how Lucene creates indices. Then we marry the two tools to create a Lucene-generated index from our sample XML document, and finally use Lucene classes to search through that index.


Using Digester to parse XML

We use Digester to parse the simple XML document in Listing 1, which contains entries in an imaginary address book. To demonstrate handling of elements with and without attributes, I decided to make type an attribute of the <contact> element, while leaving all other elements without any attributes.


Listing 1. XML snippet of a fictitious address book
<?xml version='1.0' encoding='utf-8'?>
<address-book>
    <contact type="individual">
        <name>Zane Pasolini</name>
        <address>999 W. Prince St.</address>
        <city>New York</city>
        <province>NY</province>
        <postalcode>10013</postalcode>
        <country>USA</country>
        <telephone>1-212-345-6789</telephone>
    </contact>
    <contact type="business">
        <name>SAMOFIX d.o.o.</name>
        <address>Ilica 47-2</address>
        <city>Zagreb</city>
        <province></province>
        <postalcode>10000</postalcode>
        <country>Croatia</country>
        <telephone>385-1-123-4567</telephone>
    </contact>
</address-book>

Using Digester to parse the above XML document is very simple, as Listing 2 illustrates.

The most involved part of using Digester is centralized in the main() method. After creating an instance of Digester, we have to create rules for actions that are to be triggered when certain patterns are encountered in the XML document that we are parsing. We'll look in more detail at each Digester rule that we defined in Listing 2. Note that the order in which rules are passed to Digester matters a great deal.

The first rule tells Digester to create an instance of the AddressBookParser class when the pattern "address-book" is found. Because <address-book> is the first element in the address book XML file, this rule will be the first to be triggered when we use Digester with our XML file.

digester.addObjectCreate("address-book", AddressBookParser.class);

This next rule instructs Digester to create an instance of class Contact when it finds the <contact> child element under the <address-book> parent.

digester.addObjectCreate("address-book/contact", Contact.class);

In the following snippet, we set the type property of the Contact instance when Digester finds the type attribute of the <contact> element.

digester.addSetProperties("address-book/contact", "type", "type");

Our AddressBookParser class contains several rules that look similar to the one shown below. They instruct Digester to invoke the setName() method of the Contact class instance and use the value enclosed by <name> elements as the method parameter.

digester.addCallMethod("address-book/contact/name", "setName", 0);
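To make the idea behind this rule more concrete, here is a minimal sketch that uses only the standard library. It is not Digester's implementation, and both the Contact bean and the CallMethodSketch helper are hypothetical names chosen for illustration. The sketch looks up a method that takes a single String argument and invokes it with the text found inside an element, which is roughly what a call-method rule with a parameter count of 0 asks Digester to do.

```java
import java.lang.reflect.Method;

/** Hypothetical bean with one property; a stand-in for the article's Contact class. */
class Contact {
    private String name;

    public void setName(String name) { this.name = name; }

    public String getName() { return name; }
}

/** Illustration only: invokes a one-String-argument method chosen by name. */
class CallMethodSketch {
    static void callWithBody(Object target, String methodName, String bodyText) {
        try {
            // Find a public method such as setName(String) on the target object
            Method method = target.getClass().getMethod(methodName, String.class);
            // Pass the element's body text as the single argument
            method.invoke(target, bodyText);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot call " + methodName, e);
        }
    }

    public static void main(String[] args) {
        Contact contact = new Contact();
        // Simulates the parser having just read <name>Zane Pasolini</name>
        callWithBody(contact, "setName", "Zane Pasolini");
        System.out.println(contact.getName());
    }
}
```

Because the method is resolved by name at run time, a misspelled method name surfaces only when the matching element is parsed, not at compile time.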

Finally, this rule tells Digester to call the addContact() method when it finds the closing </contact> element.

digester.addSetNext("address-book/contact", "addContact");

Again, it's important that you consider the order in which the rules are passed to Digester. While we could change the order of the various addCallMethod() rules in our class and still have properly functioning code, switching the order of addObjectCreate() and addSetNext() would result in an error.
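For comparison, the sketch below performs the same steps by hand with the SAX classes that ship with the JDK. It is not how Digester is implemented; the PathMatchingSketch class and its nested Entry class are hypothetical, and only the type and name values are handled. It tracks the path of open elements and reacts when that path equals one of the patterns used in the rules above, which gives a feel for the bookkeeping that Digester takes off your hands.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Illustration only: hand-written SAX code for what the rules declare. */
class PathMatchingSketch extends DefaultHandler {
    /** Hypothetical stand-in for the article's Contact class. */
    static class Entry {
        String type;
        String name;
    }

    final List<Entry> contacts = new ArrayList<>();
    private final StringBuilder body = new StringBuilder();
    private String path = "";   // for example "address-book/contact/name"
    private Entry current;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        path = path.isEmpty() ? qName : path + "/" + qName;
        body.setLength(0);
        if (path.equals("address-book/contact")) {
            current = new Entry();                  // compare addObjectCreate()
            current.type = attrs.getValue("type");  // compare addSetProperties()
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        body.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (path.equals("address-book/contact/name")) {
            current.name = body.toString().trim();  // compare addCallMethod()
        } else if (path.equals("address-book/contact")) {
            contacts.add(current);                  // compare addSetNext()
        }
        // Drop the last path segment now that the element is closed
        int slash = path.lastIndexOf('/');
        path = slash < 0 ? "" : path.substring(0, slash);
    }

    static List<Entry> parse(String xml) {
        try {
            PathMatchingSketch handler = new PathMatchingSketch();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.contacts;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<address-book>"
            + "<contact type=\"individual\"><name>Zane Pasolini</name></contact>"
            + "<contact type=\"business\"><name>SAMOFIX d.o.o.</name></contact>"
            + "</address-book>";
        for (Entry entry : parse(xml)) {
            System.out.println(entry.type + ": " + entry.name);
        }
    }
}
```

Every additional element means another branch in endElement(), whereas with Digester it means one more rule.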


Using Lucene to index text

There are four fundamental Lucene classes for indexing text: IndexWriter, Analyzer, Document, and Field.

The IndexWriter class creates a new index and adds documents to an existing index.

Before text is indexed, it is passed through an Analyzer. Analyzer classes are in charge of extracting indexable tokens out of text to be indexed and eliminating the rest. Lucene comes with a few different Analyzer implementations. Some of them deal with skipping stop words (frequently used words that don't help distinguish one document from the other, such as a, an, the, in, and on), for instance, while others deal with converting all tokens to lowercase letters, so that searches are not case sensitive.
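The following sketch illustrates the kinds of decisions an analyzer makes, using only the standard library. The AnalyzerSketch class and its two methods are hypothetical and are not part of Lucene; the stop-word list is simply the handful of words mentioned above. The first method splits on whitespace and leaves tokens untouched, and the second also lowercases tokens and drops stop words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Illustration only: these helpers are hypothetical, not Lucene classes. */
class AnalyzerSketch {
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "the", "in", "on"));

    /** Splits on runs of whitespace and keeps every token unchanged. */
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Lowercases each token and drops the stop words. */
    static List<String> lowercaseWithoutStopWords(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : whitespaceTokens(text)) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(lower)) {
                tokens.add(lower);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The office on 999 W. Prince St.";
        System.out.println(whitespaceTokens(text));
        System.out.println(lowercaseWithoutStopWords(text));
    }
}
```

The choice matters at search time as well: whatever transformation is applied to the text while indexing determines which terms a query has to use in order to match.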

An index consists of a set of Documents, and each Document consists of one or more Fields. Each Field has a name and a value. You can think of a Document as a row in an RDBMS, and Fields as columns in that row.

Let's consider a simple scenario in which we add a single contact entry with all its fields to the index. Listing 3 shows how we could do it, using the classes we just described.


Listing 3. Lucene-based address book indexer
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;


/**
 * <code>AddressBookIndexer</code> class provides a simple
 * example of indexing with Lucene.  It creates a fresh
 * index called "address-book" in a temporary directory every
 * time it is invoked and adds a single document with a
 * few fields to it.
 */
public class AddressBookIndexer
{
    public static void main(String[] args) throws Exception
    {
        // Store the index in <temp dir>/address-book
        String indexDir =
            System.getProperty("java.io.tmpdir", "tmp") +
            System.getProperty("file.separator") + "address-book";
        // Split text into tokens on whitespace only
        Analyzer analyzer = new WhitespaceAnalyzer();
        // true: create a new index, overwriting any existing one
        boolean createFlag = true;

        IndexWriter writer = new IndexWriter(indexDir, analyzer, createFlag);
        // One Document per contact, one Field per piece of contact data
        Document contactDocument = new Document();
        contactDocument.add(Field.Text("type", "individual"));
        contactDocument.add(Field.Text("name", "Zane Pasolini"));
        contactDocument.add(Field.Text("address", "999 W. Prince St."));
        contactDocument.add(Field.Text("city", "New York"));
        contactDocument.add(Field.Text("province", "NY"));
        contactDocument.add(Field.Text("postalcode", "10013"));
        contactDocument.add(Field.Text("country", "USA"));
        contactDocument.add(Field.Text("telephone", "1-212-345-6789"));
        writer.addDocument(contactDocument);
        // Closing the writer flushes all index changes to disk
        writer.close();
    }
}

What exactly is happening here? Lucene indices are stored in directories in the filesystem. Each index is contained within a single directory, and multiple indices cannot share a directory. The first parameter in IndexWriter's constructor specifies the directory where the index should be stored. The second parameter provides the implementation of Analyzer that should be used for preprocessing the text before it is indexed. The particular implementation of Analyzer that we are using here uses the whitespace character as the delimiter for tokenizing the input. The last parameter is a boolean flag that, when true, tells IndexWriter to create a new index in the specified directory, or to overwrite any existing index in that directory. A value of false instructs IndexWriter to add Documents to an existing index instead. We then create a blank Document, and add several Text Fields to it. After the Document is populated, we add it to the index through the instance of IndexWriter; finally, we close the index. Closing the IndexWriter is important, as doing so ensures that all index changes are flushed to the disk.

It is important to note that Lucene offers several types of Fields. In this example I used the Text Fields, because Lucene doesn't just index them, but also stores their original value verbatim in the index. This allows us to show all the contact fields when searching the index. To learn more about other types of Fields in Lucene, see the Resources section.
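To illustrate the difference between indexing a value and storing it, here is a toy in-memory index written with the standard library only. It is a sketch under simplifying assumptions, not Lucene's data structures or file format, and the StoredFieldSketch class is hypothetical. Each field value is split into tokens that can be searched, and the original value is also kept verbatim so that it can be displayed with the search results.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustration only: a toy in-memory index, not Lucene's implementation. */
class StoredFieldSketch {
    // "field:token" -> ids of the documents that contain the token in that field
    private final Map<String, List<Integer>> postings = new HashMap<>();
    // document id -> original field values, kept verbatim
    private final List<Map<String, String>> stored = new ArrayList<>();

    /** Adds one document; every field is both tokenized and stored. */
    int addDocument(Map<String, String> fields) {
        int id = stored.size();
        stored.add(new LinkedHashMap<>(fields));
        for (Map.Entry<String, String> field : fields.entrySet()) {
            for (String token : field.getValue().trim().split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                List<Integer> ids = postings.computeIfAbsent(
                    field.getKey() + ":" + token, key -> new ArrayList<>());
                // Record each document at most once per term
                if (ids.isEmpty() || ids.get(ids.size() - 1) != id) {
                    ids.add(id);
                }
            }
        }
        return id;
    }

    /** Returns the stored value of fieldToShow for each document matching the term. */
    List<String> search(String field, String token, String fieldToShow) {
        List<String> results = new ArrayList<>();
        List<Integer> ids =
            postings.getOrDefault(field + ":" + token, Collections.emptyList());
        for (int id : ids) {
            results.add(stored.get(id).get(fieldToShow));
        }
        return results;
    }

    public static void main(String[] args) {
        StoredFieldSketch index = new StoredFieldSketch();
        Map<String, String> contact = new LinkedHashMap<>();
        contact.put("name", "Zane Pasolini");
        contact.put("city", "New York");
        index.addDocument(contact);
        // The search matches a single token, but the full stored value comes back
        System.out.println(index.search("city", "York", "name"));
    }
}
```

If the stored map were dropped, the search could still report which documents match, but it could no longer show the contact's original field values.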


Marrying Digester and Lucene

Now that you know how to use each of these tools on its own, we can combine the two classes we've written. We'll use Digester to handle XML parsing, and Lucene to handle indexing. You can see the resulting DigesterMarriesLucene class in Listing 4.

Let's look at some selections from this class in more detail. Just as we did in the AddressBookIndexer class, we need to open the Lucene index for writing using IndexWriter; we do so here in Listing 5. We pass in the path to the index directory, the Analyzer to process all data being indexed, and a createFlag that is set to true, so that a new index is created in that directory, overwriting any index that is already there.


Listing 5. Opening the index for writing
String indexDir =
    System.getProperty("java.io.tmpdir", "tmp") +
    System.getProperty("file.separator") + "address-book";
Analyzer analyzer = new WhitespaceAnalyzer();
boolean createFlag = true;

// IndexWriter to use for adding contacts to the index
writer = new IndexWriter(indexDir, analyzer, createFlag);

The modified addContact(Contact) method shown in Listing 6 now creates a fresh instance of the Lucene Document every time it is called. After the Document is populated with data from the Contact instance that is passed into the method, it is added to the index through an instance of IndexWriter.


Listing 6. New addContact(Contact) method adds the document to the index
Document contactDocument  = new Document();
contactDocument.add(Field.Text("type", contact.getType()));
contactDocument.add(Field.Text("name", contact.getName()));
contactDocument.add(Field.Text("address", contact.getAddress()));
contactDocument.add(Field.Text("city", contact.getCity()));
contactDocument.add(Field.Text("province", contact.getProvince()));
contactDocument.add(Field.Text("postalcode", contact.getPostalcode()));
contactDocument.add(Field.Text("country", contact.getCountry()));
contactDocument.add(Field.Text("telephone", contact.getTelephone()));
writer.addDocument(contactDocument);

Finally, in Listing 7, at the end of the main() method, the index is optimized and closed to ensure that all Documents added to it are indeed written to the index on the disk.


Listing 7. Optimizing and closing the index
// optimize and close the index
writer.optimize();
writer.close();


Using Lucene to search text

Now that we can create a Lucene index from a document containing address book entries encoded in XML, all we need is the ability to search that index. Lucene's API for searching is as simple as the indexing API. In the class in Listing 8, we search the index we created with the DigesterMarriesLucene class. Here, we run a query that looks for all contacts that contain the keyword "Zane" in the field called name.


Listing 8. Searching the address book index created with the Lucene indexer
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import java.io.IOException;


/**
 * <code>AddressBookSearcher</code> class provides a simple
 * example of searching with Lucene.  It looks for an entry whose
 * 'name' field contains keyword 'Zane'.  The index being searched
 * is called "address-book", located in a temporary directory.
 */
public class AddressBookSearcher
{
    public static void main(String[] args) throws IOException
    {
        String indexDir =
            System.getProperty("java.io.tmpdir", "tmp") +
            System.getProperty("file.separator") + "address-book";
        IndexSearcher searcher = new IndexSearcher(indexDir);
        // TermQuery matches the exact term "Zane" in the "name" field
        Query query = new TermQuery(new Term("name", "Zane"));
        Hits hits = searcher.search(query);
        System.out.println("NUMBER OF MATCHING CONTACTS: " + hits.length());
        for (int i = 0; i < hits.length(); i++)
        {
            System.out.println("NAME: " + hits.doc(i).get("name"));
        }
        // Release the index files held by the searcher
        searcher.close();
    }
}

You can see that the IndexSearcher class is used for accessing an existing index. The argument passed to its constructor is the path to the directory where the index is stored. Lucene provides a few different query types, and TermQuery is the simplest of them. The query in the code above will find all listings that contain the term "Zane" in a field called name. The call to IndexSearcher's search(Query) method executes the search against the index and returns a collection of matching Documents in an instance of Hits.
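One consequence is worth spelling out: a TermQuery is compared against the indexed terms exactly as given, and the whitespace-based Analyzer we used for indexing does not change the case of tokens. The sketch below models that behavior with a plain set of tokens. It uses only the standard library, and the TermMatchSketch class is hypothetical rather than part of Lucene.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Illustration only: models exact term lookup with a plain set of tokens. */
class TermMatchSketch {
    /** Tokens produced by splitting on whitespace, with case preserved. */
    static Set<String> indexedTerms(String fieldValue) {
        return new HashSet<>(Arrays.asList(fieldValue.trim().split("\\s+")));
    }

    /** A term matches only if it equals an indexed token exactly. */
    static boolean matches(String fieldValue, String term) {
        return indexedTerms(fieldValue).contains(term);
    }

    public static void main(String[] args) {
        System.out.println(matches("Zane Pasolini", "Zane"));  // prints true
        System.out.println(matches("Zane Pasolini", "zane"));  // prints false
        System.out.println(matches("Zane Pasolini", "Zan"));   // prints false
    }
}
```

For case-insensitive searches, the text would have to be lowercased at indexing time and the query term lowercased to match.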

While this search example is very simple, note that Lucene offers a rich set of search-related features. For instance, you can use several different types of queries with Lucene: boolean queries, wild-card queries, phrase queries, and so on. Lucene also offers the ability to search multiple indices at once, as well as the ability to search indices located on remote computers. Another useful feature is Lucene's QueryParser, which supports a powerful and user-friendly query syntax. For more information about Lucene's query syntax, see the Resources section.


Conclusion

You should now have a good understanding of how to use Jakarta Commons Digester to parse XML documents and how to use Jakarta Lucene to index XML documents and search the resulting index. The approach described in this article should satisfy the simple XML indexing and searching needs of most developers. You should also take a look at the Sandbox subproject of Lucene, which includes examples of indexing of XML documents using SAX 2 and DOM parsers. For more complex and generic solutions, visit Lucene's contributions page, a link to which is included in Resources.


Resources

