If you've ever wanted to parse XML documents but have found SAX just a little difficult, this article is for you. In this article, we examine how to use two open source tools from the Apache Jakarta project, Commons Digester and Lucene, to handle the parsing, indexing, and searching of XML documents. Digester parses the XML data, and Lucene handles indexing and searching. You'll first see how to use each tool on its own and then how to use them together, with sample code that you can compile and run.
Commons Digester is a subproject of the Commons project, which is one of the initiatives developed by the community of developers who create open source software under the Apache Jakarta umbrella. Digester offers a simple and high-level interface for the mapping of XML documents to Java objects. When Digester finds developer-defined patterns in XML, it will take developer-specified actions. Digester requires a few additional Java libraries, including an XML parser compatible with either SAX 2.0 or JAXP 1.1. Digester's home page, listed in the Resources section at the end of this article, provides a short list of the libraries that Digester needs.
Lucene is another Apache Jakarta project. Like Digester, it is a Java library and not a stand-alone application. Behind its simple indexing and search interface hides an elegant piece of software capable of handling many documents.
In the rest of this article, we use Digester to parse a simple XML file, then illustrate how Lucene creates indices. Then we marry the two tools to create a Lucene-generated index from our sample XML document, and finally use Lucene classes to search through that index.
We use Digester to parse the simple XML document in Listing 1, which contains entries in an imaginary address book. To demonstrate handling of elements with and without attributes, I decided to make type an attribute of the <contact> element, while leaving all other elements without any attributes.
Listing 1. XML snippet of a fictitious address book
    <?xml version='1.0' encoding='utf-8'?>
    <address-book>
        <contact type="individual">
            <name>Zane Pasolini</name>
            <address>999 W. Prince St.</address>
            <city>New York</city>
            <province>NY</province>
            <postalcode>10013</postalcode>
            <country>USA</country>
            <telephone>1-212-345-6789</telephone>
        </contact>
        <contact type="business">
            <name>SAMOFIX d.o.o.</name>
            <address>Ilica 47-2</address>
            <city>Zagreb</city>
            <province></province>
            <postalcode>10000</postalcode>
            <country>Croatia</country>
            <telephone>385-1-123-4567</telephone>
        </contact>
    </address-book>
Using Digester to parse the above XML document is very simple, as Listing 2 illustrates.
The most involved part of using Digester is centralized in the main() method. After creating an instance of Digester, we have to create rules for actions that are to be triggered when certain patterns are encountered in the XML document that we are parsing. We'll look in more detail at each Digester rule that we defined in Listing 2. Note that the order in which rules are passed to Digester matters a great deal.
The first rule tells Digester to create an instance of the AddressBookParser class when the pattern "address-book" is found. Because <address-book> is the first element in the address book XML file, this rule will be the first to be triggered when we use Digester with our XML file.

    digester.addObjectCreate("address-book", AddressBookParser.class);
This next rule instructs Digester to create an instance of class Contact when it finds the <contact> child element under the <address-book> parent.

    digester.addObjectCreate("address-book/contact", Contact.class);
In the following snippet, we set the type property of the Contact instance when Digester finds the type attribute of the <contact> element.

    digester.addSetProperties("address-book/contact", "type", "type");
Our AddressBookParser class contains several rules that look similar to the one shown below. They instruct Digester to invoke the setName() method of the Contact class instance and use the value enclosed by <name> elements as the method parameter.

    digester.addCallMethod("address-book/contact/name", "setName", 0);
Finally, this rule tells Digester to call the addContact() method when it finds the closing </contact> element.

    digester.addSetNext("address-book/contact", "addContact");
Again, it's important that you consider the order in which the rules are passed to Digester. While we could change the order of the various addCallMethod() rules in our class and still have properly functioning code, switching the order of addObjectCreate() and addSetNext() would result in an error.
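To make the role of each rule more concrete, here is a rough sketch of the same mapping written directly against SAX, using only classes that ship with the JDK. It is not the article's Listing 2 and does not use Digester at all; the class SaxAddressBookSketch and its trimmed-down Contact (with only type and name) are hypothetical names that I made up for this illustration. The comments mark which Digester rule each hand-written step roughly corresponds to.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration: hand-written SAX code doing what the Digester rules declare.
final class SaxAddressBookSketch {

    /** Minimal stand-in for the article's Contact class (only two properties). */
    public static final class Contact {
        public String type;
        public String name;
    }

    /** Parses address-book XML and returns one Contact per <contact> element. */
    public static List<Contact> parse(String xml) {
        final List<Contact> contacts = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private Contact current;                                // object being built
            private final StringBuilder text = new StringBuilder(); // body of current element

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                text.setLength(0);
                if ("contact".equals(qName)) {
                    current = new Contact();                // roughly addObjectCreate
                    current.type = atts.getValue("type");   // roughly addSetProperties
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (current != null && "name".equals(qName)) {
                    current.name = text.toString().trim();  // roughly addCallMethod("setName")
                } else if ("contact".equals(qName)) {
                    contacts.add(current);                  // roughly addSetNext("addContact")
                    current = null;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse address book", e);
        }
        return contacts;
    }

    public static void main(String[] args) {
        String xml = "<address-book>"
                + "<contact type=\"individual\"><name>Zane Pasolini</name></contact>"
                + "<contact type=\"business\"><name>SAMOFIX d.o.o.</name></contact>"
                + "</address-book>";
        for (Contact c : parse(xml)) {
            System.out.println(c.type + ": " + c.name);
        }
    }
}
```

Notice how the handler has to track the current object and the buffered element text by hand. Digester's pattern-based rules take care of that bookkeeping, which is why the rule list stays so short.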
There are four fundamental Lucene classes for indexing text: IndexWriter, Analyzer, Document, and Field.
The IndexWriter class creates a new index and adds documents to an existing index.
Before text is indexed, it is passed through an Analyzer. Analyzer classes are in charge of extracting indexable tokens out of text to be indexed and eliminating the rest. Lucene comes with a few different Analyzer implementations. Some of them deal with skipping stop words (frequently used words that don't help distinguish one document from the other, such as a, an, the, in, and on), for instance, while others deal with converting all tokens to lowercase letters, so that searches are not case sensitive.
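The effect of these different analysis strategies is easy to see in a small plain-JDK sketch. The class AnalysisSketch and its two methods below are hypothetical helpers that I wrote for illustration; they are not Lucene classes and only approximate the general idea of whitespace tokenization, lowercasing, and stop-word removal, using the stop words named above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-JDK illustration; these helpers are hypothetical and are not Lucene classes.
final class AnalysisSketch {

    // The example stop words mentioned in the text.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "in", "on"));

    /** Splits on runs of whitespace and keeps every token unchanged. */
    public static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Lowercases each whitespace token and drops the stop words. */
    public static List<String> lowercaseNoStopWords(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : whitespaceTokens(text)) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(lower)) {
                tokens.add(lower);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The Office on Ilica in Zagreb";
        System.out.println(whitespaceTokens(text));
        System.out.println(lowercaseNoStopWords(text));
    }
}
```

The first method keeps "The" and "Zagreb" exactly as written, so a later search would have to match their case. The second produces fewer, normalized tokens, which is the trade-off to keep in mind when choosing an Analyzer.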
An index consists of a set of Documents, and each Document consists of one or more Fields. Each Field has a name and a value. You can think of a Document as a row in an RDBMS, and Fields as columns in that row.
Let's consider a simple scenario in which we add a single contact entry with all its fields to the index. Listing 3 shows how we could do it, using the classes we just described.
Listing 3. Lucene-based address book indexer
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    /**
     * <code>AddressBookIndexer</code> class provides a simple
     * example of indexing with Lucene. It creates a fresh
     * index called "address-book" in a temporary directory every
     * time it is invoked and adds a single document with a
     * few fields to it.
     */
    public class AddressBookIndexer {
        public static void main(String args[]) throws Exception {
            String indexDir = System.getProperty("java.io.tmpdir", "tmp")
                + System.getProperty("file.separator") + "address-book";
            Analyzer analyzer = new WhitespaceAnalyzer();
            boolean createFlag = true;

            IndexWriter writer = new IndexWriter(indexDir, analyzer, createFlag);

            Document contactDocument = new Document();
            contactDocument.add(Field.Text("type", "individual"));
            contactDocument.add(Field.Text("name", "Zane Pasolini"));
            contactDocument.add(Field.Text("address", "999 W. Prince St."));
            contactDocument.add(Field.Text("city", "New York"));
            contactDocument.add(Field.Text("province", "NY"));
            contactDocument.add(Field.Text("postalcode", "10013"));
            contactDocument.add(Field.Text("country", "USA"));
            contactDocument.add(Field.Text("telephone", "1-212-345-6789"));

            writer.addDocument(contactDocument);
            writer.close();
        }
    }
What exactly is happening here? Lucene indices are stored in directories in the filesystem. Each index is contained within a single directory, and multiple indices cannot share a directory. The first parameter in IndexWriter's constructor specifies the directory where the index should be stored. The second parameter provides the implementation of Analyzer that should be used for preprocessing the text before it is indexed. The particular implementation of Analyzer that we are using here uses the whitespace character as the delimiter for tokenizing the input. The last parameter is a boolean flag that, when true, tells IndexWriter to create a new index in the specified directory, or to overwrite any existing index in that directory. A value of false instructs IndexWriter to add Documents to an existing index instead. We then create a blank Document, and add several Text Fields to it. After the Document is populated, we add it to the index through the instance of IndexWriter; finally, we close the index. Closing the IndexWriter is important, as doing so ensures that all index changes are flushed to the disk.
It is important to note that Lucene offers several types of Fields. In this example I used the Text Fields, because Lucene doesn't just index them, but also stores their original value verbatim in the index. This allows us to show all the contact fields when searching the index. To learn more about other types of Fields in Lucene, see the Resources section.
Now that you know how to use each of these tools on its own, we can combine the two classes we've written. We'll use Digester to handle XML parsing, and Lucene to handle indexing. You can see the resulting DigesterMarriesLucene class in Listing 4.
Let's look at some selections from this class in more detail. Just as we did in the AddressBookIndexer class, we need to open the Lucene index for writing using IndexWriter; we do so here in Listing 5. We pass in the path to the index directory, the Analyzer to process all data being indexed, and a createFlag that is set to true, so that a new index is created and any existing index in that directory is overwritten.
Listing 5. Opening the index for writing
    String indexDir = System.getProperty("java.io.tmpdir", "tmp")
        + System.getProperty("file.separator") + "address-book";
    Analyzer analyzer = new WhitespaceAnalyzer();
    boolean createFlag = true;

    // IndexWriter to use for adding contacts to the index
    writer = new IndexWriter(indexDir, analyzer, createFlag);
The modified addContact(Contact) method shown in Listing 6 now creates a fresh instance of the Lucene Document every time it is called. After the Document is populated with data from the Contact instance that is passed into the method, it is added to the index through an instance of IndexWriter.
Listing 6. New addContact(Contact) method adds the document to the index
    Document contactDocument = new Document();
    contactDocument.add(Field.Text("type", contact.getType()));
    contactDocument.add(Field.Text("name", contact.getName()));
    contactDocument.add(Field.Text("address", contact.getAddress()));
    contactDocument.add(Field.Text("city", contact.getCity()));
    contactDocument.add(Field.Text("province", contact.getProvince()));
    contactDocument.add(Field.Text("postalcode", contact.getPostalcode()));
    contactDocument.add(Field.Text("country", contact.getCountry()));
    contactDocument.add(Field.Text("telephone", contact.getTelephone()));
    writer.addDocument(contactDocument);
Finally, in Listing 7, at the end of the main() method, the index is optimized and closed to ensure that all Documents added to it are indeed written to the index on the disk.
Listing 7. Optimizing and closing the index
    // optimize and close the index
    writer.optimize();
    writer.close();
Now that we can create a Lucene index from a document containing address book entries encoded in XML, all we need is the ability to search that index. Lucene's API for searching is as simple as the indexing API. In the class in Listing 8, we search the index we created with the DigesterMarriesLucene class. Here, we run a query that looks for all contacts that contain the keyword "Zane" in the field called name.
Listing 8. Searching the address book index created with the Lucene indexer
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import java.io.IOException;

    /**
     * <code>AddressBookSearcher</code> class provides a simple
     * example of searching with Lucene. It looks for an entry whose
     * 'name' field contains keyword 'Zane'. The index being searched
     * is called "address-book", located in a temporary directory.
     */
    public class AddressBookSearcher {
        public static void main(String[] args) throws IOException {
            String indexDir = System.getProperty("java.io.tmpdir", "tmp")
                + System.getProperty("file.separator") + "address-book";

            IndexSearcher searcher = new IndexSearcher(indexDir);
            Query query = new TermQuery(new Term("name", "Zane"));
            Hits hits = searcher.search(query);

            System.out.println("NUMBER OF MATCHING CONTACTS: " + hits.length());
            for (int i = 0; i < hits.length(); i++) {
                System.out.println("NAME: " + hits.doc(i).get("name"));
            }

            // release the index files once we are done reading the hits
            searcher.close();
        }
    }
You can see that the IndexSearcher class is used for accessing an existing index. The argument passed to its constructor is the path to the directory where the index is stored. Lucene provides a few different query types, and TermQuery is the simplest of them. The query in the code above will find all listings that contain the term "Zane" in a field called name. The call to IndexSearcher's search(Query) method executes the search against the index and returns a collection of matching Documents in an instance of Hits.
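One detail worth keeping in mind is that a term query matches the indexed tokens exactly, and the WhitespaceAnalyzer we used for indexing does not change the case of the text. The plain-JDK sketch below illustrates that idea with a toy lookup table from terms to document numbers. The class TermLookupSketch is hypothetical and greatly simplified; it is my own illustration of exact term matching, not a description of how Lucene stores or searches its index.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-JDK illustration of exact term lookup; hypothetical class, not Lucene's implementation.
final class TermLookupSketch {

    // Maps "field:term" to the ids of the documents that contain that term.
    private final Map<String, List<Integer>> postings = new HashMap<>();

    /** Indexes one field value by splitting it on whitespace, preserving case. */
    public void add(int docId, String field, String value) {
        for (String term : value.trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            List<Integer> ids =
                    postings.computeIfAbsent(field + ":" + term, k -> new ArrayList<>());
            if (!ids.contains(docId)) {
                ids.add(docId);
            }
        }
    }

    /** Returns the ids of documents whose field contains exactly this term. */
    public List<Integer> search(String field, String term) {
        return postings.getOrDefault(field + ":" + term, Collections.emptyList());
    }

    public static void main(String[] args) {
        TermLookupSketch index = new TermLookupSketch();
        index.add(0, "name", "Zane Pasolini");
        index.add(1, "name", "SAMOFIX d.o.o.");
        System.out.println("Zane: " + index.search("name", "Zane"));
        System.out.println("zane: " + index.search("name", "zane"));
    }
}
```

In this sketch a search for "Zane" finds the first contact, while "zane" finds nothing, because the lowercase form was never stored. If you want searches that ignore case, index and query with an Analyzer that lowercases its tokens.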
While this search example is very simple, note that Lucene offers a rich set of search-related features. For instance, you can use several different types of queries with Lucene: boolean queries, wild-card queries, phrase queries, and so on. Lucene also offers the ability to search multiple indices at once, as well as the ability to search indices located on remote computers. Another useful feature is Lucene's QueryParser, which supports a powerful and user-friendly query syntax. For more information about Lucene's query syntax, see the Resources section.
You should now have a good understanding of how to use Jakarta Commons Digester to parse XML documents and how to use Jakarta Lucene to index XML documents and search the resulting index. The approach described in this article should satisfy the simple XML indexing and searching needs of most developers. You should also take a look at the Sandbox subproject of Lucene, which includes examples of indexing XML documents using SAX 2 and DOM parsers. For more complex and generic solutions, visit Lucene's contributions page, a link to which is included in Resources.
- Visit the Jakarta Lucene project home page.
- Visit the Jakarta Commons Digester project home page, as well as its parent project, Jakarta Commons.
- The Lucene project home page contains several resources on indexing with Lucene.
- For more on working with Lucene fields, check out "Introduction to Text Indexing with Apache Jakarta Lucene," also by Otis Gospodnetic (O'Reilly OnJava, January 2003).
- Find out the details of the Lucene query syntax.
- For more on indexing XML documents using SAX 2 and DOM parsers, check out the Lucene Sandbox.
- David Mertz discusses XML-specific search and indexing features in his column "XML Matters: Indexing XML documents" (developerWorks, May 2001).
- Find out about Lucene contributions.
- Find hundreds of articles about every aspect of Java programming in the developerWorks Java technology zone.
Otis Gospodnetic is an active Apache Jakarta member, a developer of Lucene, and maintainer of the jGuru's Lucene FAQ. His professional interests include Web crawlers, information gathering and retrieval, and distributed computing. Otis currently lives in New York City and can be reached at otis@apache.org.