LINQ & Lambda, Part 4: Lucene.Net

Lucene is a jewel in the open source world - it's a highly scalable, fast search engine written in Java. Its class-per-class C# cousin - Lucene.Net - is also a precious stone in the .NET universe. Over the years, Lucene.Net has gained considerable popularity and is used in a wide range of products.

C|NET, Fogbugz on Demand and Wikipedia are just a few sites that use Lucene or Lucene.Net for indexing and searching their vast content. Even Microsoft has had to contend with a competing, Lucene-powered Outlook extension that embarrasses the built-in search engine.

It's an extremely useful tool in the belt of any software engineer, not just for its design and implementation of modern information retrieval theories, but for its simple API too.

In my experience, libraries built on simple concepts have simple APIs. Lucene is no different, because the search engine concept is very simple: a search engine is a system for finding text in a large set very quickly.

At the heart of the engine is the index (similar to a database table), which contains documents (like database rows) made up of fields (like database columns). To search, you write a query and give it to the engine to find matching documents. The query language for a database is SQL; for Lucene it's a query object (you can construct complex queries by composing an object graph of query instances).

Search engines are super fast at finding text because documents are stored in an inverted index (where the text of each field is tokenized and its terms are hashed and sorted at index time). A query then looks up terms directly instead of scanning every document.
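To make the inverted index idea concrete, here is a minimal sketch of my own. It is a toy in-memory structure, not Lucene.Net's implementation, and the names InvertedIndexSketch, Tokenize, Build and Search are made up for illustration. It maps each term to the list of document ids that contain it, so a search is a lookup rather than a scan.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy inverted index for illustration only; this is not how Lucene.Net is implemented.
public static class InvertedIndexSketch
{
    // Lowercase the text and split it into terms on any non-alphanumeric character.
    public static string[] Tokenize(string text)
    {
        char[] normalized = text
            .Select(ch => char.IsLetterOrDigit(ch) ? char.ToLowerInvariant(ch) : ' ')
            .ToArray();
        return new string(normalized)
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }

    // Map each term to the ascending list of document ids that contain it.
    // A SortedDictionary keeps the terms themselves in sorted order.
    public static SortedDictionary<string, List<int>> Build(IList<string> documents)
    {
        var index = new SortedDictionary<string, List<int>>(StringComparer.Ordinal);
        for (int docId = 0; docId < documents.Count; docId++)
        {
            foreach (string term in Tokenize(documents[docId]).Distinct())
            {
                List<int> postings;
                if (!index.TryGetValue(term, out postings))
                {
                    postings = new List<int>();
                    index[term] = postings;
                }
                postings.Add(docId);
            }
        }
        return index;
    }

    // A term lookup is a single dictionary probe; no document text is scanned.
    public static IList<int> Search(SortedDictionary<string, List<int>> index, string term)
    {
        List<int> postings;
        return index.TryGetValue(term.ToLowerInvariant(), out postings)
            ? postings
            : new List<int>();
    }

    public static void Main()
    {
        var titles = new[] { "Marketing Manager", "Sales Manager", "Owner" };
        var index = Build(titles);
        var hits = Search(index, "manager");
        // prints "0, 1"
        Console.WriteLine(string.Join(", ", hits.Select(id => id.ToString()).ToArray()));
    }
}
```

The work of tokenizing happens once, at index time; the search itself never touches the original text.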

In contrast to database queries, search engines can calculate relevance scores when searching. This is because they use a querying model better suited to ranking, the vector space model, instead of the classical boolean model. In the vector space model, documents and queries are represented as vectors, and the similarity between a query and any document can be calculated with simple vector operations. Documents with a higher similarity appear higher in the results. Conversely, a database only knows whether a row meets the WHERE criteria or not and cannot compute a relevance score - this true/false classification is how the boolean model got its name.
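As a rough illustration of the vector space model, the sketch below turns each piece of text into a term-frequency vector and ranks documents by cosine similarity to the query. It is deliberately simplified: Lucene's real scoring also weighs terms by rarity and applies boosts and length normalization. The names VectorSpaceSketch, ToVector and Cosine are my own.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Simplified vector space model: raw term counts and cosine similarity.
// Lucene's actual scoring formula is more elaborate than this.
public static class VectorSpaceSketch
{
    // Represent text as a vector of term -> occurrence count.
    public static Dictionary<string, double> ToVector(string text)
    {
        char[] normalized = text
            .Select(ch => char.IsLetterOrDigit(ch) ? char.ToLowerInvariant(ch) : ' ')
            .ToArray();
        return new string(normalized)
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .GroupBy(term => term)
            .ToDictionary(g => g.Key, g => (double)g.Count());
    }

    // Cosine of the angle between two vectors: dot product over product of lengths.
    public static double Cosine(Dictionary<string, double> a, Dictionary<string, double> b)
    {
        double dot = 0;
        foreach (var pair in a)
        {
            double other;
            if (b.TryGetValue(pair.Key, out other))
                dot += pair.Value * other;
        }
        double normA = Math.Sqrt(a.Values.Sum(v => v * v));
        double normB = Math.Sqrt(b.Values.Sum(v => v * v));
        return (normA == 0 || normB == 0) ? 0 : dot / (normA * normB);
    }

    public static void Main()
    {
        var query = ToVector("marketing manager");
        var titles = new[] { "Owner", "Sales Manager", "Marketing Manager" };

        // Score every document against the query and list the best match first.
        var ranked = titles
            .Select(t => new { Title = t, Score = Cosine(query, ToVector(t)) })
            .OrderByDescending(r => r.Score);

        foreach (var r in ranked)
            Console.WriteLine("{0:F2}  {1}", r.Score, r.Title);
    }
}
```

With this data, "Marketing Manager" ranks first, "Sales Manager" second because it shares only one term with the query, and "Owner" last with a score of zero. A boolean WHERE clause could only say which rows match, not which match best.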

Search engines are rad!! And with Lucene any developer can easily add a search engine to their application.

To satiate the rampant LINQ junkie within, I've been contributing to the LINQ to Lucene project - an open source LINQ provider framework.

My recent contribution to the project makes creating indexes and searching them even easier. Today, I'm going to demonstrate some features I've added so you too can easily Lucene-ify your application.

A common use of Lucene in applications is to complement the database with a Lucene index for searching. LINQ to Lucene offers this functionality out of the box for classes generated by LINQ to SQL.

Let's start with the Northwind database. By following Scott Gu's instructions, you can create a DBML file with generated classes from the database schema. This demonstration only needs the Customer and Order tables.

[Image: LINQ to SQL]

The next step is to tell LINQ to Lucene, using attributes, which classes to index and how to index them. We do this by extending the generated Customer and Order classes and decorating them with attributes. If you don't know how to create partial classes, see Chris Sainty's post.



[Document]
public partial class Customer {
}

[Document]
public partial class Order {
}


Now we've marked these two classes as documents for indexing. The next step is to specify how each field is indexed. But, there is a bit of a problem with properties in generated classes. There is no way to add attributes to generated properties without spoiling the automatically generated files.

Our solution to this is to add the MetadataType property on the DocumentAttribute which tells Lucene to look at another class for the field attributes it needs. Like so:



[Document(MetadataType = typeof(CustomerMetadata))]
public partial class Customer {
}

[Document(MetadataType = typeof(OrderMetadata))]
public partial class Order {
}


Now we can create fake properties on our metadata types to specify field characteristics. The return type of a fake property doesn't matter; only its name has to match the real property.


public class CustomerMetadata {

    [Field(FieldIndex.Tokenized, FieldStore.Yes, IsDefault = true)]
    public object ContactName { get; set; }

    [Field(FieldIndex.Tokenized, FieldStore.Yes, IsKey = true)]
    public object CustomerID { get; set; }

    [Field(FieldIndex.Tokenized, FieldStore.Yes)]
    public object ContactTitle { get; set; }

    [Field(FieldIndex.Tokenized, FieldStore.Yes)]
    public object CompanyName { get; set; }
}

public class OrderMetadata {

    [Field(FieldIndex.Tokenized, FieldStore.Yes, IsKey = true)]
    public object OrderID { get; set; }

    [Field(FieldIndex.Tokenized, FieldStore.Yes, IsDefault = true)]
    public object CustomerID { get; set; }

    [Field(FieldIndex.Tokenized, FieldStore.Yes)]
    public object ShipName { get; set; }

    [Field(FieldIndex.Tokenized, FieldStore.Yes)]
    public object ShipAddress { get; set; }

    [Field(FieldIndex.Tokenized, FieldStore.Yes)]
    public object ShipCity { get; set; }

    [Field(FieldIndex.Tokenized, FieldStore.Yes)]
    public object ShipCountry { get; set; }
}


Now we've told Lucene.Net which properties to index and how to index them.

  • The FieldIndex property indicates whether or not the field is tokenized (i.e. split into parts). By default, this is UnTokenized to save index space.
  • The FieldStore property tells Lucene whether or not to store the original value in the index. Again, by default, this is NO to save index space.
  • IsKey should be true for the property that acts as the document's primary key. Only one property can be marked as the key, so LINQ to SQL classes with composite keys should be given a new, uniquely identifying property.
  • IsDefault tells Lucene which field is the default field for searching.
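To show why the tokenized setting matters, here is a small sketch of the difference as I understand it. It is not LINQ to Lucene code, and FieldIndexSketch, IndexTerms and Matches are made-up names. An untokenized field is indexed as one term holding the whole value, so only the exact value matches; a tokenized field is split into separate terms, so individual words match. Lowercasing the tokens is an assumption here; in Lucene that depends on the analyzer in use.

```csharp
using System;
using System.Linq;

// Illustrative only: contrasts the terms produced for a tokenized
// and an untokenized field. Not part of LINQ to Lucene.
public static class FieldIndexSketch
{
    // Return the terms that would be indexed for a field value.
    public static string[] IndexTerms(string value, bool tokenized)
    {
        if (!tokenized)
            return new[] { value }; // the whole value is a single term

        // Assumed analysis step: lowercase and split on non-alphanumerics.
        char[] normalized = value
            .Select(ch => char.IsLetterOrDigit(ch) ? char.ToLowerInvariant(ch) : ' ')
            .ToArray();
        return new string(normalized)
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }

    // A term query matches only when the query term equals an indexed term exactly.
    public static bool Matches(string[] terms, string queryTerm)
    {
        return terms.Contains(queryTerm, StringComparer.Ordinal);
    }

    public static void Main()
    {
        string title = "Marketing Manager";
        Console.WriteLine(Matches(IndexTerms(title, true), "manager"));             // True
        Console.WriteLine(Matches(IndexTerms(title, false), "manager"));            // False
        Console.WriteLine(Matches(IndexTerms(title, false), "Marketing Manager"));  // True
    }
}
```

In short, tokenize fields that people search by word (names, titles, descriptions) and consider leaving identifiers untokenized so they match only as a whole.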

Now we're ready to create the index.

To create the index from a LINQ to SQL Data Context, we use the DatabaseIndexSet like so:



var dbi = new DatabaseIndexSet(
    @"C:\index\",                  // index path
    new NorthwindDataContext()     // data context instance
);
dbi.Write();


By running this code, we'll create an index for the entire contents of the Customer and Order tables. LINQ to Lucene is smart enough to collect all the relevant data from the Northwind database, convert each row to a Lucene Document and add it to the index.

[Image: LINQ to Lucene sample indexing]

Sweet! We've got an index of the Customer and Order tables.

Now we can search for Customers or Orders.

Let's find all the Customers who are Marketing Managers...



var mmCustomers = from c in dbi.Get<Customer>()
                  where c.ContactTitle == "Marketing Manager"
                  select c;
 
Console.WriteLine("Marketing Manager Customers: Found {0}", mmCustomers.Count());

foreach (var customer in mmCustomers) {
    Console.WriteLine(customer.ContactName);
}


[Image: LINQ to Lucene sample marketing search]

This very simple example demonstrates the query equality operator, but many other types of query are possible (including wildcards, prefixes, proximities). You can find more ways to query on the LINQ to Lucene homepage.

In future posts, I'll demonstrate more complex query examples and how to supply custom formatters for properties.

To see the source of LINQ to Lucene, you can get the latest release. The sample project from this post is available for download here.

NOTE: The sample project uses the Northwind database as a data source. Northwind can be downloaded and installed to your SQL Server instance from here.


posted on Monday, July 21, 2008 1:51 AM
Comments
# re: LINQ & Lambda, Part 4: Lucene.Net
buk
7/22/2008 2:08 AM
  
"Like a database table, all documents in an index have the same structure (i.e. homogenous)."

This is incorrect - you can have documents with different fields in the same index. For example, let's say you have a Customer Service application. You can store Customers, Orders, and Order Items in the same index. Using a type field to differentiate the record / document types allows you to return all of the data types that are relevant to a query at once, or to filter them on a specific type.
# re: LINQ & Lambda, Part 4: Lucene.Net
Vijay Santhanam
7/22/2008 1:02 PM
  
That's true, but Lucene will throw an exception if you try to search documents without a field used in a query. Give it a try, it throws for me. The safest practice I found was to keep documents homogenous.

That said, I haven't tried filters with non-homogenous documents.

I'll clarify this.
# re: LINQ & Lambda, Part 4: Lucene.Net
buk
7/23/2008 3:08 AM
  
Sorry - I always seem to omit a critical detail ;). A single common field for "searchable content" enables the scenario I wrote about above. Obviously this is beyond LINQ-to-Lucene, but it enables a simpler search utility for applications. Simpler than creating an "advanced" search interface with input fields matched up to each entity / lucene document field.
# re: LINQ & Lambda, Part 4: Lucene.Net
Vijay Santhanam
7/23/2008 5:02 AM
  
@buk
Indeed, filters may support your scenario too. Rolling all information into one field is very useful for searching.

Addresses are a great example. A combined field with street number/name, postal code, suburb, city, state is more useful than separate fields. Can you throw another example at me?

LINQ to Lucene can do this with an extra property, but DAL objects/business objects tend not to be structured this way.

One idea I had was to create a separate Lucene Access Layer (with combined fields) and use Shade Tree to map between LAL and DAL.

One goal of mine is to not force developers to think about structure in the Lucene way. Although, you do raise a very common scenario.

I apologize for not being precise in my explanations of Lucene and information retrieval - but I think I'm writing for beginners. I'm not entirely sure yet.
# re: LINQ & Lambda, Part 4: Lucene.Net
Paul
8/14/2008 1:39 AM
  
Vijay,
Great tutorial! I'm fired up to try it out in a real project now that I've run through your tutorial.

A couple quick questions. The Lucene docs say that for a web app, you should 'keep your index open' across postbacks, either via the app cache or static objects or whatever. Is that true for Linq-to-Lucene as well?

Second, how often would you recommend managing re-indexing? I was contemplating writing the index on application start, and then when new objects get added, do it through a repository object which then also writes them to the index, rather than re-indexing the whole db table again.

Third, I wasn't clear from your description why I would want to tokenize, can you explain or link?

Thanks,
Paul
# re: LINQ & Lambda, Part 4: Lucene.Net
Vijay Santhanam
9/7/2008 8:04 PM
  
I answered Paul's questions in a discussion:

http://www.codeplex.com/linqtolucene/Thread/View.aspx?ThreadId=35088
# re: LINQ & Lambda, Part 4: Lucene.Net
Grd
9/12/2008 1:29 PM
  
Hi
I would like to index only a couple of columns in my table, and the indexing also takes a lot of time. Could you please point me to some links where I can see samples of using Lucene & SQL?
Thanks,
Gokul
# re: LINQ & Lambda, Part 4: Lucene.Net
Vijay Santhanam
10/22/2008 12:55 PM
  
@Grd
Only put Field attributes on the properties you want to index.