searching the world in 231 seconds

update

Please refer to the official documentation here, rather than this post.

launch’d

I’m pleased to release the new CHOW search:


CHOW search

no delta, no no no

Sphinx is so fast that we don’t run an index queue. Re-indexing all of CHOW takes 4 minutes in production:

$ time indexer --config sphinx.production.conf complete
Sphinx 0.9.7
Copyright (c) 2001-2007, Andrew Aksyonoff

collected 405482 docs, 1095.1 MB
sorted 199.0 Mhits, 100.0% done
total 405482 docs, 1095082069 bytes
total 228.087 sec, 4801154.00 bytes/sec, 1777.75 docs/sec

real    3m51.321s
user    3m24.121s
sys     0m21.656s

That’s crazy.

why

Even though Solr/Lucene was available as an in-house CNET product, we dropped it in favor of Sphinx’s simplicity.

Sphinx reads from MySQL directly, so the interoperability happens at the database level rather than in the app. This means you don’t need any indexing hooks in your models, and their lifecycle doesn’t affect the search daemon.
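
For a sense of what that looks like, here’s a minimal sketch of a Sphinx source block pointed at MySQL. The connection values and query are made up; the real sphinx.production.conf gets generated by the plugin described below:

# hypothetical source block; the plugin writes the real one
source topics
{
  type      = mysql
  sql_host  = localhost
  sql_user  = chow
  sql_pass  = secret
  sql_db    = chow_production
  sql_query = SELECT id, title FROM topics WHERE state = 0
}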

Plus, our old indexing daemon would mysteriously die and not restart. The Sphinx indexer just runs on a cronjob.
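
For example, a crontab entry along these lines is enough (the schedule and paths here are made up; --rotate builds the new index beside the old one and hot-swaps it into the running searchd):

*/30 * * * * indexer --config /opt/sphinx/etc/sphinx.production.conf --rotate complete >> /opt/sphinx/var/log/indexer.log 2>&1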

free codes

Kent Sibilev released acts_as_sphinx a while back, but I had already started on my own Rails Sphinx plugin. As it turned out, our needs were more sophisticated than acts_as_sphinx handles, so it’s good I did.

Mine is called Ultrasphinx, and features:

  • ActiveRecord-style SQL generation
    • association includes
    • field merging via GROUP_CONCAT
    • field aliasing
  • excerpt highlighting
  • runtime field-weighting
  • Memcached integration via cache_fu
  • query spellcheck via raspell
  • will_paginate compatibility
  • Google-style query parser
  • multiple deployment environments

Of course it inherits from Sphinx itself:

  • Porter/Soundex stemming
  • UTF-8 support
  • no stopwords (configurable)
  • ranged queries (for example, dates)
  • boolean operators
  • exact-phrase match
  • filters (“facets”)
  • rock-solid stability

Downsides of my plugin are:

  • API could be better
  • some features are MySQL-specific

The biggest benefit, really, is the SQL generation and index merging, which are related. The SQL generation lets you configure Sphinx via:

# In the Topic model: index topics along with their Post replies.
is_indexed(
  # columns from the topics table itself, with optional aliases
  :fields => ["title", {:field => "post_last_created_at", :as => "published_at"}, "board_id"],
  # a field joined in from a unary association, exposed as "board"
  :includes => [{:model => "Board", :field => "name", :as => "board"}],
  # n-ary association fields merged into a single "body" field via GROUP_CONCAT
  :concats => [{:model => "Post", :field => "content",
                :conditions => "posts.state = 0", :as => "body"}],
  # limit which topics get indexed at all
  :conditions => "topics.state = 0")

That is, you can :include fields from unary associations, and :concat fields from n-ary associations. For example, in this case, we are indexing all replies to a topic as part of that topic’s body.

Because the SQL is generated, we can pay careful attention to Sphinx’s field expectations (every source in a merged index has to expose the same fields) and build a merged index, which lets us rank totally orthogonal models against each other by relevance.

Sphinx does require a unique document ID for every indexed record, and in a merged index that means unique across all models. We work around this with an SQL expression that folds the alphabetical index of the model class into the ID, so it can be recovered later with a modulus.
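
Here’s a small Ruby sketch of that arithmetic, with made-up numbers (say five indexed models, with Topic alphabetically second among them); it’s not necessarily the exact expression the plugin emits, which lives inline in the generated SQL:

MODEL_COUNT = 5   # total number of indexed models (made up)
class_id    = 2   # Topic's alphabetical position among them (made up)
record_id   = 17  # the topic's primary key

doc_id = record_id * MODEL_COUNT + class_id   # => 87, unique across every model

doc_id % MODEL_COUNT   # => 2, recovers which class the hit belongs to
doc_id / MODEL_COUNT   # => 17, recovers the original primary key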

download

script/plugin install -x svn://rubyforge.org/var/svn/fauna/ultrasphinx/trunk

Documentation is here.
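
Roughly, the query side looks like this in a controller. Treat it as a sketch and check the documentation for the exact option names:

@search = Ultrasphinx::Search.new(
  :query => params[:query],       # Google-style query string
  :page  => params[:page] || 1    # plays nicely with will_paginate
)
@search.run
@search.results    # instantiated ActiveRecord objects, possibly from several models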

to the future

I don’t have much time to support this outside of our needs at CNET. So if you need something Certified and Enterprise Ready, I guess use Lucene, or maybe that French one I can’t spell.

If you need something faster, simpler, and more interesting, Sphinx + Ultrasphinx will be awesome.

Patches welcome; just ask if you want to be a committer. The support forum is here.

postscript

Who just searched for “señor fish”? Not kidding:

[eweaver@cnet search]$ rake ultrasphinx:daemon:tail
Tailing /opt/sphinx/var/log/query.log
  whole wheat pasta
  senor fish

Hopefully it’s better than the shrimp burrito I made the other day. That was kind of gross.

13 responses

  1. As a Sphinx user, I would say that Sphinx used to be poorly documented, but not any more; it is now well documented. There is also a forum.

  2. Seb: it ought to be French!

    Regarding the documentation, I mostly worked from the sample .conf file and the forum. Some big things, such as extended mode, and many little things, such as when you have to restart the daemon, the meaning of the log files, whether field order is important, what SQL types are valid, example charset_tables, and the like, were still left unexplained, especially on the official documentation page.

  3. I bet the “senor fish” guy is looking for Señor Baja, which is a mini-chain of fish-taco restaurants here in the LA area. You think that’s a silly name, but they changed their name from El Taco Nazo. Which sounds like you’re going to go there and get “No taco for you!”

  4. Okay, sorry for being so wildly off topic (I was here to read about Sphinx, honestly!). “El Taconazo” is also a Taco place near me in Tampa. It’s more commonly known as the Taco Bus:

    HOLY SHIT, THOSE ARE GOOD TACOS.

    Also, you can’t really tell in the photo, but that’s a sink on the front bumper.

  5. It seems to be a great plugin. Right now I use acts_as_solr (not so bad), since I really need the find_id_by_solr() function.

    Is it possible with Ultrasphinx to get only ids in the results array? If so, I’d switch right away. :-)
