searching the world in 231 seconds

update

Please refer to the official documentation here, rather than this post.

launch’d

I’m pleased to release the new CHOW search:


CHOW search

no delta, no no no

Sphinx is so fast that we don’t run an index queue. Re-indexing all of CHOW takes 4 minutes in production:

$ time indexer --config sphinx.production.conf complete
Sphinx 0.9.7
Copyright (c) 2001-2007, Andrew Aksyonoff

collected 405482 docs, 1095.1 MB
sorted 199.0 Mhits, 100.0% done
total 405482 docs, 1095082069 bytes
total 228.087 sec, 4801154.00 bytes/sec, 1777.75 docs/sec

real    3m51.321s
user    3m24.121s
sys     0m21.656s

That’s crazy.

why

Even though Solr/Lucene was available as an in-house CNET product, we dropped it in favor of Sphinx’s simplicity.

Sphinx reads from MySQL directly, so the interoperability happens at the database level rather than in the app. This means you don’t need any indexing hooks in your models, and their lifecycle doesn’t affect the search daemon.
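
For a sense of what that looks like, here’s a minimal sketch of a Sphinx source block pointed at MySQL. The connection values and query are made up; the real sphinx.production.conf gets generated by the plugin described below:

# hypothetical source block; the plugin writes the real one
source topics
{
  type      = mysql
  sql_host  = localhost
  sql_user  = chow
  sql_pass  = secret
  sql_db    = chow_production
  sql_query = SELECT id, title FROM topics WHERE state = 0
}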

Plus, our old indexing daemon would mysteriously die and not restart. The Sphinx indexer just runs on a cronjob.
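
For example, a crontab entry along these lines is enough (the schedule and paths here are made up; --rotate builds the new index beside the old one and hot-swaps it into the running searchd):

*/30 * * * * indexer --config /opt/sphinx/etc/sphinx.production.conf --rotate complete >> /opt/sphinx/var/log/indexer.log 2>&1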

free codes

Kent Sibilev released acts_as_sphinx a while back, but I had already started on my own Rails Sphinx plugin. As it turned out, our needs were more sophisticated than acts_as_sphinx handles, so it’s good I did.

Mine is called Ultrasphinx, and features:

  • ActiveRecord-style SQL generation
    • association includes
    • field merging via GROUP_CONCAT
    • field aliasing
  • excerpt highlighting
  • runtime field-weighting
  • Memcached integration via cache_fu
  • query spellcheck via raspell
  • will_paginate compatibility
  • Google-style query parser
  • multiple deployment environments

Of course it inherits from Sphinx itself:

  • Porter/Soundex stemming
  • UTF-8 support
  • no stopwords (configurable)
  • ranged queries (for example, dates)
  • boolean operators
  • exact-phrase match
  • filters (“facets”)
  • rock-solid stability

Downsides of my plugin are:

  • API could be better
  • some features are MySQL-specific

The biggest benefit, really, is the SQL generation and index merging, which are related. The SQL generation lets you configure Sphinx via:

# In the Topic model: index topics along with their Post replies.
is_indexed(
  # columns from the topics table itself, with optional aliases
  :fields => ["title", {:field => "post_last_created_at", :as => "published_at"}, "board_id"],
  # a field joined in from a unary association, exposed as "board"
  :includes => [{:model => "Board", :field => "name", :as => "board"}],
  # n-ary association fields merged into a single "body" field via GROUP_CONCAT
  :concats => [{:model => "Post", :field => "content",
                :conditions => "posts.state = 0", :as => "body"}],
  # limit which topics get indexed at all
  :conditions => "topics.state = 0")

That is, you can :include fields from unary associations, and :concat fields from n-ary associations. For example, in this case, we are indexing all replies to a topic as part of that topic’s body.

Because the SQL is generated, we can pay careful attention to Sphinx’s field expectations (every source in a merged index has to expose the same fields) and build a merged index, which lets us rank totally orthogonal models against each other by relevance.

Sphinx does require a unique document ID for every indexed record, and in a merged index that means unique across all models. We work around this with an SQL expression that folds the alphabetical index of the model class into the ID, so it can be recovered later with a modulus.
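
Here’s a small Ruby sketch of that arithmetic, with made-up numbers (say five indexed models, with Topic alphabetically second among them); it’s not necessarily the exact expression the plugin emits, which lives inline in the generated SQL:

MODEL_COUNT = 5   # total number of indexed models (made up)
class_id    = 2   # Topic's alphabetical position among them (made up)
record_id   = 17  # the topic's primary key

doc_id = record_id * MODEL_COUNT + class_id   # => 87, unique across every model

doc_id % MODEL_COUNT   # => 2, recovers which class the hit belongs to
doc_id / MODEL_COUNT   # => 17, recovers the original primary key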

download

script/plugin install -x svn://rubyforge.org/var/svn/fauna/ultrasphinx/trunk

Documentation is here.
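
Roughly, the query side looks like this in a controller. Treat it as a sketch and check the documentation for the exact option names:

@search = Ultrasphinx::Search.new(
  :query => params[:query],       # Google-style query string
  :page  => params[:page] || 1    # plays nicely with will_paginate
)
@search.run
@search.results    # instantiated ActiveRecord objects, possibly from several models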

to the future

I don’t have much time to support this outside of our needs at CNET. So if you need something Certified and Enterprise Ready, I guess use Lucene, or maybe that French one I can’t spell.

If you need something faster, simpler, and more interesting, Sphinx + Ultrasphinx will be awesome.

Patches welcome; just ask if you want to be a committer. The support forum is here.

postscript

Who just searched for “señor fish”? Not kidding:

[eweaver@cnet search]$ rake ultrasphinx:daemon:tail
Tailing /opt/sphinx/var/log/query.log
  whole wheat pasta
  senor fish

Hopefully it’s better than the shrimp burrito I made the other day. That was kind of gross.

13 responses

  1. As a Sphinx user, I would say that Sphinx used to be poorly documented, but not any more; it is now well documented. There is also a forum.

  2. Seb: it ought to be French!

    Regarding the documentation, I mostly worked from the sample .conf file and the forum. Some big things, such as extended mode, and many little things, such as when you have to restart the daemon, the meaning of the log files, whether field order is important, what SQL types are valid, example charset_tables, and the like, were still left unexplained, especially on the official documentation page.

  3. I bet the “senor fish” guy is looking for Señor Baja, which is a mini-chain of fish-taco restaurants here in the LA area. You think that’s a silly name, but they changed their name from El Taco Nazo. Which sounds like you’re going to go there and get “No taco for you!”

  4. Okay, sorry for being so wildly off topic (I was here to read about Sphinx, honestly!). “El Taconazo” is also a Taco place near me in Tampa. It’s more commonly known as the Taco Bus:

    HOLY SHIT, THOSE ARE GOOD TACOS.

    Also, you can’t really tell in the photo, but that’s a sink on the front bumper.

  5. It seems to be a great plugin. Right now I use acts_as_solr (not so bad), since I really need the find_id_by_solr() function.

    Is it possible with Ultrasphinx to get only ids in the results array? If so, I’d switch right away. :-)
