Rails, Xapian and acts_as_xapian

We have been wrapping up a Rails project that entails creating a searchable directory of organizations, businesses and individuals. To add full text search to the application, we decided to use Xapian as the search engine. In the past we have used Sphinx and Ferret, but for this project we decided to try Xapian because of some of the additional features it provides.

As with most full text search engines, there is a Rails plugin available, called acts_as_xapian, that makes it easy to interact with Xapian from a Rails application. Along the way we ran into a couple of hitches and had trouble finding explanations in the available documentation, so we thought we would post our experience here to help out any other Rails developers who are interested in using Xapian.

This is not intended to be thorough documentation for Xapian or acts_as_xapian, but rather a chronicle of our experience adding full text search to a Rails application.

Installing Xapian

The first step in installing Xapian is to download the latest version of xapian-core and extract it:

curl -O http://oligarchy.co.uk/xapian/1.0.10/xapian-core-1.0.10.tar.gz
tar -xzvf xapian-core-1.0.10.tar.gz

From here, we need to configure and build the Xapian library:

cd xapian-core-1.0.10
./configure
make
sudo make install

Once this is complete, we need to download the Xapian bindings, which will let Xapian interact with Ruby, PHP and other scripting languages:

curl -O http://oligarchy.co.uk/xapian/1.0.10/xapian-bindings-1.0.10.tar.gz
tar -xzvf xapian-bindings-1.0.10.tar.gz

Again, now that we have downloaded them, we need to configure and build the Xapian-bindings:

cd xapian-bindings-1.0.10
./configure
make
sudo make install

Once that finishes, Xapian will be installed on your system. Of course, these instructions are for installing from source. If your system has aptitude or another package manager, you can install Xapian that way instead. For example, with aptitude, start by searching for the available Xapian packages:

aptitude search xapian

You need the following two packages to complete the install:

libxapian15
libxapian-ruby1.8 - Xapian search engine interface for Ruby 1.8

Finally, enter the following commands to install both packages:

sudo aptitude install libxapian15
sudo aptitude install libxapian-ruby1.8

For more detailed instructions, visit http://xapian.org/docs/install.html.

Installing the acts_as_xapian Rails plugin

acts_as_xapian is a plugin that allows your Rails application to interact with the Xapian search engine. To install the plugin into your Rails application, run the following from the application root:

script/plugin install git://github.com/frabcus/acts_as_xapian.git

Once the plugin is installed, run the migrations to add the required tables to your database:

script/generate acts_as_xapian
rake db:migrate

Setting Up The Models

In order to be able to search for a particular model using Xapian, you will need to index that model. For example, if we had a model called Post, which we want to be searchable, we would add the following line of code to app/models/post.rb:

acts_as_xapian :texts => [:title, :content]

As you can see, the acts_as_xapian method accepts a hash of options, and the :texts option lists the fields that you want to include in the search index. In this case, we want the title and content fields of our Post model to be indexed. In other words, when we do a search with Xapian on our Post model, we expect it to return results that have our search term located somewhere in the title or content fields.

There are other options that you can pass to the acts_as_xapian method, which we will look at a bit later.

Building and Updating the Xapian Indexes

Xapian indexing takes place “offline,” which means that the index is not automatically updated after a model is changed. While this may change in the future, currently the index must be built and updated periodically, either by hand or by a cron job.

Once you have set up the models that you would like to index for searching (in our case it’s only the Post model), you will need to build the index. Issue the following command, replacing Post with a comma-separated list of the models that you want to index:

rake xapian:rebuild_index models="Post"

Occasionally, this index will need to be updated. For example, let’s say that you have added a new post, or that you have made some changes in an existing post. Simply saving these changes to the database will not update the search index. If you have added a new post, you will want to make sure that it shows up in future searches, so you will need to update the Xapian index by running the following rake task:

rake xapian:update_index

Updating the index manually might be fine for development, but in a production environment you will want to have your index updated periodically by a background process, cron job or daemon.

Also, don’t forget that both of these commands default to the development environment, so be sure to add RAILS_ENV=production when you run the rake tasks in your production environment.
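
As a rough sketch of how a cron job or background script might assemble these commands, the plain Ruby helper below builds the command line for either rake task. The helper name xapian_rake_command and its arguments are our own invention for illustration and are not part of acts_as_xapian; only the task names, the models variable and RAILS_ENV come from the commands above.

```ruby
# Hypothetical helper (not part of acts_as_xapian): builds the shell command
# for one of the rake tasks described above. It only assembles a string and
# does not run anything or talk to Xapian.
def xapian_rake_command(task, rails_env, models = nil)
  parts = ["rake", "xapian:#{task}"]
  # rebuild_index expects the models to index; update_index does not need them
  parts << %(models="#{models}") if models
  # Without RAILS_ENV the tasks run against the development environment
  parts << "RAILS_ENV=#{rails_env}"
  parts.join(" ")
end

puts xapian_rake_command("update_index", "production")
puts xapian_rake_command("rebuild_index", "production", "Post")
```

The resulting strings could be dropped into a crontab entry or passed to system from a scheduled script, whichever suits your deployment.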

Searching With Xapian

Now that the index has been created, and our models are all set up, we can begin to run searches using Xapian. In our example, we are going to be searching through the title and content fields of the Post model to find results to our query. Our controller code to do this search looks something like the following:

search = ActsAsXapian::Search.new([Post], query, :limit => 25)

As you can see, we are searching our Post model to find results to query, and we would only like 25 records returned. We now have a search variable to work with. However, to get at the actual posts that are returned from our search, which are presumably relevant to our query, we need to collect them from the search results:

@posts = search.results.collect {|p| p[:model]}
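
To make the shape of that collection step concrete without a running Xapian index, here is a small sketch in plain Ruby. The results array is a stand-in for search.results, and FakePost and models_from are hypothetical names used only for this example. The nil entry is an assumed scenario, included because an index entry can outlive its database record.

```ruby
# Stand-in for a Post record; a real app would use the ActiveRecord model.
FakePost = Struct.new(:title)

# Pulls the model out of each result hash and drops any nil entries,
# which guards against results whose record could not be loaded.
def models_from(results)
  results.collect { |r| r[:model] }.compact
end

# Stand-in for search.results: an array of hashes with a :model key.
results = [
  { :model => FakePost.new("Hello Xapian") },
  { :model => nil }, # assumed: an index entry with no matching record
  { :model => FakePost.new("Full text search") }
]

posts = models_from(results)
puts posts.collect { |p| p.title }.inspect
```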

Similar Results and Spelling Corrections

Xapian has a couple of features that make it very attractive as a Rails full text search option. One such feature is the ability to return similar results. I find that this comes in handy if there aren’t very many results to display. For example, if there are fewer than three search results to display, we might decide to present our visitor with a couple of similar results, to help them find information that they might be interested in.

To get similar results, we need to create a similar object in much the same way that we created our search object. The only difference is that this time we are creating an ActsAsXapian::Similar instance, rather than the Search instance that we created above. To get similar results for our Post model, we use the following controller code:

similar = ActsAsXapian::Similar.new([Post], @posts, :limit => 25)

Notice that we are passing in the list of results from our initial search (@posts), and again limiting the results to 25. As before, we need to collect the results into something that we can work with:

@similar_posts = similar.results.collect {|p| p[:model]}
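
The "only show similar results when the main list is short" idea can be sketched in plain Ruby as follows. The method name similar_if_sparse and the MIN_RESULTS constant are hypothetical, and the block stands in for the ActsAsXapian::Similar lookup shown above, so nothing here touches Xapian.

```ruby
MIN_RESULTS = 3 # threshold mentioned in the text; adjust to taste

# Runs the block (the similar-results lookup) only when the main
# result list has fewer than `minimum` entries.
def similar_if_sparse(posts, minimum = MIN_RESULTS)
  return [] if posts.size >= minimum
  yield(posts)
end

few_posts  = ["post A", "post B"]
many_posts = ["post A", "post B", "post C", "post D"]

# The block's return value stands in for the collected similar posts.
puts similar_if_sparse(few_posts)  { |found| ["post X", "post Y"] }.inspect
puts similar_if_sparse(many_posts) { |found| ["post X", "post Y"] }.inspect
```

In a controller, the block body would create the Similar instance and collect its models, so the extra lookup is skipped whenever the main search already returned enough posts.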

Another great feature of Xapian is that it returns potential spelling corrections for query, based on data in the index. Spelling corrections are returned automatically when we perform our search, and are contained in the spelling_corrections method of our search object. We can extract spelling corrections easily with the following controller code:

@corrections = search.spelling_corrections

Filtering Results Using Terms

Remember how I said that there were more options that we could pass to the acts_as_xapian method? Well, one of them describes “terms” that we can use to filter results. Let’s say that we wanted users to be able to search for all posts in a blog that reference a particular city. In our database, each post record has a city field that holds the city the post refers to. We want to be able to search all the records that refer to a particular city to see which ones match our search terms.

We start by modifying our Post model, adding the :terms option to our acts_as_xapian method:

acts_as_xapian :texts => [:content, :title],
               :terms => [[:city, 'C', "city"]]

As you can see, the :terms option accepts an array of triples: [:field, 'unique character', "prefix"]. You can add as many terms as you would like to the index, as long as each one has a unique character to identify it. Remember, you will have to rebuild the index in order for your changes to appear, and it probably wouldn’t hurt to restart your application server just to be sure.
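
Since each term needs its own identifying character, a quick sanity check on the list can save a confusing rebuild. The sketch below is plain Ruby and does not depend on the plugin. The method name duplicate_term_characters and the :status and :country fields are made up for illustration.

```ruby
# Triples in the same [:field, 'unique character', "prefix"] shape as above.
terms = [
  [:city,   'C', "city"],
  [:status, 'S', "status"] # :status is a hypothetical field
]

# Returns the identifying characters that are used by more than one term.
def duplicate_term_characters(terms)
  counts = Hash.new(0)
  terms.each { |_field, character, _prefix| counts[character] += 1 }
  counts.keys.select { |character| counts[character] > 1 }
end

puts duplicate_term_characters(terms).inspect
# Adding a second term that reuses 'C' is reported as a clash.
puts duplicate_term_characters(terms + [[:country, 'C', "country"]]).inspect
```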

Now, using a syntax that might seem familiar, we can search within the Post model for all records that refer to a particular city by using the following query:

city:Portland keyword

Here, “keyword” stands for our search terms. In other words, Xapian will now search for all records that match “keyword” and also have a value of “Portland” in the city field. Not bad. This can be used for a variety of purposes, such as filtering results (showing only results that are ‘active’ or ‘published’) and also to provide your users with a way of fine-tuning their searches (think ‘site:’ in Google).
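
If the filter comes from your own code rather than from what the visitor typed (for example, a city chosen from a dropdown), the query string can be assembled before it is handed to the search. This is a minimal sketch; prefixed_query is a hypothetical helper, and it assumes the filter value is a single word with no spaces.

```ruby
# Builds a query of the form "prefix:value keywords".
# Assumes `value` is a single word; multi-word values are not handled here.
def prefixed_query(prefix, value, keywords)
  "#{prefix}:#{value} #{keywords}".strip
end

query = prefixed_query("city", "Portland", "keyword")
puts query
```

The resulting string would then take the place of query in the ActsAsXapian::Search.new call shown earlier.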

If you are interested in using Xapian to search with associations, take a look at this post and this post from the ProjectX Blog, which allow you to run searches like:

current_user.posts.find_with_xapian @query

Sorting and Collapsing Results With Values

Xapian gives us the ability to sort or collapse the results based on a field in the model that has a sortable range. The documentation states that the field can be of the type :string, :number, or :date. Once again, we need to add another option to the acts_as_xapian method in our Post model. In this case, we want to add the ability to sort by the day the post was created, so we would make the following change:

acts_as_xapian :texts => [:content, :title],
               :values => [[:created_at, 0, "created_at", :date]]

Keep in mind that we can use values and terms in the same model, but I am keeping it simple here. Just like before, we are passing a series of arrays to the :values option, where each is a quadruple: [:field, unique identifier, "prefix", data_type]. Here the unique identifier is an integer (0 in our example). Just like with :terms, we can add multiple :values by simply passing in more quadruples, each separated by a comma. Once again, we’ll want to rebuild our index in order for our changes to be picked up. Now we can sort our search results based on particular values, such as a creation date:

search = ActsAsXapian::Search.new([Post], query, :limit => 25, :sort_by_prefix => "created_at")

The :values option also gives us the ability to collapse our search results, which means to only show one result for each of the :values that we specify.
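
To illustrate what collapsing means, independent of how the plugin implements it, here is a plain Ruby sketch that keeps only the first result for each distinct value. The collapse_by method and the sample data are hypothetical, and the list is assumed to be in ranked order already.

```ruby
# Each hash stands in for one search result, already in ranked order.
results = [
  { :title => "First post",  :city => "Portland" },
  { :title => "Second post", :city => "Portland" },
  { :title => "Third post",  :city => "Seattle" }
]

# Keeps the first result seen for each distinct value of `key`
# and drops the later ones.
def collapse_by(results, key)
  seen = {}
  results.select do |result|
    value = result[key]
    first_time = !seen.key?(value)
    seen[value] = true
    first_time
  end
end

collapsed = collapse_by(results, :city)
puts collapsed.collect { |r| r[:title] }.inspect
```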

Quirks We’ve Noticed

There are a couple of quirks that we have noticed here along the way, and we thought that sharing them might be the best way to help other people make sense of them.

If you are using terms as a way to filter your result set, they are not updated when running the update_index rake task. However, if you rebuild the index using the rebuild_index rake task the terms are updated and the filtering will work correctly.

On a similar note, if you delete data and just do an update you may get an error. I deleted data, updated the index and then did a search that matched the deleted record and I got a nil object error. Rebuilding the index instead of updating resolved the error.

When using the spelling correction, I kept rebuilding the index because it did not seem to be filtering correctly using the terms I had provided. It turns out that this is a current limitation: the Xapian documentation states that “currently, spelling correction ignores prefixed terms”. In other words, spelling corrections are drawn from the entire search index, and are not limited to the terms that you might be using for your search.

I was testing sort_by_prefix on the created_at field and it did not seem to be sorting at all. After digging around for a while, I realized that when sorting by dates, only the date is compared, not the time. My problem stemmed from the fact that my sample data was all created on the same day, so every record had the same sort value and the order never changed.
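
The effect is easy to reproduce in plain Ruby. The sketch below assumes that a :date value is reduced to a day-level key; the "%Y%m%d" format and the date_sort_key name are our assumptions for illustration, not something confirmed by the plugin's documentation.

```ruby
# Assumption: a :date value is reduced to a day-level key such as "20090130",
# so the time of day is discarded before sorting.
def date_sort_key(time)
  time.utc.strftime("%Y%m%d")
end

morning  = Time.utc(2009, 1, 30, 8, 0, 0)
evening  = Time.utc(2009, 1, 30, 20, 0, 0)
next_day = Time.utc(2009, 1, 31, 8, 0, 0)

# Two posts from the same day share a key, so sorting cannot separate them.
puts date_sort_key(morning) == date_sort_key(evening)
# Posts from different days get different keys and sort as expected.
puts date_sort_key(morning) < date_sort_key(next_day)
```

When testing date sorting, it helps to spread the sample records over several days.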

Brian Getting

Published by Brian Getting on Friday, January 30, 2009.