Sphinx index and MySQL

Sphinx index and MySQL

The problem

I’ve been trying to optimize multiple MySQL databases to perform full-text searches. My first attempt was using MyISAM table engine with full-text index support (at this moment the only engine with full-text search compatibility). At the very beginning this was acceptable (performance-wise). But as soon the data on the table was growing the first issues appears.

A MyISAM full-text index operates on a full-text collection (made up of one or more character columns from a single table). In effect, MySQL builds the index by concatenating the columns in the collection and indexing them as one long string of text.

One of the limitations is that unlike normal queries, the full-text search results are automatically ordered by relevance. MySQL cannot use an index for sorting when you perform a full-text search. Therefore, you shouldn’t specify an ORDER BY clause if you want to avoid a filesort.

On the other hand the problem with large datasets is if the whole full-text indexes doesn’t fit in memory, you will have a lot of I/O disk operations in order to find and update data into the index. MyISAM doesn’t perform well doing phase searches, because it doesn’t store the words together, in fact if you want to search for “nice cat” it should search for “nice” then “cat” and at the end sorting those results again to find those two words together, plus, if the index doesn’t fit in memory you will be in trouble.

As I said before, you shouldn’t use ORDER BY clause, or even worse, you shouldn’t combine full-text indexes with standard B-Tree indexes. Lets put this into an example. Suppose that we want to search for books containing “nice cat” of a given author, let say author_id = 4. In this particular case you should combine both indexes, but what MySQL will do is to use a full table scan to get the results.

The solution

Combining the limitations of MyISAM and the requirements of today applications a new index technique is mandatory. I’ve been looking for the right solution for my application and one of the best one was Sphinx.

To be continue… keep in touch!

6 Comments

  1. Hi,

    Its a nice article.

    What is best Solr or Sphinx ?

    I have not been figuring it out whether to user Solr or Sphinx ? I’m working with the very high traffic website which is fully based on Ajax and php-mysql.

    So a little help is appreciated.

    Many thanks,
    Sachin Pethani

  2. Hi Sachin,

    My experience is based on sphinx only. But I can tell you that I’m working with more than 5 Million records and plenty of text. I have a dedicated server (Amazon Cloud) and works like a charm. As you say, I’m using php/mysql application.
    Let me know if you need any help in this topic.

    Good luck!
    Angel.

  3. Hi Angel

    thanks for your prompt reply.

    Well, so far we have selected Solr to be used as full text search engine.

    I have been working with zend framework based application.

    About this search engine, I got couple of questions in my mind.

    1) Is that good practice to re-index the Solr for every updates made in MYSQL database or do it periodically rather than instantly by setting some cron job ?

    i.e. if the new text has added in the database then at the same time ping Solr to re-index whole data for that recently added record ? I mean, is there any performance bottleneck of this approach ?

    2)should i display the content directly from the Solr or first fetch the ids from the Solr and then feed those ids to mysql to fetch the relevant records? Which one do you prefer ?

    Assume that in above cases, Solr equals to Sphinx then what are your feedback about 1 and 2.

    Thanks for your help

    Regards,
    Sachin

  4. Hi Sachin, welcome back!

    Ok, assuming Solr equals to Sphinx I’m going to answer you questions:

    1) Depends on your applications needs. For instances, if you want to allow your users to search on a fresh index every time, you should re-index your index as quick as possible, but as you said, this could be a bottleneck for your application. Lets think on a high load application, with very frequent data INSERT nor UPDATES. That will lead you to update constantly your indexes, and that’s not good. First you need to fetch your database to get the records and re-index them, or you could get only the new records (if you only have INSERTs) and using a Delta-Main config.
    What I’m doing for my application is using crontab. For some indexes I run it every hour, for others every day. And is working well for me.

    2) I’m getting only the Id´s and using the “id IN (z,x,y)” sql instruction to get them. This is a good approach since the field id is auto-incremental PRIMARY KEY and its execute like a charm.

    Let me know if you want to know something else!
    Regards,
    Angel.

  5. Kennith Leming |

    Hey nice blog, i have bookmarked the site for later use, thanks.

Leave a Reply