Wednesday, August 12, 2009

Lucene vs Sphinx - A Showdown on a Large Dataset

There has long been a battle between the "pro Java" and "pro C" developer communities. Not that I'd want to be strictly associated with either, but I would always say that on the robustness/exception-handling/stability vector, Java has a better cosine value than C [with all respect to C for being the progenitor of Java]. But let me not go off on a tangent and get to the core of this post. The last few weeks were spent benchmarking 'Java' Lucene as a search engine against 'C' Sphinx. Neither was used in exactly its vanilla form; after a fair number of modifications to both, we finally ran a large set of tests on both engines.

Keeping a common playground with the following specifications:
Processor(s) : Intel Quad Core X 2
RAM : 24G
Operating System : RHEL 32 Bit
Document Corpus : 18 million+ Documents
Source Size : 90G [RDBMS Table]


The corpus of approximately 18 million records was indexed, with not all of them stored. The test queries were multi-field queries with varying boost values and a fair level of complexity; a rough sketch of that kind of query is shown below, followed by the result sheet.
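To give a feel for the query shape, here is a sketch against the Lucene 2.3-era API. The field names, terms, and boost values are made up for illustration, since the actual benchmark queries aren't published here.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.TermQuery;

public class BoostedMultiFieldQuery {
    public static void main(String[] args) {
        // Hypothetical fields: "title", "body", "tags".
        TermQuery title = new TermQuery(new Term("title", "engine"));
        title.setBoost(3.0f);                       // title matches weighted highest

        PhraseQuery body = new PhraseQuery();
        body.add(new Term("body", "search"));
        body.add(new Term("body", "engine"));
        body.setBoost(1.5f);                        // phrase match in the body, lower boost

        TermQuery tag = new TermQuery(new Term("tags", "lucene"));

        BooleanQuery query = new BooleanQuery();
        query.add(title, BooleanClause.Occur.SHOULD);
        query.add(body, BooleanClause.Occur.SHOULD);
        query.add(tag, BooleanClause.Occur.MUST);   // at least the tag clause must match

        System.out.println(query);                  // prints the assembled query string
    }
}
```

Here is the result sheet: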

Lucene
Index Size : 20G

Concurrency : 30 [5*6 Daemons, 2G each]
Total Searches : 64931
Slow Query Count (>=10 secs) : 3803 ( 5.86%)
Total Duration (secs) : 238094.574
Mean Duration : 3.667
Mode Duration : 0.835
Minimum Duration : 0.001
Maximum Duration : 1174.757
Duration Standard Deviation : 15.441
Duration (secs) Bins :-
[0,0.25) : 2770 ( 4.27%)
[0.25,0.5) : 5666 ( 8.73%)
[0.5,1) : 13515 (20.81%)
[1,1.5) : 10928 (16.83%)
[1.5,2) : 7330 (11.29%)
[2,3) : 8476 (13.05%)
[3,5) : 7222 (11.12%)
[5,10) : 5221 ( 8.04%)
[10,20) : 2335 ( 3.60%)
[20,+inf) : 1468 ( 2.26%)

Concurrency : 5 [1 Daemon * 2G]
Total Searches : 225906
Slow Query Count (>=10 secs) : 972 ( 0.43%)
Total Duration (secs) : 186700.646
Mean Duration : 0.826
Mode Duration : 0.003
Minimum Duration : 0.001
Maximum Duration : 467.647
Duration Standard Deviation : 2.864
Duration (secs) Bins :-
[0,0.25) : 64621 (28.61%)
[0.25,0.5) : 58947 (26.09%)
[0.5,1) : 56894 (25.18%)
[1,1.5) : 19836 ( 8.78%)
[1.5,2) : 9397 ( 4.16%)
[2,3) : 7941 ( 3.52%)
[3,5) : 4810 ( 2.13%)
[5,10) : 2488 ( 1.10%)
[10,20) : 684 ( 0.30%)
[20,+inf) : 288 ( 0.13%)



Sphinx
Index Size: 60G

Concurrency: 30
Total Searches : 244431
Slow Query Count (>=10 secs) : 27479 (11.24%)
Total Duration (secs) : 1243474.213
Mean Duration : 5.087
Mode Duration : 0.007
Minimum Duration : 0.001
Maximum Duration : 1869.063
Duration Standard Deviation : 17.833
Average Queries : 2.783
Duration (secs) Bins :-
[0,0.25) : 51186 (20.94%)
[0.25,0.5) : 27798 (11.37%)
[0.5,1) : 32372 (13.24%)
[1,1.5) : 20490 ( 8.38%)
[1.5,2) : 16915 ( 6.92%)
[2,3) : 21833 ( 8.93%)
[3,5) : 23550 ( 9.63%)
[5,10) : 22808 ( 9.33%)
[10,20) : 14540 ( 5.95%)
[20,+inf) : 12939 ( 5.29%)

Concurrency: 5
Total Searches : 226528
Slow Query Count (>=10 secs) : 9895 ( 4.37%)
Total Duration (secs) : 453296.517
Mean Duration : 2.001
Mode Duration : 0.007
Minimum Duration : 0.001
Maximum Duration : 164.713
Duration Standard Deviation : 4.543
Average Queries : 2.773
Duration (secs) Bins :-
[0,0.25) : 71001 (31.34%)
[0.25,0.5) : 36500 (16.11%)
[0.5,1) : 32799 (14.48%)
[1,1.5) : 20416 ( 9.01%)
[1.5,2) : 16385 ( 7.23%)
[2,3) : 13951 ( 6.16%)
[3,5) : 13330 ( 5.88%)
[5,10) : 12251 ( 5.41%)
[10,20) : 7563 ( 3.34%)
[20,+inf) : 2332 ( 1.03%)

For the dataset analyzed, Lucene was found to win convincingly over its rival.
More details on the same to come soon!

P.S.: Though Lucene works great for a lot of cases, so does Sphinx; here, Lucene seemed to have the upper hand.

9 comments:

simonw said...

I wonder if you can publish the various modifications you did to either of those? If not, would it be possible to run the benchmarks against the vanilla versions? Did you use an already released version of Lucene or did you do modifications against the trunk? I found it a little hard to read - putting it into a table would make a little more sense to me. I personally would be interested in what kind of analysis you applied to the indexed documents and the queries, though.

Anonymous said...

Worthless without the source code.

Anshum said...

@Simon: It'd be a little difficult to publish the various modifications, though, as per my mail on the group, running a benchmark against the vanilla versions is on my 'to do' list.
As for the Lucene version, I used a modified Lucene 2.3.1 and an already released version of Sphinx (0.9.9).
About putting the data in a table - I'll do that soon.
The queries were complex boolean phrase queries with query-level boosting.
The analyzer was a custom analyzer blending a whitespace analyzer with a stop word list (roughly along the lines of the sketch after this comment).
I'll add more details in my next post.

@Anonymous: It's a little difficult to publish that code as of now, though at a later stage that code might make a good contrib entry.
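For anyone curious about the analyzer mentioned above, a minimal sketch against the Lucene 2.3-era API could look like the following. The stop word list is taken as a constructor parameter, since the actual list isn't published.

```java
import java.io.Reader;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

// Whitespace tokenization followed by stop-word removal.
public class WhitespaceStopAnalyzer extends Analyzer {
    private final Set stopWords;

    public WhitespaceStopAnalyzer(String[] words) {
        this.stopWords = StopFilter.makeStopSet(words);
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new StopFilter(new WhitespaceTokenizer(reader), stopWords);
    }
}
```

The same analyzer instance would be handed to both the IndexWriter and the query parser, so that indexing and searching tokenize text the same way.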

Anonymous said...

I think Lucene is more careful about fancy index encoding than Sphinx, and it also uses caching extensively yet judiciously. Sphinx, on the other hand, is much faster at indexing and has an important feature that Lucene lacks, which is "inline" stored fields that are inserted directly into the inverted index alongside the document ID each time that ID appears in the index. Since each ID can appear many times, that bloats up the index size, but the advantage is that you now receive the stored fields at the same time as the doc numbers at the sequential speed of the disk, without additional seeking.

For faceted search, when there are too many facet values to fit in memory, this should be able to give a huge speedup, so I've been wanting to try Sphinx for that feature alone. (I'm currently using Solr, resorting to one trick after another to make the stored field table fit in RAM. The new field cache scheme is worth trying, but my expectation is that the random access patterns to the facets may miss the cache rather often.)

Lucene 2.9 (nightly builds) appears to be quite a bit faster than older versions, I think in part due to use of mmap or AIO instead of endless file handle operations in the index.
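As a side note on the mmap point above, the 2.9-era API reworked how an index directory is opened. A minimal sketch, assuming that API, with a made-up index path and field name:

```java
import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class OpenIndexExample {
    public static void main(String[] args) throws Exception {
        // Let Lucene pick a platform-appropriate FSDirectory implementation;
        // an mmap-based MMapDirectory can also be constructed explicitly on the
        // same path with the 2.9-era constructors. The path is a placeholder.
        FSDirectory dir = FSDirectory.open(new File("/path/to/index"));
        IndexReader reader = IndexReader.open(dir, true);   // read-only reader
        IndexSearcher searcher = new IndexSearcher(reader);

        TopDocs hits = searcher.search(new TermQuery(new Term("body", "lucene")), 10);
        System.out.println("total hits: " + hits.totalHits);

        searcher.close();
        reader.close();
    }
}
```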

Anonymous said...

Maybe it's that the 60GB index size is far more than the 24GB of RAM.
Just a guess.

Anshum said...

That's correct. Lucene's indexes are hence much more compressed as compared to Sphinx's - almost a third of the size.
Agreed that in cases where facet values are involved, Sphinx would have an upper hand, but at the same time, in all other cases Lucene kind of rules. Even in cases where you'd want to use both, you'd have a slower Sphinx search for the non-faceted cases on the same index. Is that really desirable?
As per my latest post, the decision here is about what it is that you want.. Lucene is a good option, and so is Sphinx, but only if you want the stuff that it's good at.

Anshum said...

Maybe 60G is a lot more than the available RAM.. but again, the benchmark was really for 'X' documents containing that amount of data. It's the index structure of Sphinx that plays spoilsport here (and I guess in all cases of large datasets).

Vijay Kumar said...

I do have one doubt.. we need to index somewhere around 5 million records and search them. As of now we are using Sphinx in PHP. I am trying Lucene now, but indexing is taking more time in Lucene.. can you share some thoughts to increase the performance? You can reach me at vijaykumar.ravva@gmail.com
Thanks
Vijay

Vijay Kumar said...

Anshum Gupta,
We need to index 20-30 million records daily. As of now Lucene is indexing a million records in 15 mins, but it doesn't match the speed we want. Could you please let me know any optimization techniques to speed up my query? It's a straightforward selection from the database. I think the indexing itself is taking very little time; the accessing part is taking more time. Any inputs are highly appreciated. Could you please mail me at vijaykumar.ravva@gmail.com. Thanks, Vijay