Saturday, August 29, 2009

Generic Code? Extensible Code?

It is a question that pops up every time I (at least) write code: 'How generic should this be?' By generic, I mean the ability to (re)use the same piece of code without changing anything 'inside' it, only a configuration file (XML or whatever the implementation choice is).

More often than not I end up trying to write code so generic that the purpose it was primarily built for (whatever the application) becomes complicated. True, everything now lives in the conf file, but writing that conf file is itself such a task that the only people who would rather write it than rewrite application-specific code are the 'programming challenged'.
I've realized that if just a few questions are answered before writing such generic code, the developer/designer would be far more at ease:
* Who asked for it?
* Would someone else ever use it? Really? Or is it merely an assumption that someday the world will run on it?
* Assuming the world might run on it someday, do I need to write code for all of that right now? Can I just write what I want, optimize it for what is required at the moment and a little more, and then let it be? Along the lines of the early days of the internet [design it now, and let it be corrected as it goes, with future users correcting it themselves].

There are many other questions that should be answered before attempting to write 'the universal machine'. Most attempts at writing code drift towards a universal machine that would do all we can think of, all we can imagine, and all that the machine itself might be able to imagine years from now! :)
Let's write for 'now' and design it well.
Let them extend it rather than configure it...
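To make 'extend rather than configure' concrete, here is a minimal, hypothetical Java sketch (RecordFilter, NonEmptyFilter and MinLengthFilter are illustrative names I made up, not from any real project): expose a small interface with a sensible default that solves today's problem, and let tomorrow's user subclass it instead of feeding a sprawling conf file to a generic engine.

```java
// Hypothetical sketch only: these names are illustrative, not from any real project.
interface RecordFilter {
    boolean accept(String record);
}

// The behaviour we actually need today, written simply and well.
class NonEmptyFilter implements RecordFilter {
    public boolean accept(String record) {
        return record != null && record.trim().length() > 0;
    }
}

// Tomorrow's user extends the code instead of editing a sprawling conf file.
class MinLengthFilter extends NonEmptyFilter {
    private final int minLength;

    MinLengthFilter(int minLength) {
        this.minLength = minLength;
    }

    public boolean accept(String record) {
        return super.accept(record) && record.trim().length() >= minLength;
    }
}
```

The point is not the toy filter itself; it is that the 'generic' part now lives in code the next developer can read, test and override, instead of in a conf dialect only its author understands.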

Wednesday, August 12, 2009

Lucene vs Sphinx - A Showdown on a Large Dataset

There has long been a battle between the "pro java" and "pro c" developer communities. Not that I'd want to be strictly associated with either, but I'd always say the robustness/exception-handling/stability vector has a better cosine similarity with Java than with C [with all due respect to C for being the progenitor of Java]. But let me not go off on a tangent and get to the core here. The last few weeks were spent benchmarking 'java' Lucene as a search engine against 'C' Sphinx. Not exactly in their vanilla forms, and with a lot of modifications to both, we finally ran a lot of tests on both engines.

Keeping a common playground with the following specifications:
Processor(s) : Intel Quad Core X 2
RAM : 24G
Operating System : RHEL 32 Bit
Document Corpus : 18 million+ Documents
Source Size : 90G [RDBMS Table]


The corpus of approximately 18 million records was indexed, with not all of the fields stored. The queries were multi-field queries with varying boost values and a good level of complexity; a rough sketch of such a query is shown below, followed by the result sheet.
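For a sense of what 'multiple field queries with varying boost values' looks like on the Lucene side, here is a hedged sketch using the Lucene 2.x/3.x-era API of the time (TermQuery, Query.setBoost, BooleanQuery); the field names and boost values are hypothetical, not our actual query code.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

// Illustrative only: "title", "tags" and "body" are hypothetical field names.
public class BoostedSearch {
    public static TopDocs search(IndexSearcher searcher, String word) throws Exception {
        TermQuery title = new TermQuery(new Term("title", word));
        title.setBoost(4.0f);                        // strongest signal
        TermQuery tags = new TermQuery(new Term("tags", word));
        tags.setBoost(2.0f);
        TermQuery body = new TermQuery(new Term("body", word)); // default boost 1.0

        BooleanQuery query = new BooleanQuery();
        query.add(title, BooleanClause.Occur.SHOULD);
        query.add(tags, BooleanClause.Occur.SHOULD);
        query.add(body, BooleanClause.Occur.SHOULD);

        return searcher.search(query, 10);           // top 10 hits
    }
}
```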

Lucene
Index Size : 20G

Concurrency : 30 [5*6 Daemons, 2G each]
Total Searches : 64931
Slow Query Count (>=10 secs) : 3803 ( 5.86%)
Total Duration (secs) : 238094.574
Mean Duration : 3.667
Mode Duration : 0.835
Minimum Duration : 0.001
Maximum Duration : 1174.757
Duration Standard Deviation : 15.441
Duration (secs) Bins :-
[0,0.25) : 2770 ( 4.27%)
[0.25,0.5) : 5666 ( 8.73%)
[0.5,1) : 13515 (20.81%)
[1,1.5) : 10928 (16.83%)
[1.5,2) : 7330 (11.29%)
[2,3) : 8476 (13.05%)
[3,5) : 7222 (11.12%)
[5,10) : 5221 ( 8.04%)
[10,20) : 2335 ( 3.60%)
[20,+inf) : 1468 ( 2.26%)

Concurrency : 5 [1 Daemon * 2G]
Total Searches : 225906
Slow Query Count (>=10 secs) : 972 ( 0.43%)
Total Duration (secs) : 186700.646
Mean Duration : 0.826
Mode Duration : 0.003
Minimum Duration : 0.001
Maximum Duration : 467.647
Duration Standard Deviation : 2.864
Duration (secs) Bins :-
[0,0.25) : 64621 (28.61%)
[0.25,0.5) : 58947 (26.09%)
[0.5,1) : 56894 (25.18%)
[1,1.5) : 19836 ( 8.78%)
[1.5,2) : 9397 ( 4.16%)
[2,3) : 7941 ( 3.52%)
[3,5) : 4810 ( 2.13%)
[5,10) : 2488 ( 1.10%)
[10,20) : 684 ( 0.30%)
[20,+inf) : 288 ( 0.13%)



Sphinx
Index Size: 60G

Concurrency: 30
Total Searches : 244431
Slow Query Count (>=10 secs) : 27479 (11.24%)
Total Duration (secs) : 1243474.213
Mean Duration : 5.087
Mode Duration : 0.007
Minimum Duration : 0.001
Maximum Duration : 1869.063
Duration Standard Deviation : 17.833
Average Queries : 2.783
Duration (secs) Bins :-
[0,0.25) : 51186 (20.94%)
[0.25,0.5) : 27798 (11.37%)
[0.5,1) : 32372 (13.24%)
[1,1.5) : 20490 ( 8.38%)
[1.5,2) : 16915 ( 6.92%)
[2,3) : 21833 ( 8.93%)
[3,5) : 23550 ( 9.63%)
[5,10) : 22808 ( 9.33%)
[10,20) : 14540 ( 5.95%)
[20,+inf) : 12939 ( 5.29%)

Concurrency: 5
Total Searches : 226528
Slow Query Count (>=10 secs) : 9895 ( 4.37%)
Total Duration (secs) : 453296.517
Mean Duration : 2.001
Mode Duration : 0.007
Minimum Duration : 0.001
Maximum Duration : 164.713
Duration Standard Deviation : 4.543
Average Queries : 2.773
Duration (secs) Bins :-
[0,0.25) : 71001 (31.34%)
[0.25,0.5) : 36500 (16.11%)
[0.5,1) : 32799 (14.48%)
[1,1.5) : 20416 ( 9.01%)
[1.5,2) : 16385 ( 7.23%)
[2,3) : 13951 ( 6.16%)
[3,5) : 13330 ( 5.88%)
[5,10) : 12251 ( 5.41%)
[10,20) : 7563 ( 3.34%)
[20,+inf) : 2332 ( 1.03%)
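For the curious, numbers like the slow-query count, mean duration and the duration bins above come from post-processing the search logs. A rough sketch of that kind of bucketing, assuming a plain text file with one duration (in seconds) per line, could look like this; it is not the actual analysis script.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Locale;

// Rough sketch: bucket per-query durations (one per line, in seconds) into the bins used above.
public class DurationBins {
    public static void main(String[] args) throws Exception {
        double[] edges = {0, 0.25, 0.5, 1, 1.5, 2, 3, 5, 10, 20, Double.POSITIVE_INFINITY};
        long[] counts = new long[edges.length - 1];
        long total = 0, slow = 0;
        double sum = 0;

        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = in.readLine()) != null) {
            double d = Double.parseDouble(line.trim());
            total++;
            sum += d;
            if (d >= 10) slow++;                     // "slow query" threshold used above
            for (int i = 0; i < counts.length; i++) {
                if (d >= edges[i] && d < edges[i + 1]) { counts[i]++; break; }
            }
        }
        in.close();

        System.out.printf(Locale.US, "Total Searches : %d%n", total);
        System.out.printf(Locale.US, "Slow Query Count (>=10 secs) : %d (%.2f%%)%n",
                slow, 100.0 * slow / total);
        System.out.printf(Locale.US, "Mean Duration : %.3f%n", sum / total);
        for (int i = 0; i < counts.length; i++) {
            System.out.printf(Locale.US, "[%s,%s) : %d (%.2f%%)%n",
                    edges[i], Double.isInfinite(edges[i + 1]) ? "+inf" : edges[i + 1],
                    counts[i], 100.0 * counts[i] / total);
        }
    }
}
```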

For the dataset analyzed, Lucene was found to win convincingly over its rival.
More details on the same to come soon!

P.S.: Lucene works great for a lot of cases, and so does Sphinx; here, Lucene simply seemed to have the upper hand.