Anshum's Blog: Information Retrieval and all things related: Searching off the RAM

Search engines are a lot about precision, recall and speed. These three factors pretty much define the quality of a search engine. I'd only talk about the last point here, speed. The time taken to search for a search engine is such a critical factor that an improvement of a few hundred milliseconds is of extreme importance to anyone associated with developing/designing search engines.
More often than not, as a short term gain, all of us look at putting in more money on the hardware to improve a system's performance. Though this might look like a solution, its bound to fail if you try to run away from actually fixing the application architecture, which happens to be the root cause for poorly performing applications generally.
For those who have already done whatever it takes to optimize the search process, here are a few ways that are generally used to host the search index on the RAM, in order to improve the search speed.
You may mount a tmpfs/ramfs on your machine and copy the index on it. You may then open index reader/searcher on this copy of the index. This would help in reducing the I/O latency and improve the search speed.
The difference between using tmpfs vs ramfs are:

ramfs will grow dynamically unlike tmpfs which is static and initialized with a fixed size.
tmpfs uses swap memory whereas ramfs doesn't.

I have personally used tmpfs and it works rather efficiently.
One thing to remember is that both tmpfs and ramfs are volatile. They both get erased on a system restart and hence you'd need to re-mount and copy the index on system startup.

Mounting tmpfs:
mkdir -p /mnt/tmp
mount -t tmpfs -o size=20m tmpfs /mnt/tmp

Mounting ramfs:
mkdir -p /mnt/ram
mount -t ramfs -o size=20m ramfs /mnt/ram

The other approach in case you're using lucene is to have the IndexReader use a Ram Directory. In this case you'd need a JVM instance with enough memory to load all of the index contents into memory and have more for processing search queries.
Also, that may translate to 'requiring'a 64-bit JVM so that it could use pseudo unlimited address space.

--Snip--

NIOFSDirectory nDir = new NIOFSDirectory(new File(indexDir));
RAMDirectory directory = new RAMDirectory(nDir);
IndexReader ir = IndexReader.open(directory);
IndexSearcher indexSearcher = new IndexSearcher(ir);

--Snip--

I haven't been really fond of using this approach as it forces a 64-bit architecture to be used but it may surely work as there's no overhead of using and manipulating the data manually as in the case of maintaining a tmpfs. The cleanups etc...

These are the basic 2 techniques to be used if you want your index to be fed off the RAM. It is a frequent question on the lucene users mailing list, so perhaps people can now stop asking that question... well.. almost...
All said and done.. don't stop optimizing the engine/app if your search is slow.. 99% of the times.. that is where it has to be handled.

Anshum's Blog: Information Retrieval and all things related

Tuesday, August 9, 2011

Searching off the RAM

No comments: