Anshum's Blog: Information Retrieval and all things related: August 2011

Monday, August 29, 2011

Indexing MongoDB Data for Solr

Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project.

MongoDB (from "humongous") is a scalable, high-performance, open source, document-oriented data store.

I was happy using MongoDB and my very own search engine, written by using/extending Lucene, until the trunks for Solr and Lucene were merged. This merge meant that Solr now used the same release of Lucene that I was using, unlike the past when there was some disconnect between the two. I realized that a lot of what I was trying to build was already available through Solr.

Though Solr is used by a lot of organizations (which can be found here) and I'm sure that at least a few of them are using Mongo, for some reason there was/is no straightforward, out-of-the-box import handler for data stored in MongoDB.

This made me search for a framework/module/plugin to do the same, but in vain.

All said and done, here's the way I was finally able to index my MongoDB data into Solr.

I've used SolrJ to access my Solr instance and the MongoDB Java driver to connect to Mongo. Having written my own sweet layer that has access to both elements of the app, I have been able to inject data as required.

--snip--

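import java.net.MalformedURLException;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

// Builds a SolrJ client for the Solr instance at http://<solrHost>:<solrPort>/solr
public SolrServer getSolrServer(String solrHost, String solrPort) throws MalformedURLException {
    String urlString = "http://" + solrHost + ":" + solrPort + "/solr";
    return new CommonsHttpSolrServer(urlString);
}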

--/snip--

Fire the Mongo query, iterate over the results and add each record to the index:

--snip--

SolrServer server = getSolrServer(..); // Get a server instance

DBCursor curr = ..; // Fire query @ mongo, get the cursor

while (curr.hasNext()) { // Iterate over the result set
    BasicDBObject record = (BasicDBObject) curr.next();
    // Do some magic, get a document bean (doc) out of the record

    server.addBean(doc);
}

server.commit(); // Make the added documents searchable

--/snip--
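
The "magic" step depends entirely on your data, so take the following only as a rough sketch of one way it could look, not as what my layer actually does. It relies on the fact that a BasicDBObject is a java.util.Map, so the record can be walked like any other map. The class name RecordFlattener, the dotted naming of nested fields and the renaming of "_id" to "id" are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: turns a (possibly nested) record into flat field/value pairs.
final class RecordFlattener {

    // Nested maps become "parent.child" keys; a top-level "_id" becomes a string "id".
    static Map<String, Object> flatten(Map<String, ?> record) {
        Map<String, Object> out = new LinkedHashMap<>();
        flattenInto("", record, out);
        return out;
    }

    private static void flattenInto(String prefix, Map<String, ?> record, Map<String, Object> out) {
        for (Map.Entry<String, ?> entry : record.entrySet()) {
            String key = prefix + entry.getKey();
            Object value = entry.getValue();
            if (value instanceof Map) {
                @SuppressWarnings("unchecked")
                Map<String, ?> nested = (Map<String, ?>) value;
                flattenInto(key + ".", nested, out); // recurse into the sub-document
            } else if (prefix.isEmpty() && "_id".equals(key)) {
                out.put("id", String.valueOf(value)); // assumed unique key field name
            } else {
                out.put(key, value);
            }
        }
    }
}
```

Each entry of the flattened map can then be copied onto whatever document object you hand to Solr, provided the field names match what your schema defines.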

This should get you started on indexing Mongo data into a running Solr instance.

Also, remember to configure Solr correctly for this to run smoothly. In particular, the fields you send have to be defined in the Solr schema.

Tuesday, August 9, 2011

Searching off the RAM

Search engines are a lot about precision, recall and speed. These three factors pretty much define the quality of a search engine. I'll only talk about the last one here: speed. Search latency is such a critical factor that an improvement of even a few hundred milliseconds is of extreme importance to anyone associated with developing/designing search engines.
More often than not, as a short term gain, all of us look at putting more money into hardware to improve a system's performance. Though this might look like a solution, it's bound to fail if you use it to run away from actually fixing the application architecture, which generally happens to be the root cause of poorly performing applications.
For those who have already done whatever it takes to optimize the search process, here are a few ways that are generally used to host the search index in RAM in order to improve the search speed.
You may mount a tmpfs/ramfs on your machine and copy the index onto it. You may then open an index reader/searcher on this copy of the index. This reduces the I/O latency and improves the search speed.
The differences between tmpfs and ramfs are:

ramfs grows dynamically and has no size limit, whereas tmpfs is capped at the size it is mounted with.
tmpfs pages can be swapped out, whereas ramfs never uses swap.

I have personally used tmpfs and it works rather efficiently.
One thing to remember is that both tmpfs and ramfs are volatile. They both get erased on a system restart and hence you'd need to re-mount and copy the index on system startup.

Mounting tmpfs:
mkdir -p /mnt/tmp
mount -t tmpfs -o size=20m tmpfs /mnt/tmp

Mounting ramfs (ramfs ignores the size option, so it is left out here):
mkdir -p /mnt/ram
mount -t ramfs ramfs /mnt/ram
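
Since the mount is empty after every restart, the copy step has to be repeated each time. Here is a minimal sketch of that step in plain Java, assuming the index lives in a single flat directory of files (which is how a Lucene index directory is normally laid out). The class name IndexToRam and any paths you pass in are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: copies the files of an on-disk index into a RAM-backed mount.
final class IndexToRam {

    // Copies every regular file in source into target and returns how many were copied.
    static int copyIndex(Path source, Path target) throws IOException {
        Files.createDirectories(target);
        List<Path> files;
        try (Stream<Path> listing = Files.list(source)) {
            files = listing.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        for (Path file : files) {
            Files.copy(file, target.resolve(file.getFileName().toString()),
                    StandardCopyOption.REPLACE_EXISTING);
        }
        return files.size();
    }
}
```

You would call copyIndex with the on-disk index directory and a directory under /mnt/tmp (or /mnt/ram) once at startup, and only then open the reader/searcher on the copy.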

The other approach, in case you're using Lucene, is to have the IndexReader use a RAMDirectory. In this case you'd need a JVM instance with enough memory to load all of the index contents into memory and still have more left for processing search queries.
Also, that may translate to 'requiring' a 64-bit JVM so that it can use a practically unlimited address space.

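--snip--

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.*;

NIOFSDirectory nDir = new NIOFSDirectory(new File(indexDir)); // the index on disk
RAMDirectory directory = new RAMDirectory(nDir); // copy all of it onto the heap
IndexReader ir = IndexReader.open(directory);
IndexSearcher indexSearcher = new IndexSearcher(ir);

--/snip--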

I haven't been really fond of this approach as it forces a 64-bit architecture, but it certainly works, and it spares you the overhead of manually managing the data (the copying, the cleanups etc.) that comes with maintaining a tmpfs.

These are the two basic techniques to use if you want your index to be served off the RAM. It is a frequent question on the Lucene users mailing list, so perhaps people can now stop asking that question... well... almost...
All said and done, don't stop optimizing the engine/app if your search is slow. 99% of the time, that is where it has to be handled.

Thursday, August 4, 2011

Dealing with High Dimensional Data

I believe that data is best represented as a vector. For those who haven't heard this before, well let me start with a very basic example of 'how' this is done.

Let's assume a corpus of documents which has only 2 unique words (A, B) in its dictionary (if that was hard to follow, comment and I shall follow up with what that means). Now a document containing only a single occurrence of 'A' is a unit vector along the direction of 'A', and likewise for a document containing only a single occurrence of 'B'. Documents with 'x' As and 'y' Bs can hence be represented as:

x a + y b (a and b are unit vectors along A and B).
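
To make that concrete, here is a small sketch that builds such a vector as a map from term to count. Whitespace tokenization and the class name TermVector are my assumptions for the example, a real analyzer would do a lot more.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: represents a document as a term -> count vector.
final class TermVector {

    // Each distinct term is one dimension; its count is the component along it.
    static Map<String, Integer> of(String text) {
        Map<String, Integer> vector = new TreeMap<>();
        for (String term : text.trim().split("\\s+")) {
            if (!term.isEmpty()) {
                vector.merge(term, 1, Integer::sum);
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        // A document with 3 As and 2 Bs, i.e. 3a + 2b
        System.out.println(of("A B A A B")); // prints {A=3, B=2}
    }
}
```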

When the corpus consists of documents in which a lot of terms have a very low document frequency, it is referred to as high dimensional data. An example would be a list of proper nouns, e.g. hotel names.

High dimensional data poses a lot of issues, primarily due to its sparseness in the vector space. The sparseness makes a lot of tasks like clustering and tagging challenging. In order to process this data, more often than not, there is a need to reduce the dimension of the documents (and with it the sparseness). I'll discuss a relatively easy way to reduce the dimension of such data.

Given a corpus of high dimensional data, create a document vector for each document. Create a term frequency matrix for the corpus and follow it up by dropping all terms that occur in fewer than 10% of the documents (the threshold might vary as per the corpus/dataset). Statistically this should remove around 60% of the terms.

Also, removing the terms that occur in more than 80% of the documents would lead to removing a considerable ratio of terms that are redundant and too frequent. Such terms are generally tagged as stop-words and removed under all normal data/text processing algorithms.

The residue that remains now is of a considerably reduced dimension. This is a straightforward way of projecting the original data onto a lower dimensional subspace: the one spanned by the terms that were retained.
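
The two cut-offs can be sketched as below. It is only an illustration under my own assumptions: each document is reduced to its set of distinct terms (document frequency only needs presence), and the class and method names are made up for the example. The 10% and 80% thresholds are passed in as ratios so they can be tuned per corpus.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: prunes dimensions (terms) by document frequency.
final class DocFrequencyPruner {

    // Returns the terms whose document frequency ratio lies within [minRatio, maxRatio].
    static Set<String> retainedTerms(List<Set<String>> docs, double minRatio, double maxRatio) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (Set<String> doc : docs) {
            for (String term : doc) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : docFreq.entrySet()) {
            double ratio = (double) entry.getValue() / docs.size();
            if (ratio >= minRatio && ratio <= maxRatio) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    // Projects one document onto the retained dimensions.
    static Set<String> project(Set<String> doc, Set<String> kept) {
        Set<String> projected = new TreeSet<>(doc);
        projected.retainAll(kept);
        return projected;
    }
}
```

Calling retainedTerms(docs, 0.10, 0.80) gives the surviving dimensions, and project then rewrites each document in terms of them.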

This data can now be consumed for any further processing, viz. clustering, classification etc.

Posts on how to cluster and on various clustering techniques will follow soon... unlike this one, which took ages!