Thursday, August 4, 2011

Dealing with High Dimensional Data

I believe that data is best represented as a vector. For those who haven't heard this before, let me start with a very basic example of how this is done.

Let's assume a corpus of documents whose dictionary contains only two unique words, A and B (if that was hard to follow, comment and I shall follow up with what that means). A document containing only a single occurrence of 'A' is a unit vector along the direction of A, and likewise for a document containing only a single occurrence of 'B'. A document with x occurrences of A and y occurrences of B can hence be represented as:
x a + y b (where a and b are unit vectors along A and B).
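
As a minimal sketch of that idea in Python (the two-word vocabulary and the to_vector helper are illustrative names of my own, not from any library):

from collections import Counter

# Hypothetical two-word dictionary, in a fixed order
vocabulary = ["A", "B"]

def to_vector(document):
    # Count each vocabulary term; the counts are the coordinates
    # along the unit vectors a and b from the formula above.
    counts = Counter(document.split())
    return [counts[term] for term in vocabulary]

print(to_vector("A A A B B"))  # -> [3, 2], i.e. 3a + 2b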

When the corpus consists of documents with a lot of terms that each have a very low document frequency, it is referred to as high dimensional data. An example would be a list of proper nouns, e.g. hotel names.

High dimensional data poses a lot of issues, primarily due to its sparseness in the vector space. This sparseness makes tasks like clustering and tagging challenging. To process such data, more often than not there is a need to reduce the dimension of the documents. I'll discuss a relatively easy way to reduce the dimension of such data.

Given a corpus of high dimensional data, create document vectors for each document and build a term frequency matrix for the corpus. Then drop all terms that occur in less than 10% of the documents (the exact threshold might vary as per the corpus/dataset); statistically this should remove around 60% of the terms.
Also, removing the terms that occur in more than 80% of the documents eliminates a considerable share of terms that are redundant and too frequent. Such terms are generally tagged as stop-words and removed in most standard text processing pipelines. A sketch of both filters follows below.
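
Here is a rough sketch of both filters (the function name and toy corpus are made up for illustration; the 10% and 80% cut-offs are the ones suggested above and should be tuned per corpus):

from collections import Counter

def filter_by_document_frequency(docs, min_df=0.10, max_df=0.80):
    # docs: a list of documents, each a list of terms
    n = len(docs)
    # Document frequency: the number of documents each term appears in
    df = Counter(term for doc in docs for term in set(doc))
    # Keep only terms whose document frequency falls within [min_df, max_df]
    kept = {t for t, c in df.items() if min_df <= c / n <= max_df}
    # Project each document onto the retained dimensions
    return [[t for t in doc if t in kept] for doc in docs], kept

docs = [d.split() for d in ["A A B", "B C", "A C D", "C E"]]
# On a corpus this tiny the lower threshold needs exaggerating:
# with min_df=0.4, the rare terms D and E (df = 0.25) are dropped,
# while A, B (0.5) and C (0.75) survive both cut-offs.
reduced, vocab = filter_by_document_frequency(docs, min_df=0.4)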

The residue that remains is of considerably reduced dimension. This amounts to a straightforward projection of the original data onto a lower-dimensional subspace: the one spanned by the terms that were retained.

This data can now be consumed by any downstream processing, e.g. clustering, classification, etc.

Posts on how to cluster and various clustering techniques should follow soon... unlike this one, which took ages!
