Big Data Series: Getting hands on with Data Vector Models

In this article, we’re going to get hands-on with Apache Lucene. This tool is suitable for any application that requires full-text indexing and searching capabilities. We can use Lucene to build recommendation systems, e.g. ‘if you like this document, you should try this one’.

These guidelines assume that Apache Lucene is already installed. If you’re using the Cloudera VM we used in previous articles, it will be.

So, first off, we need to import files from our ‘data’ directory and load them into Lucene. We do that with ‘./ data’
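To give a feel for what loading files into an index means, here is a minimal Python sketch of an inverted index, the kind of structure a full-text engine builds when it ingests documents. This is not Lucene’s implementation or API. The helper names (tokenize, build_index, load_directory), the ‘.txt’ file filter and the sample documents are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from pathlib import Path
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to {doc_id: number of occurrences in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return index


def load_directory(path):
    """Read every .txt file in `path` into a {file name: contents} dict.

    The '.txt' filter is an assumption; adjust it to match your own files.
    """
    return {
        p.name: p.read_text(encoding="utf-8", errors="ignore")
        for p in sorted(Path(path).glob("*.txt"))
    }


# Hypothetical in-memory documents standing in for the files in 'data'.
docs = {
    "doc0.txt": "News at ten: election news and more news.",
    "doc1.txt": "Weather report for the weekend.",
}
index = build_index(docs)
print(index["news"])  # {'doc0.txt': 3}
```

To index real files instead of the sample dictionary, you could pass the result of load_directory('data') to build_index.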

Now, we need to enter a query term to search for in the documents. In this case, I’ve used ‘news’. As a result, you can see the score for each document.
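As a rough illustration of how a single-term query can produce one score per document, the sketch below ranks three toy documents with a textbook TF-IDF weighting. Lucene’s real scoring formula differs in its details, so treat the numbers as illustrative only. The function name tf_idf_scores, the smoothing choice and the sample documents are assumptions.

```python
import math
from collections import Counter


def tf_idf_scores(term, docs):
    """Return {doc_id: tf * idf} for one query term over tokenized documents."""
    n_docs = len(docs)
    doc_freq = sum(1 for tokens in docs.values() if term in tokens)
    if doc_freq == 0:
        # The term appears nowhere, so every document scores zero.
        return {doc_id: 0.0 for doc_id in docs}
    # Adding 1 keeps the weight positive even if every document has the term.
    idf = math.log(n_docs / doc_freq) + 1.0
    return {
        doc_id: Counter(tokens)[term] * idf
        for doc_id, tokens in docs.items()
    }


# Hypothetical tokenized documents, not the article's data set.
docs = {
    "doc#0": "news news election".split(),
    "doc#1": "weather news".split(),
    "doc#2": "sports results".split(),
}

scores = tf_idf_scores("news", docs)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['doc#0', 'doc#1', 'doc#2']
```

A document that never mentions the term scores zero, and a document that mentions it twice scores twice as high as one that mentions it once.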

We can dig a little deeper if we want and take a look at the TF-IDF values. You can compare this against the image above. Originally, doc#0 had a score of 0.02. Now that we’ve multiplied it by the term frequency (TF) table, it has a total value of 1.42, which suggests that the term ‘news’ appeared 71 times (1.42 / 0.02 = 71).
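The arithmetic behind that conclusion can be checked in a couple of lines. This sketch assumes the total is a plain product of the original score and the term frequency, which is how the paragraph above describes it; the variable names are made up for illustration.

```python
base_score = 0.02   # doc#0's original score, as quoted in the article
total_score = 1.42  # doc#0's value after multiplying by the term frequency

# Assuming total_score = base_score * term_frequency, solve for the frequency.
term_frequency = round(total_score / base_score)
print(term_frequency)  # 71
```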

In some cases, we may want to search for two terms. In this case, I want to search for ‘news’ and ‘elections’, but I want ‘news’ to be more heavily weighted. That is, if a document contains ‘news’ but not ‘elections’, I still want it to have a better score, and ‘news’ should always be the more important term. We use ‘term1^factor term2’ to do this, with the boost factor attached directly to the term it applies to.
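To see the effect of a boost, here is a toy Python model of a weighted two-term query. It only mimics the idea of a boost by multiplying each term’s occurrence count by its factor. It is not Lucene’s query parser or scoring formula, and the boost of 2, the function names and the sample documents are assumptions.

```python
def parse_query(query):
    """Split 'news^2 elections' into [('news', 2.0), ('elections', 1.0)]."""
    terms = []
    for token in query.split():
        term, _, factor = token.partition("^")
        # A term without an explicit factor gets a neutral boost of 1.
        terms.append((term, float(factor) if factor else 1.0))
    return terms


def score(query, tokens):
    """Sum boost * occurrence count over all query terms."""
    return sum(boost * tokens.count(term) for term, boost in parse_query(query))


# Hypothetical tokenized documents.
only_news = "news from the capital".split()
only_elections = "elections held today".split()
both = "elections news roundup".split()

query = "news^2 elections"
print(score(query, only_news))       # 2.0
print(score(query, only_elections))  # 1.0
print(score(query, both))            # 3.0
```

With the boost in place, the document that mentions only ‘news’ outranks the one that mentions only ‘elections’, while a document with both terms still scores highest.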