Big Data Series: Vector Space Modelling, text analysis

Wikipedia defines Vector Space Models as: “an algebraic model for representing text documents (and any objects, in general) as vectors of identifiers, such as, index terms. It is used in information filtering, information retrieval, indexing and relevancy rankings” (source). So, we might want to use it to identify how important a certain word is to the overall theme of a document.

Note: text is unstructured data. It can be seen as a series of strings with no attributes or relationships.

Below are our three documents, uniquely named ‘document 1’, ‘document 2’ and ‘document 3’. For the example (to make the maths a bit easier later), we’ll say that we have a total of 24 documents, but we’ll work from the data we have in this subset.

So, what we can do here is create a term frequency (TF) matrix. As the name suggests, it logs how often each term appears in each document.
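As a rough illustration of how a term frequency matrix can be built, here is a minimal Python sketch. The three document texts are made-up stand-ins (they are not the documents from this post), and the `term_frequencies` helper is a hypothetical name chosen for the example. It splits on whitespace only, which is a simplifying assumption.

```python
from collections import Counter

# Hypothetical stand-ins for the three documents (not the originals).
documents = {
    "document 1": "the cat sat on the mat",
    "document 2": "the dog chased the cat",
    "document 3": "the dog sat on the log",
}


def term_frequencies(docs):
    """Return {document name: Counter mapping each term to its count}."""
    return {name: Counter(text.lower().split()) for name, text in docs.items()}


tf = term_frequencies(documents)

# Every distinct term across all documents, in alphabetical order.
vocabulary = sorted({term for counts in tf.values() for term in counts})

# Print one row per term and one column per document.
print("term".ljust(8) + "".join(name.ljust(12) for name in tf))
for term in vocabulary:
    print(term.ljust(8) + "".join(str(tf[name][term]).ljust(12) for name in tf))
```

A `Counter` returns 0 for terms it has never seen, so terms missing from a document show up as 0 in the matrix without any extra handling.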

Now, to derive value from this, we need to calculate the ‘inverse document frequency (IDF)’. IDF is one half of the tf-idf statistic, which Wikipedia describes as follows: “In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or body of text. It is often used as a weighting factor in searches of information retrieval, text mining, and user modelling. The tf-idf value increases proportionally to the number of times a word appears in the document, but is often offset by the frequency of the word in the body of text, which helps to adjust for the fact that some words appear more frequently in general.” Source.

So, to calculate our IDF we need to list each term in a table along with its total frequency across all 3 documents. For the purpose of this example, I’m just going to choose two terms:

You’re probably wondering what the formula in the IDF column is. Well, let’s step through it.

Log2(X) tells us the power to which we need to raise 2 to get X. So: if 2 cubed is 8, then Log2(8) = 3.

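So in our example, we have LOG2(24/3), where 24 is the total number of documents in our set and 3 is the total number of times our particular word appeared. (Strictly speaking, IDF is normally based on the number of documents that contain the term, rather than the total number of occurrences.)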

So in this case, we have LOG2(8), which equals 3, because 2 to the power of 3 equals 8.
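The calculation above can be checked with a couple of lines of Python. This is only a sketch: the `idf` function and its parameter names are hypothetical, chosen for the example, and it uses the plain LOG2(total / count) form from this post with no smoothing.

```python
import math


def idf(total_documents, term_count):
    """Inverse document frequency as log base 2 of (total / count)."""
    return math.log2(total_documents / term_count)


# 24 documents in the set, and a count of 3 for our term.
print(idf(24, 3))  # 3.0, because 2 ** 3 == 8
```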

The next step is to multiply our IDF figures against the TF figures in our previous table.
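To make the multiplication step concrete, here is a small Python sketch. The per-document counts for the term are invented for illustration (they simply add up to 3, matching the example above), and the variable names are assumptions, not part of the original tables.

```python
import math

TOTAL_DOCUMENTS = 24  # the total we assumed for the example

# Hypothetical term frequencies for a single term in each document.
term_counts = {"document 1": 1, "document 2": 2, "document 3": 0}

# IDF for the term: LOG2(24 / 3) = 3.0
term_idf = math.log2(TOTAL_DOCUMENTS / sum(term_counts.values()))

# Multiply each document's TF by the term's IDF to get its weight.
weights = {name: count * term_idf for name, count in term_counts.items()}

for name, weight in weights.items():
    print(name, weight)
```

With these made-up counts, the term scores highest in ‘document 2’, where it appears most often, and scores 0 in ‘document 3’, where it does not appear at all.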

The higher the IDF score, the rarer the word is in the body of text. For the sake of another example, let’s say that we have 100 documents and 86 have our term present in them. The IDF would be LOG2(100/86), which is LOG2(1.16), and hence the result is roughly 0.22 (a cool calculator is available here).

If we use the same example, but say that we have 100 documents and only 4 have our term present, we would have LOG2(100/4), which is LOG2(25), which results in 4.64.
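Both of these figures can be reproduced with Python’s standard `math.log2`. This is just a quick check of the arithmetic, and the variable names are made up for the example.

```python
import math

total_documents = 100

# Term present in 86 of 100 documents: a common term.
common_idf = math.log2(total_documents / 86)

# Term present in only 4 of 100 documents: a rare term.
rare_idf = math.log2(total_documents / 4)

print(round(common_idf, 2))  # 0.22
print(round(rare_idf, 2))    # 4.64
```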

From these two examples, you can see that the rarer a term is, the higher the IDF score. In the TF-IDF matrix, multiplying by this score gives us a weighting, or importance, for a term in each document.