Friday, October 21, 2011

Computing Document Similarity using Lucene Term Vectors

Someone asked me a question recently about implementing document similarity, and since he was using Lucene, I pointed him to the Lucene Term Vector API. I hadn't used the API myself, but I knew in general how it worked, so when he came back and told me that he could not make it work, I wanted to try it out for myself, to give myself a basis for asking further questions.

I already had a Lucene index (built by SOLR) of about 3,000 medical articles, with term vectors enabled on the content field as part of something I was trying out for highlighting, so I decided to use that. If you want to follow along and have to build your index from scratch, you can either use a field definition in your SOLR schema.xml file similar to this:

   <field name="content" type="text" indexed="true" stored="true" 
       termVectors="true" termPositions="true" termOffsets="true"/>

or enable term vectors in your Lucene indexing code like so:

  doc.add(new Field("content", content, Store.YES, Index.ANALYZED,
    TermVector.WITH_POSITIONS_OFFSETS));

Note that you don't need positions and offsets enabled on the term vectors for what I am doing here, but I already had them on in my schema.xml file, so (to avoid confusion) I show the equivalent Lucene call in the code snippet above.
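
If you only need the term frequencies themselves, positions and offsets can be dropped. A minimal variant of the call above (a sketch, assuming the same Lucene 3.x Field API) would be:

  doc.add(new Field("content", content, Store.YES, Index.ANALYZED,
    TermVector.YES));

The SOLR equivalent is to set only termVectors="true" on the field.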

Once you have the index, find the list of all the terms in the "content" field across the entire index. These terms represent the dimensions of the document vector. Then, for each document you want to consider in your similarity computation, extract its term vector. The term vector gives you two parallel arrays: the terms that occur in the document, and the corresponding frequency of each term in the document. Using these three data structures (the index-wide term list plus the two per-document arrays), it is easy to construct a (sparse) document vector for each document.

Once you have two document vectors, the similarity between them can be computed using Cosine Similarity (or any other measure - most of them work on document vectors).
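
For reference, the cosine similarity between two document vectors d1 and d2 is cosim(d1, d2) = (d1 · d2) / (||d1|| * ||d2||), i.e., the dot product of the two vectors divided by the product of their L2 norms; this is exactly what the getCosineSimilarity() method in the code below computes.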

For my test, I chose 4 documents out of my corpus, the titles of which (and their associated Lucene document IDs) are shown below:

Document#0: Mitral valve surgery - minimally invasive (31825)
Document#1: Mitral valve surgery - open (31835)
Document#2: Laryngectomy (31706)
Document#3: Shaken baby syndrome (30)

Notice that the first two are both about surgery and about the same body part, so intuitively I should expect a very high similarity between the documents 0 and 1. Document 2 is still about surgery, so I should expect a good, but smaller similarity value between documents 0 and 2. Document 3 is completely different, so I should expect an even smaller similarity value between documents 0 and 3.

The code for this is shown below (as a JUnit test).

package com.mycompany.termpos;

import java.io.File;
import java.util.HashMap;
import java.util.Map;

import org.apache.commons.math.linear.OpenMapRealVector;
import org.apache.commons.math.linear.RealVectorFormat;
import org.apache.commons.math.linear.SparseRealVector;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.store.FSDirectory;
import org.junit.Assert;
import org.junit.Test;

public class TermVectorBasedSimilarityTest {

  @Test
  public void testSimilarity() throws Exception {
    IndexReader reader = IndexReader.open(
      FSDirectory.open(new File("/path/to/my/index")));
    // first find all terms in the index
    Map<String,Integer> terms = new HashMap<String,Integer>();
    // reader.terms(t) positions the enum at the first term on or after t,
    // so examine the current term before calling next(), otherwise the
    // first "content" term gets skipped
    TermEnum termEnum = reader.terms(new Term("content", ""));
    int pos = 0;
    do {
      Term term = termEnum.term();
      if (term == null || ! "content".equals(term.field())) {
        break;
      }
      terms.put(term.text(), pos++);
    } while (termEnum.next());
    termEnum.close();
    int[] docIds = new int[] {31825, 31835, 31706, 30};
    DocVector[] docs = new DocVector[docIds.length];
    int i = 0;
    for (int docId : docIds) {
      // one TermFreqVector per field with term vectors enabled; in this
      // index only the "content" field has them
      TermFreqVector[] tfvs = reader.getTermFreqVectors(docId);
      Assert.assertTrue(tfvs.length == 1);
      docs[i] = new DocVector(terms); 
      for (TermFreqVector tfv : tfvs) {
        String[] termTexts = tfv.getTerms();
        int[] termFreqs = tfv.getTermFrequencies();
        Assert.assertEquals(termTexts.length, termFreqs.length);
        for (int j = 0; j < termTexts.length; j++) {
          docs[i].setEntry(termTexts[j], termFreqs[j]);
        }
      }
      docs[i].normalize();
      i++;
    }
    // now get similarity between doc[0] and doc[1]
    double cosim01 = getCosineSimilarity(docs[0], docs[1]);
    System.out.println("cosim(0,1)=" + cosim01);
    // between doc[0] and doc[2]
    double cosim02 = getCosineSimilarity(docs[0], docs[2]);
    System.out.println("cosim(0,2)=" + cosim02);
    // between doc[0] and doc[3]
    double cosim03 = getCosineSimilarity(docs[0], docs[3]);
    System.out.println("cosim(0,3)=" + cosim03);
    reader.close();
  }
  
  private double getCosineSimilarity(DocVector d1, DocVector d2) {
    return (d1.vector.dotProduct(d2.vector)) /
      (d1.vector.getNorm() * d2.vector.getNorm());
  }

  private class DocVector {
    public Map<String,Integer> terms;
    public SparseRealVector vector;
    
    public DocVector(Map<String,Integer> terms) {
      this.terms = terms;
      this.vector = new OpenMapRealVector(terms.size());
    }
    
    public void setEntry(String term, int freq) {
      if (terms.containsKey(term)) {
        int pos = terms.get(term);
        vector.setEntry(pos, (double) freq);
      }
    }
    
    public void normalize() {
      // scale by the L1 norm so that differences in document length
      // do not dominate the similarity computation
      double sum = vector.getL1Norm();
      vector = (SparseRealVector) vector.mapDivide(sum);
    }
    
    public String toString() {
      RealVectorFormat formatter = new RealVectorFormat();
      return formatter.format(vector);
    }
  }
}

The results of running this test are shown below, and they match my expectations:

  cosim(0,1)=0.9672563304581603
  cosim(0,2)=0.7496880467405691
  cosim(0,3)=0.23968406019250035

Using Lucene's term vectors to generate document vectors can be useful not only for similarity calculations, but also for other tasks where you need document vectors, such as clustering and classification. Computing document vectors from raw data is typically more involved (it requires development time and can have scalability issues) than loading the data into a Lucene index.

The other scalability issue is the size of the document vectors. My index is relatively tiny (about 3,000 documents), but we are still talking about 10,000 unique terms after stopwording and stemming. However, document vectors are typically quite sparse (my test document vectors have an average density of 0.01), so an OpenMapRealVector is a good choice of data structure, and it should scale to much larger sizes as well.
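
As an aside, by density I mean the fraction of non-zero entries in a document vector. A minimal helper to compute it (a sketch against the commons-math RealVector API used above; the helper is not part of the test code) might look like this:

  // fraction of non-zero entries in a document vector (0.0 to 1.0)
  private double density(SparseRealVector v) {
    int nonZero = 0;
    for (int i = 0; i < v.getDimension(); i++) {
      if (v.getEntry(i) != 0.0D) {
        nonZero++;
      }
    }
    return ((double) nonZero) / v.getDimension();
  }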

25 comments (moderated to prevent spam):

Anonymous said...

In this example there are only 4 documents being tested. But when I tried to compute the cosine similarity of about 100 documents from my Lucene index, I got an error:


"Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at org.apache.lucene.index.TermVectorsReader.readTermVector(TermVectorsReader.java:499)
at org.apache.lucene.index.TermVectorsReader.readTermVectors(TermVectorsReader.java:383)
at org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:345)
at org.apache.lucene.index.SegmentReader.getTermFreqVectors(SegmentReader.java:813)
at org.apache.lucene.index.DirectoryReader.getTermFreqVectors(DirectoryReader.java:502)"

I don't know what is happening. Is there any solution to this?
I wanted to find the cosine similarity between all the documents in my index.

Sujit Pal said...

The DocVector class in the example is inefficient for a large number of documents. Each instance creates its own terms map; try pulling this out into a static field, or operate on the SparseRealVector directly.

Pawan Kumar said...

Thanks for sharing this great piece of information. It is very helpful for me.

Sujit Pal said...

You are welcome, Pawan.

Nesli said...

Hello,

I have to get the similarities between documents. Particularly between movie-descriptions, to generate recommendations.

I have created an index with Solr. (With Lucene it was too difficult for me, I'm a novice.)

I have tried to get the cosine similarity with this code, but there are some failures.

For the cosine test I have indexed 3 example documents (movie contents) with only two fields: id and content.

First, what is the difference between the types "text" and "text_general"? When I tried to set the type to "text", Solr didn't start. So I changed it to text_general.

In FSDirectory.open(new File("/path/to/my/index")) I've set the path to the Solr index folder "apache-solr-3.5.0\\example\\solr\\data\\index".

Is that where the mistake is?

Do you have any tips?
Please help.

Sujit Pal said...

Hi Nesli, from the SOLR schema example, the text_general field type is missing the termVectors setting, which is required by the code (since it reads term vectors, which must be present in the index).

But since you are using SOLR, and all you care about is showing "similar" or "more-like-this" docs, you may want to consider just using SOLR's built-in More Like This feature. That may be easier to do (it also requires term vectors, but no coding, only configuration).
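
For example (a sketch, not a tested configuration), with termVectors enabled on the content field, a More Like This request against the standard search handler might look something like this, where YOURID is a placeholder for the unique key of the seed document:

  http://localhost:8983/solr/select?q=id:YOURID&mlt=true&mlt.fl=content&mlt.mintf=1&mlt.mindf=1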

Anonymous said...

Hi Sujit,
many thanks for your help. I have tried it with MLT. It was helpful, I received similar movies.
But for my thesis I have to select the tf-idf values of the movie descriptions to compute the cosine similarity myself.
I must write a cosine similarity calculator in Java anyway.
My current status is that I got the tf-idf values from the Solr XML response and put them in hash maps.
Now my problem is: every movie has a genre (often more than one).
It's like the "cat" field (category) in the exampledocs with ipod/monitor etc., and it's an important point.
How can I integrate this factor?
If I don't involve the genre of a movie, I will compute the similarity based only on its description terms.
Do you have an approach? Or should I just enable the "termVector" attribute for the genre, like for the film description,
and add the "genre" term values to my computation?
Regards

Sujit Pal said...

Hi, thanks, and you are welcome. Regarding the genre, it really depends on the application, but one approach could be to compute an additional measure of similarity based on genre overlap (something like Jaccard's coefficient), and then make your similarity = a * cosine_similarity + genre_similarity, where a is a weighting factor (maybe 1).
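
A minimal sketch of that weighting (assuming genres are available as java.util.Set<String> collections; the method names are just for illustration):

  // Jaccard coefficient over two genre sets: |intersection| / |union|
  private double jaccard(Set<String> g1, Set<String> g2) {
    Set<String> union = new HashSet<String>(g1);
    union.addAll(g2);
    Set<String> intersection = new HashSet<String>(g1);
    intersection.retainAll(g2);
    return union.isEmpty() ? 0.0D :
      ((double) intersection.size()) / union.size();
  }

  // overall similarity = a * cosine similarity + genre similarity
  private double similarity(double a, double cosim, double genreSim) {
    return a * cosim + genreSim;
  }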

Unknown said...

Hi Sujit Pal,

Right now I am working on Nutch 1.4. How can I achieve something similar in Nutch? I want access to the term frequency vector corresponding to each document while crawling (not indexing).

I believe the Nutch indexer must be using term vectors for finding the scores. As indexing has now moved to SOLR, where can I find the code for this - in Nutch or in Solr?
Please help.

Sujit Pal said...

Hi Unknown, you can do the same thing with Nutch's Solr. You need to enable termVectors on the "content" field as noted towards the beginning of the post. Underlying Solr is a Lucene index, so you can run the code against the underlying index to find inter-document similarity. But this is /after/ indexing, rather than during crawling or indexing as you are looking for.

I don't think there is a way to do this while crawling in Nutch (or any crawler), as far as I know. As for scoring, it's all governed by Lucene, and you can modify it to some extent by overriding Lucene's similarity, but the default implementation is based on cosine similarity and does a pretty good job for most use cases.

vijith said...

Thanks for the reply. Pardon me if this is out of context.

Since I am doing focused crawling with Nutch, I need to find the similarity during the crawl to update the score (set by the OPIC plugin).

So I thought of implementing the similarity calculation in a scoring plugin, and I want to see the code in Solr before I do that.

Sujit Pal said...

Hi Vijith, already replied on the discussion thread on the gora mailing list.

Aida said...

Oh man, whenever I needed to get something done with Lucene, sooner or later I ended up on your blog. From now on, I'll not google any more, but come straight here :)
Thanks for all the articles, I learned a lot from them!

Sujit Pal said...

Thanks Aida, glad it helped.

Anonymous said...

Hi Sujit,

Thank you so much for posting this article, just what I was looking for... I love your blog posts - they make the concepts easy to understand.

I do have a question - I have been trying MoreLikeThis (MLT) within Solr. In MLT, how can the similarity scores used for the calculation be retrieved?

Say, if the 4 articles were added to Solr, how can the same end result you have achieved here be obtained within Solr, i.e. how can I use MLT to get
Article1 is related to Article2 by <<>
Article1 is related to Article3 by <<>
etc.

Thanks in advance for your help.

Sujit Pal said...

Thanks Anonymous. For the MLT score, shouldn't this be possible by explicitly declaring the required fields from the similar docs (using fl=score,...)?

I just found a very interesting writeup on how MLT works; it has some more configurable parameters than my approach, such as the ability to drop words below a certain length, etc. So MLT is probably preferable to my approach unless you are just trying to figure out what's going on.

astro said...

Is the normalization necessary?
What happens if we do not call docs[i].normalize()?

Sujit Pal said...

I think it is, and here's why. The docs[i].normalize() call scales all the document vectors down to unit size, so the effect of differences in document length is removed. Usually the similarity is used to answer the question "how thematically different is this document from this other document", and document size is not important in that answer.

Raje said...

Thanks for your article!

I'm doing a project based on XML!

I want to calculate the weight of an XML document!

Is it possible by means of this cosine similarity?

Sujit Pal said...

Hi Raje, you are welcome. Cosine similarity measures the similarity between two documents, or between a query and a document. To calculate the absolute weight of a document, you could probably just use the square root of the sum of the squares of its individual term dimensions (think Pythagoras' theorem in n dimensions). Since you mention XML, I am guessing there is something specific to XML, perhaps tag boosts (ie, weight words occurring within <title> tags higher, etc) that you will also want to consider.
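
In terms of the code in the post, that weight is just the L2 norm of the document vector, so (a one-line sketch against the commons-math API used above, where docVector is a DocVector built as in the test):

  // "weight" of a document = square root of the sum of squares of its entries
  double weight = docVector.vector.getNorm();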

Peter Lavin said...

Hi Sujit,

Can you suggest how I would do the same thing but use terms from two fields in each document?

Peter

Sujit Pal said...

Already answered via email, but putting it in here also, because the question was interesting and for others who may have similar questions. One way could be to union the {term, freq} maps for "content" and "otherfield", and then use the resulting map for the DocVector constructor. Then change the DocVector.setEntry() method to be additive, ie, if a term already exists then its frequency is added to rather than reset.
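
A minimal sketch of that additive change to DocVector.setEntry() (under the assumptions above) could be:

    public void setEntry(String term, int freq) {
      if (terms.containsKey(term)) {
        int pos = terms.get(term);
        // add to any frequency already set from another field's term
        // vector, instead of overwriting it
        vector.setEntry(pos, vector.getEntry(pos) + freq);
      }
    }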

Yolanda Sjafrizal said...

Thank you very much for your post. It is very helpful for my final project.

I am very new to using Lucene, so my question may be a little bit silly.
I am using Lucene 3.6.1, and when I use this program, it says it cannot find the class SparseRealVector.

Is it because I'm using Lucene with this version?

Praveen said...

Hi,

I have a query somewhat similar to this post.

I would like to interact with the Lucene APIs using term vectors, for both indexing and querying.

Let me clarify my problem. Suppose I have a term vector representation for each document in my corpus:

doc_1 = {(vocab_1, floatVal_1); (vocab_2, floatVal_2); ... (vocab_k, floatVal_k)}
.
.
doc_n = {(vocab_1, floatVal_1); (vocab_2, floatVal_2); ... (vocab_k, floatVal_k)}

Here "vocab_1" is a term from the vocabulary, "floatVal_1" is the custom normalized score calculated for the corresponding term, and "k" is the size of the vocabulary.

Once I have created the index, I would like to query it in a similar fashion, where I produce a term vector representation for the query and retrieve the results.

I searched the internet for help on this topic and got to know about the use of payloads in Lucene, especially for indexing, thinking that payloads could be a substitute for the float values that I computed. But it seems that payloads are not supported for querying purposes. Altogether I am not really convinced whether I should use payloads, as they are quite memory intensive.

Thanks in advance.

Sujit Pal said...

Hi Praveen, if I understand correctly, the functionality you are looking for is natively supported by Lucene - it will allow you to search for /any/ word in your documents and retrieve documents by score. If you would like to restrict this to only the terms in your controlled vocabulary, you could build a custom TokenFilter that only passes these words (or their stems) through, and include it in your analyzer chain. The scoring in this case would be Lucene's default. If you want custom scores for your CV terms, take a look at function queries; you can modify the ranking based on field contents. Payloads would only be needed if you wanted to score metadata (such as annotations) differently from the main text.
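
A minimal sketch of such a controlled-vocabulary TokenFilter (assuming the Lucene 3.x analysis API; the class name VocabFilter is just for illustration):

  import java.io.IOException;
  import java.util.Set;

  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

  // passes through only those tokens that appear in the controlled vocabulary
  public final class VocabFilter extends TokenFilter {

    private final Set<String> vocab;
    private final CharTermAttribute termAttr =
      addAttribute(CharTermAttribute.class);

    public VocabFilter(TokenStream input, Set<String> vocab) {
      super(input);
      this.vocab = vocab;
    }

    @Override
    public boolean incrementToken() throws IOException {
      while (input.incrementToken()) {
        if (vocab.contains(termAttr.toString())) {
          return true;
        }
      }
      return false;
    }
  }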