Optimizing Duplicate Document Detection in Apache Spark
Recently, I’ve been working on detecting duplicate or near duplicate documents. One of the visible group of users of duplicate detection are search engines.
The core technique for detecting duplicates is straightforward:
 Represent documents as vectors
 Compute the distances between all pairs of vectors
 Return those document pairs whose vectors had distances less than some threshold
The devil is in the details, as usual. How you preprocess the documents’ text and vectorize the documents can significantly impact accuracy. I’m using the same data set and approach I discussed in my talk at Spark Summit East 2016.
From a performance perspective, most applications use hashing techniques such as simhash, which can avoid allpairs calculations. I wanted to gain a better understanding of the trade offs between how vectors are weighted, the choices of distance functions, and the resulting distribution of distances, so I chose to not to use hash functions for now.
Thankfully, allpairs calculations are pleasantly parallel. I was able to accelerate my implementation easily with Apache Spark.
Along the way, I learned a few lessons which may be useful to others:

Distances between similar but not duplicate documents are larger with normalized binary occurrence vectors than term frequencyinverse document frequency vectors. Whereas the topic modeling work emphasized grouping documents with some similarities, we want to be restrict our predictions to documents that are guaranteed to be similar. Binary occurrence vectors are a better choice here.

Uncached RDDs kill the performance of
RDD.cartesian(...)
. The Cartesian product of an RDD with itself uses the RDD twice, causing the RDD to be calculated multiple times. By caching my RDD before computing the Cartesian product, I reduced my jobs’ run time from 2.5 hours to just 10 minutes! 
It can be more efficient to use collect then sort than to use
RDD.sortBy(...)
followed by a collect. In my case, I’m returning tens or hundreds of thousands of pairs of document labels and distances. These pairs can easily fit into memory and be sorted quickly on the master. TheRDD.sortBy(...)
method required an extra stage of computation and shuffling. 
When using
RDD.cartesian(...)
on an RDD with itself, you’ll get 2 pairs for each pair of documents, including pairs of documents with themselves. You’ll need to follow theRDD.cartesian(...)
operation with anRDD.filter()
operation to restrict the resulting pairs.