Plagiarism detection with MinHash: Part 1


I recently encountered the MinHash algorithm, which gave me an idea for a project. You can read the original paper; what follows is an informal overview. You can also find the source code for the resulting project on GitHub.

The MinHash Algorithm

The abstract problem that MinHash solves is determining set similarity: how much do two sets have in common? The standard way to make this precise is the Jaccard coefficient. Given sets \(A\) and \(B\), it is defined by

\[ J(A, B) = \frac{\left|A \cap B\right|}{\left|A \cup B\right|}\]

Or, in words, the fraction of the two sets' union that lies in their intersection. This probably seems like it should be an easy problem to solve: just represent the two sets as hash tables and compute the sizes of their intersection and union in the obvious way:

def similarity(l1, l2):
    # Exact Jaccard similarity of the sets of elements in l1 and l2.
    A = set(l1)
    B = set(l2)
    return len(A.intersection(B)) / len(A.union(B))

Indeed that works, but it does not scale very well. Let \(k\) be the number of sets, and \(n\) the size of the largest set. Then the method above requires \(O(kn)\) time to put everything in hash tables, and \(O(n)\) time for each of \(O(k^2)\) pairs of sets. In all, that is \(O(k^2n)\). The space to hold all of these hash tables, meanwhile, is \(O(kn)\). Of course, for large enough \(k\) and \(n\), this becomes infeasible.

Who needs to compare similarities of lots of large sets? The original application was actually in search engines, but I find that example a little abstruse. Instead, consider plagiarism detection. Systems like turnitin.com allow students to (be forced to) submit their work and have it compared against all other students' submissions and a large corpus of other works, such as papers in academic journals. Commercial plagiarism-detection systems compare each submitted document against many thousands of other documents; if each document is a set of a few tens of thousands of elements, there is no way they have the memory or processing power to compare all of them pairwise the naive way.

What we want is a short signature for each set that we can compare against another set's signature to estimate the similarity of the two sets. If this signature is a few hundred bytes long, millions of them can be stored in memory, and comparisons can be very cheap.

This is exactly what MinHash does. To compute the signatures of a collection of sets, we first generate \(h\) random hash functions, for some small \(h\) (typically hundreds). Then, to compute the signature of the \(i\)th set \(S_i\), for each hash function \(f_j\) we let \[m_j = \underset{x \in S_i}{\arg\min}\, f_j(x)\] That is, \(m_j\) is the element of \(S_i\) that minimizes that hash function. The signature is then the sequence \(m_1, m_2, \ldots, m_h\).
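
Here's a minimal sketch of what that might look like in code (my own illustration, not the project's implementation; make_hash_funcs is just one convenient choice for getting \(h\) distinct hash functions, here by salting blake2b with each function's index):

import hashlib

def make_hash_funcs(h):
    # One convenient way to get h distinct hash functions: salt blake2b
    # with the function's index and read the digest as an integer.
    def make(j):
        salt = j.to_bytes(8, "big")
        def f(x):
            digest = hashlib.blake2b(repr(x).encode(), salt=salt, digest_size=8)
            return int.from_bytes(digest.digest(), "big")
        return f
    return [make(j) for j in range(h)]

def minhash_signature(s, hash_funcs):
    # For each hash function, keep the element of s that minimizes it.
    return [min(s, key=f) for f in hash_funcs]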

To estimate the similarity of two sets based on their signatures \(m_1\) and \(m_2\), we take the fraction of positions on which they agree: \[\frac{1}{h}\left|\{i : m_{1, i} = m_{2, i}\}\right|\] Why does this somewhat alchemical formula work as expected? Formally, the similarity of the signatures of two sets is an unbiased estimator of their Jaccard similarity. That is, letting \(M(A, B)\) be the MinHash estimate of \(J(A, B)\), \(E(M(A, B)) = J(A, B)\).
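
In code, the estimate is just a matter of counting matching positions (again a sketch; it assumes both signatures were built with the same hash functions, in the same order):

def estimate_similarity(sig_a, sig_b):
    # Fraction of signature positions on which the two signatures agree.
    assert len(sig_a) == len(sig_b)
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)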

Here's my justification for this claim, which isn't quite a proof: \(M(A, B)\) is the average of \(h\) indicator variables, one per hash function, so \(E(M(A, B))\) is just the probability that a single random hash function has the same minimizing element in \(A\) and in \(B\). Assuming no collisions, this is equivalent to the probability that the element minimizing a random hash function over \(A \cup B\) lies in \(A \cap B\). For a truly random hash function, every element of \(A \cup B\) is equally likely to be that minimizer, so this probability is just \(\frac{|A \cap B|}{|A \cup B|}\), or \(J(A, B)\).
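
If that argument doesn't convince you, it's easy to check empirically. Using the sketches above on a pair of made-up sets whose exact Jaccard similarity is 0.5, the MinHash estimate should come out close to 0.5:

A = set(range(0, 3000))
B = set(range(1000, 4000))
funcs = make_hash_funcs(200)

exact = len(A & B) / len(A | B)    # 2000 / 4000 = 0.5
approx = estimate_similarity(minhash_signature(A, funcs),
                             minhash_signature(B, funcs))
print(exact, approx)               # approx should be close to 0.5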

One caveat, as you might have noticed, is that any real hash function sometimes gives collisions. We can hand-wave this away by saying that there are plenty of practical hash functions which have a very low probability of collisions.

There is a more subtle problem, though. It is actually very difficult even to specify, let alone generate, a truly random hash function. Note that each permutation of a set corresponds to a hash function on that set; just let each element's hash value be its position in the permutation. Thus the number of hash functions on a set is at least the number of permutations of the set. If the size of the domain of our random hash functions (i.e., the number of unique elements in our sets) is \(d\), there are \(d!\) possible permutations of the domain, and thus \(\log(d!)\) bits are required to specify one such permutation. By Stirling's approximation this is \(\Omega(d \log(d))\). If \(d\) is of order \(10 ^ 9\), which is perfectly possible for real applications, this is infeasible. Again, we can hand-wave this away: randomly picking a hash function from some suitably random-seeming subset of the possible permutations ought to be perfectly acceptable, even if that subset is relatively small. If we just specify a hash function with a 64-bit integer, there are about \(10 ^ {19}\) possible hash functions.
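
For example, one way (of many) to turn a 64-bit seed into a hash function is to use the seed to draw the parameters of a simple modular hash; this picks a function from a family that is tiny compared to \(d!\) but random-seeming enough in practice:

import random

def hash_func_from_seed(seed, prime=(1 << 61) - 1):
    # The 64-bit seed determines a and b, and hence one member of the
    # family x -> (a * hash(x) + b) mod p.
    rng = random.Random(seed)
    a = rng.randrange(1, prime)
    b = rng.randrange(prime)
    return lambda x: (a * hash(x) + b) % prime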

Applying MinHash to Plagiarism

On Caltech's Board of Control, which handles cases of academic dishonesty among the undergraduates, we're always struck by how bad many students are at cheating. Many of our cases involve wholesale copy-pasting from other students. As a TA, I can see why we get that impression: students' assignments are never systematically compared to each other, so only really egregious cases get spotted, and then only when the submissions are graded by the same person.

Unfortunately, the BoC's impression is probably subject to selection bias: there are plenty of good cheaters, but they rarely get caught. Moreover, there are probably lots of bad cheaters who get lucky and still aren't caught. This shouldn't be hard to prevent. You can probably tell where I'm going with this...

I decided to create a simple, easy-to-use plagiarism detection system for plain text, with the idea that all assignments for a class can be compared against each other to find similar pairs. In my next post, I'll describe the design and implementation of the system.