Well, there goes the neighborhood…
Most clustering algorithms are frustratingly non-local, and what is frustrating at small scale becomes intractable at large scale. Limiting your scope to a neighborhood of items usually requires heuristics that are clustering algorithms in their own right. (Yo dawg, I put some clustering in your clustering.) Any algorithm that requires a notion of pairwise similarity at best requires fetching many items from your data store, and at worst requires $O(n^2)$ time and space.
Wouldn’t it be nice if you could look at an item once, and determine its cluster immediately without consulting any other data? Wouldn’t it be nice if the clusters were stable between runs, so that the existence of one item would never change the cluster of another? Simhashing does exactly this. There are many approaches to simhashing; in this document I’m going to talk only about my favorite. It’s simple to implement, mathematically elegant, works on anything with many binary features, and produces usable results. It’s also simple to analyze, so don’t let the notation scare you off.
Comparing Two Sets
Suppose you have two sets, $A$ and $B$, and you would like to know how similar they are. First you might ask, how big is their intersection?
$$|A \cap B|$$
That’s nice, but isn’t comparable across different sizes of sets, so let’s normalize it by the size of their union.
$$\frac{|A \cap B|}{|A \cup B|}$$
This is called the Jaccard Index, and is a common measure of set similarity. It has the nice property of being 0 when the sets are disjoint, and 1 when they are identical.
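As a quick illustration, here is a minimal Python sketch of the Jaccard index. The function name `jaccard` and the convention that two empty sets count as identical are my own assumptions for the example, not something the definition dictates.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index: size of the intersection over size of the union."""
    union = a | b
    if not union:
        # Assumption made for this sketch: two empty sets count as identical.
        return 1.0
    return len(a & b) / len(union)


a = {"the", "quick", "brown", "fox"}
b = {"the", "lazy", "brown", "dog"}

# The sets share 2 members ("the", "brown") out of 6 distinct members overall.
print(jaccard(a, b))
print(jaccard(a, a))          # identical sets
print(jaccard({"x"}, {"y"}))  # disjoint sets
```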
Hashing and Sorting
Suppose you have a uniform pseudo-random hash function $h$ from elements in your set to a fixed range, say $[0, 1]$. For simplicity, assume that the output of $h$ is unique for each input. I’ll use $h(A)$ to denote the set of hashes produced by applying $h$ to each element of $A$, i.e. $h(A) = \{h(x) : x \in A\}$.
Consider $\min(h(A))$. When you insert and delete elements from $A$, how often does $\min(h(A))$ change?
If you delete $x$ from $A$ then $\min(h(A))$ will only change if $h(x) = \min(h(A))$. Since any element has an equal chance of having the minimum hash value, the probability of this is $\frac{1}{|A|}$.
If you insert $x$ into $A$ then $\min(h(A))$ will only change if $h(x) < \min(h(A))$. Again, since any element has an equal chance of having the minimum hash value, the probability of this is $\frac{1}{|A| + 1}$.
For our purposes, this means that $\min(h(A))$ is useful as a stable description of $A$.
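To make the deletion case concrete, here is a small sketch. The hash `h` below is a stand-in I built from SHA-256 and scaled into [0, 1); it is an assumption for the demo, and any reasonably uniform hash should behave the same way. The code tries every possible single deletion from a 100-element set and counts how many of them move the minimum.

```python
import hashlib


def h(x: str) -> float:
    """Illustrative stand-in hash: map a string to a float in [0, 1)."""
    digest = hashlib.sha256(x.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def min_hash(s: set) -> float:
    """Smallest hash value over all members of the set."""
    return min(h(x) for x in s)


A = {f"item{i}" for i in range(100)}
before = min_hash(A)

# Delete each element in turn and check whether the minimum moved.
changed = sum(1 for x in A if min_hash(A - {x}) != before)
print(changed, "of", len(A), "deletions changed the minimum")
```

Only the deletion of the element that holds the minimum can change it, so a randomly chosen deletion disturbs the description with probability one over the size of the set.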
Probability of a Match
What is the probability that $\min(h(A)) = \min(h(B))$?
If an element produces the minimum hash in both sets on their own, it also produces the minimum hash in their union.
$\min(h(A)) = \min(h(B))$ if and only if $\min(h(A \cup B)) = \min(h(A)) = \min(h(B))$. Let $x$ be the member of $A \cup B$ that produces the minimum hash value. The probability that $A$ and $B$ share the minimum hash is equivalent to the probability that $x$ is in both $A$ and $B$. Since any element of $A \cup B$ has an equal chance of having the minimum hash value, this becomes
$$\frac{|A \cap B|}{|A \cup B|}$$
Look familiar? Presto, we now have a simhash.
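If you would rather see this than take it on faith, a rough Monte Carlo check is easy to write. The sketch below fakes many independent hash functions by salting SHA-256 with a seed (my assumption for the demo, not a requirement of the technique) and counts how often two overlapping sets agree on their minimum hash.

```python
import hashlib


def h(x: str, seed: int) -> int:
    """Illustrative salted hash: a different seed acts as a different hash function."""
    data = f"{seed}:{x}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def min_hash(s: set, seed: int) -> int:
    """Smallest hash value in the set under the hash function picked by seed."""
    return min(h(x, seed) for x in s)


A = {f"w{i}" for i in range(0, 60)}
B = {f"w{i}" for i in range(30, 90)}
jaccard = len(A & B) / len(A | B)  # 30 shared members out of 90 total

trials = 2000
matches = sum(min_hash(A, seed) == min_hash(B, seed) for seed in range(trials))

# The observed match rate should land close to the Jaccard index.
print("jaccard:   ", jaccard)
print("match rate:", matches / trials)
```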
Tuning for Precision
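This may be too generous for your purposes, but it is easy to make it more restrictive. One approach is to repeat the whole process with $n$ independent hash functions, and concatenate the results. This makes the probability of a match
$$\left(\frac{|A \cap B|}{|A \cup B|}\right)^n$$
I prefer an alternate approach. Use only one hash function, but instead of selecting only the minimum value as the simhash, select the least $n$ values. The probability of a match then becomes
$$\prod_{i=0}^{n-1} \frac{|A \cap B| - i}{|A \cup B| - i}$$
and if $|A \cap B| \gg n$,
$$\prod_{i=0}^{n-1} \frac{|A \cap B| - i}{|A \cup B| - i} \approx \left(\frac{|A \cap B|}{|A \cup B|}\right)^n$$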
The advantage of this over independent hash functions is that it sets a minimum on the number of members that the two sets must share in order to match. This mitigates the effect of extremely common set members on your clusters. With several independent hash functions, a very common set member that produces low values in a small number of hash functions can cause a huge blowup of the resulting clusters. Selecting from a single hash function ensures that it can only affect one term. It is for this reason that many simhash implementations unrelated to this one take into account the global frequency of each feature, but this complicates their implementation.
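Here is a small sketch comparing the two tuning approaches numerically. The function names are mine, and the second function assumes that a match requires the smallest few hashes of the union to all come from the intersection, which is how I read the single-hash scheme; treat it as an illustration rather than a reference implementation.

```python
def match_prob_independent(shared: int, total: int, n: int) -> float:
    """Match probability with n independent hash functions: Jaccard to the n-th power."""
    return (shared / total) ** n


def match_prob_least_n(shared: int, total: int, n: int) -> float:
    """Match probability when keeping the n least values of one hash function.

    Assumes n <= total. The n smallest hashes of the union must all belong
    to shared members, drawn without replacement.
    """
    p = 1.0
    for i in range(n):
        p *= max(shared - i, 0) / (total - i)
    return p


# Two shared members cannot fill three slots, so the single-hash scheme never matches.
print(match_prob_independent(2, 10, 3))
print(match_prob_least_n(2, 10, 3))

# With a large overlap the two schemes are nearly identical.
print(match_prob_independent(500, 1000, 3))
print(match_prob_least_n(500, 1000, 3))
```

The first pair of numbers shows the floor on shared members: the independent scheme still gives a small nonzero chance of a match, while the single-hash scheme gives exactly zero.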
Turning Anything Into a Set
This algorithm works on a set, but the things we’d like to cluster usually aren’t sets. Mapping from one to the other is straightforward if each item has many binary features, but can require some experimentation to get good results. If your items are text documents, you can produce a set using a sliding window of n-grams. I’ve found 3-grams to work well on lyrics, but YMMV. Since there’s no order to the members of the set, it’s important to make them long enough to preserve some of the local structure of the thing you’d like to cluster.
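For text, one way to build the set is a sliding window over words. The sketch below is only one possible reading of "n-grams": it works at the word level, lowercases everything, and splits on whitespace, all of which are my assumptions. Character-level windows or a smarter tokenizer may suit your data better.

```python
def word_ngrams(text: str, n: int = 3) -> set:
    """Return the set of word n-grams produced by sliding a window over the text."""
    words = text.lower().split()
    # Texts shorter than n words produce an empty set.
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


lyrics = "Row row row your boat gently down the stream"
features = word_ngrams(lyrics, 3)

for gram in sorted(features):
    print(gram)
```

Nine words yield seven overlapping 3-grams here, and each one keeps a little of the local word order that a bag of single words would throw away.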
    int SimHash(Item item, int restrictiveness)
        Set set = SplitItemToSet(item)
        PriorityQueue queue
        for x in set
            queue.Insert(Hash(x))
        simhash = 0
        for x in [0 : restrictiveness]
            simhash ^= queue.PopMin()
        return simhash
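For anyone who wants something runnable, here is a minimal Python sketch of the same idea. It takes a set of string features that has already been split out of the item, uses the first 64 bits of SHA-256 as the hash (my choice for the example; the pseudocode does not name one), and uses `heapq.nsmallest` in place of the priority queue. Unlike the pseudocode, it quietly uses every hash when the set has fewer members than `restrictiveness`.

```python
import hashlib
import heapq


def hash64(x: str) -> int:
    """Illustrative 64-bit hash taken from the first 8 bytes of SHA-256."""
    return int.from_bytes(hashlib.sha256(x.encode("utf-8")).digest()[:8], "big")


def simhash(features, restrictiveness: int) -> int:
    """XOR together the `restrictiveness` smallest hashes of the feature set."""
    smallest = heapq.nsmallest(restrictiveness, {hash64(x) for x in features})
    result = 0
    for value in smallest:
        result ^= value
    return result


features = {f"feature{i}" for i in range(50)}
original = simhash(features, 3)

# Dropping the member with the largest hash leaves the three smallest untouched,
# so the simhash stays the same.
largest = max(features, key=hash64)
print(original == simhash(features - {largest}, 3))
```

Items that produce the same value land in the same cluster, so the result can be used directly as a grouping key in a dictionary or a database index.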
This specific technique is often referred to as “Min Hashing” in the literature, so that’s a good query to start looking for specific applications to your problem. It is a member of a general class of techniques called “Locality Sensitive Hashing,” often abbreviated as LSH. Google has patented an application of another simhashing technique that is generally unrelated to this one. The paper that introduces it is also available, as is Google’s paper on their implementation.