## Well, there goes the neighborhood…

Most clustering algorithms are frustratingly non-local, and what is frustrating at small scale becomes intractable at large scale. Limiting your scope to a neighborhood of items usually requires heuristics that are clustering algorithms in their own right. (Yo dawg, I put some clustering in your clustering.) Any algorithm that requires a notion of pairwise similarity at best requires fetching many items from your data store, and at worst requires $O(n^2)$ time and space.

Wouldn’t it be nice if you could look at an item once, and determine its cluster immediately without consulting any other data? Wouldn’t it be nice if the clusters were stable between runs, so that the existence of one item would never change the cluster of another? Simhashing does exactly this. There are many approaches to simhashing; in this document I’m going to talk only about my favorite. It’s simple to implement, mathematically elegant, works on anything with many binary features, and produces usable results. It’s also simple to analyze, so don’t let the notation scare you off.

## Comparing Two Sets

Suppose you have two sets, $A$ and $B$, and you would like to know how similar they are. First you might ask, how big is their intersection?

$$|A \cap B|$$

That’s nice, but it isn’t comparable across different sizes of sets, so let’s normalize it by the size of the union of the two sets.

$$\frac{|A \cap B|}{|A \cup B|}$$

This is called the Jaccard Index, and is a common measure of set similarity. It has the nice property of being 0 when the sets are disjoint, and 1 when they are identical.
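As a quick sanity check, the Jaccard Index is a one-liner over Python sets (the function name `jaccard` is mine, not from the text above):

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard Index: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0  # convention: two empty sets are identical
    return len(a & b) / len(a | b)

print(jaccard({1, 2, 3}, {2, 3, 4}))  # 2 shared / 4 total = 0.5
print(jaccard({1, 2}, {3, 4}))        # disjoint -> 0.0
```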

## Hashing and Sorting

Suppose you have a uniform pseudo-random hash function $h$ from elements in your set to the range $[0, 1]$. For simplicity, assume that the output of $h$ is unique for each input. I’ll use $h(A)$ to denote the set of hashes produced by applying $h$ to each element of $A$, i.e. $h(A) = \{h(x) : x \in A\}$.

Consider $\min(h(A))$. When you insert and delete elements from $A$, how often does $\min(h(A))$ change?

If you delete $x$ from $A$ then $\min(h(A))$ will only change if $h(x) = \min(h(A))$. Since any element has an equal chance of having the minimum hash value, the probability of this is $\frac{1}{|A|}$.

If you insert $x$ into $A$ then $\min(h(A))$ will only change if $h(x) < \min(h(A))$. Again, since any element has an equal chance of having the minimum hash value, the probability of this is $\frac{1}{|A| + 1}$.

For our purposes, this means that $\min(h(A))$ is useful as a stable description of $A$.
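The deletion case is easy to check empirically. In the sketch below the hash function `h` is a stand-in built from a seeded RNG (not anything from the text): deleting a uniformly random element from a set of size $n$ should change the minimum hash about $1/n$ of the time.

```python
import random

def h(x, seed) -> float:
    """Stand-in uniform hash into [0, 1): deterministic for a given (seed, x)."""
    return random.Random(f"{seed}:{x}").random()

n, trials, changed = 10, 20000, 0
for t in range(trials):
    a = set(range(n))
    before = min(h(x, t) for x in a)
    a.discard(random.randrange(n))  # delete a uniformly random element
    after = min(h(x, t) for x in a)
    changed += (after != before)

print(changed / trials)  # should hover near 1/n = 0.1
```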

## Probability of a Match

What is the probability that $\min(h(A)) = \min(h(B))$?

If an element produces the minimum hash in both sets on their own, it also produces the minimum hash in their union.

$\min(h(A)) = \min(h(B))$ if and only if $\min(h(A)) = \min(h(B)) = \min(h(A \cup B))$. Let $x$ be the member of $A \cup B$ that produces the minimum hash value. The probability that $A$ and $B$ share the minimum hash is equivalent to the probability that $x$ is in both $A$ and $B$. Since any element of $A \cup B$ has an equal chance of having the minimum hash value, this becomes

$$P\big(\min(h(A)) = \min(h(B))\big) = \frac{|A \cap B|}{|A \cup B|}$$

Look familiar? Presto, we now have a simhash.
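This is also easy to verify by simulation. The sketch below (the hash function is again improvised from a seeded RNG) estimates the match probability for two sets with $|A \cap B| = 2$ and $|A \cup B| = 6$, which should come out near $2/6$:

```python
import random

def h(x, seed) -> float:
    """Stand-in uniform hash into [0, 1): deterministic for a given (seed, x)."""
    return random.Random(f"{seed}:{x}").random()

a = {"a", "b", "c", "d"}
b = {"c", "d", "e", "f"}  # |A ∩ B| = 2, |A ∪ B| = 6

trials = 20000
matches = sum(
    min(h(x, t) for x in a) == min(h(x, t) for x in b)
    for t in range(trials)
)
print(matches / trials)  # should hover near 2/6 ≈ 0.333
```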

## Tuning for Precision

This may be too generous for your purposes, but it is easy to make it more restrictive. One approach is to repeat the whole process with $k$ independent hash functions, and concatenate the results. This makes the probability of a match

$$\left(\frac{|A \cap B|}{|A \cup B|}\right)^k$$

I prefer an alternate approach. Use only one hash function, but instead of selecting only the minimum value as the simhash, select the $k$ least values. The probability of a match then becomes

$$\frac{\binom{|A \cap B|}{k}}{\binom{|A \cup B|}{k}}$$

and if $k > |A \cap B|$, $0$.

The advantage of this over independent hash functions is that it sets a minimum on the number of members that the two sets must share in order to match. This mitigates the effect of extremely common set members on your clusters. With several independent hash functions, a very common set member that produces low values in a small number of hash functions can cause a huge blowup of the resulting clusters. Selecting from a single hash function ensures that it can only affect one term. It is for this reason that many simhash implementations unrelated to this one take into account the global frequency of each feature, but this complicates their implementation.
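The bottom-$k$ match probability is easy to tabulate with `math.comb`; the helper name below is mine:

```python
from math import comb

def bottom_k_match_prob(intersection: int, union: int, k: int) -> float:
    """P(k smallest hashes of A == k smallest hashes of B):
    all k minima of A ∪ B must fall inside A ∩ B."""
    if k > intersection:
        return 0.0  # can't share k members they don't have
    return comb(intersection, k) / comb(union, k)

print(bottom_k_match_prob(2, 6, 1))  # 2/6 ≈ 0.333 — the Jaccard Index
print(bottom_k_match_prob(2, 6, 2))  # 1/15 ≈ 0.0667 — much more restrictive
print(bottom_k_match_prob(2, 6, 3))  # 0.0 — fewer than k shared members
```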

## Turning Anything Into a Set

This algorithm works on a set, but the things we’d like to cluster usually aren’t sets. Mapping from one to the other is straightforward if each item has many binary features, but can require some experimentation to get good results. If your items are text documents, you can produce a set using a sliding window of n-grams. I’ve found 3-grams to work well on lyrics, but YMMV. Since there’s no order to the members of the set, it’s important to make them long enough to preserve some of the local structure of the thing you’d like to cluster.
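A sliding window over words might look like this (the function name and the choice of word-level n-grams are mine; character n-grams work the same way):

```python
def shingles(text: str, n: int = 3) -> set:
    """Set of word n-grams from a sliding window over the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

print(sorted(shingles("Hello darkness my old friend")))
# ['darkness my old', 'hello darkness my', 'my old friend']
```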

## Pseudocode

```
int SimHash(Item item, int restrictiveness)
    Set set = SplitItemToSet(item)
    PriorityQueue queue
    for x in set
        queue.Insert(Hash(x))
    simhash = 0
    for x in [0 : restrictiveness]
        simhash ^= queue.PopMin()
    return simhash
```
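A direct Python translation of the pseudocode, sketched with its assumptions made explicit: `heapq.nsmallest` plays the role of the priority queue, and MD5 stands in for `Hash` because Python's built-in `hash()` is salted per process and would not be stable between runs.

```python
import hashlib
import heapq

def stable_hash(feature: str) -> int:
    """Deterministic 64-bit hash of a feature."""
    return int.from_bytes(hashlib.md5(feature.encode()).digest()[:8], "big")

def simhash(features: set, restrictiveness: int = 3) -> int:
    """XOR of the `restrictiveness` smallest feature hashes."""
    result = 0
    for h in heapq.nsmallest(restrictiveness, map(stable_hash, features)):
        result ^= h
    return result
```

Because the simhash depends only on the item's own features, identical feature sets always produce the same value, and the order in which features are seen is irrelevant.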

## Further Reading

This specific technique is often referred to as “Min Hashing” in the literature, so that’s a good query to start looking for specific applications to your problem. It is a member of a general class of techniques called “Locality Sensitive Hashing,” often abbreviated as LSH. Google has patented an application of another simhashing technique that is generally unrelated to this one. The paper that introduces it is also available, as is Google’s paper on their implementation.

How to cluster? — Nice writeup for simhashing, but once I have this how do I cluster? Do I:

1) randomly split my input into k sets
2) run over all the elements and find the “correct” set to put them into
3) recompute the sets based on step 2)
4) …
5) Profit

If not, please explain how to do the actual clustering given your simhash.

Once each item has a simhash, group everything with the same simhash into a cluster. This is ideally suited for something like mapreduce, but any key-value store will do.
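With simhashes in hand, the grouping step is just a dictionary of lists (the simhash values in the example below are made up for illustration):

```python
from collections import defaultdict

def cluster(hashed_items):
    """Group (simhash, item) pairs into clusters; no pairwise comparisons."""
    clusters = defaultdict(list)
    for sh, item in hashed_items:
        clusters[sh].append(item)
    return dict(clusters)

pairs = [(0x3f, "doc a"), (0x3f, "doc a, lightly edited"), (0x91, "doc b")]
print(cluster(pairs))
# {63: ['doc a', 'doc a, lightly edited'], 145: ['doc b']}
```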

Pretty Cool — Lately I’ve been spending a lot of time debugging performance problems with a certain open source MapReduce implementation, so it was refreshing to read this. And of course, anything that references “Yo dawg” is just a bonus.

Dear Ryan – We are very pleased to announce that this Knol is the Gold badge winner for English Knols created in October, 2009. Congratulations. You may view your award at http://knol.google.com/k/peter-baskerville/top-pick-best-knols-of-the-month/14j3i4hyjvi88/60#. Top writers like you may benefit from participation in the ‘Google Knol LinkedIn Group’, located at http://www.linkedin.com/groups?gid=2185205&trk=hb_side_g. Please consider joining with us to add your point of view. Knol is listening! Great work, keep it up, Murry Shohat and Peter Baskerville

