Clustering with a Key-Value Store

Let’s say you have a dataset you’d like to cluster. Let’s say you don’t want to write more than 5 lines of code. Let’s say that your only tool is a key-value store. (Why might you be in this position? Perhaps your dataset is really really (really) big and only simple things will scale. Maybe it’s in fact INFINITE and you’re clustering a stream. Maybe MapReduce is just a really big hammer. 🔨👷 Why you’d only want to write 5 lines of code is left as an exercise to the reader.) 

At any rate, you would like to make clusters out of your data, but you only get to look at each item once in isolation. After looking at it you have to decide what cluster it should go to, at that moment, without looking at any other information, or any other items in your dataset. You only get one shot, do not throw it away! How can we accomplish this?

Ideally we want a magic function, key(X) where key(X) = key(Y) if and only if X and Y should be in the same cluster. We don’t have such a function! (sorry) But we do have something pretty close.

Let’s say you have a function H(X) that computes a hash of your item, and your hash has the following property, \Pr\left[H(X) = H(Y)\right] = sim(X, Y) for some similarity measure sim. H(X) is called a “locality sensitive hash” for sim. This is pretty close to we want! Things that are similar to each other will have a high probability of sharing a key, and things that are dissimilar to each other have a low probability of sharing a key. Later we’ll talk about how to make this behave a bit more like our magic function, but for now, lets talk about how to build this one.

// C++ code implementing each algorithm will be in a block like this one
// at the end of each section.

Building a Simple H(X): MinHash

Suppose you have two sets, X and Y, and you would like to know how similar they are. First you might ask, how big is their intersection?

\displaystyle |X\cap Y|

That’s nice, but isn’t comparable across different sizes of sets, so let’s normalize it by the union of the two sizes.

\displaystyle\text{J}(X,Y) = \frac{|X\cap Y|}{|X\cup Y|}

This is called the Jaccard Index, and is a common measure of set similarity. It has the nice property of being 0 when the sets are disjoint, and 1 when they are identical.

Suppose you have a uniform pseudo-random hash function h from elements in your set to the range [0,1]. For simplicity, assume that the output of h is unique for each input. I’ll use h(X) to denote the set of hashes produced by applying h to each element of X, i.e. \{h(X_1),h(X_2),h(X_3),...,h(X_n)\}.

Consider \min(h(X)). When you insert and delete elements from X, how often does \min(h(X)) change?

If you delete i from X then \min(h(X)) will only change if h(i)=\min(h(X)). Since any element has an equal chance of having the minimum hash value, the probability of this is \frac{1}{|X|}.

If you insert i into X  then \min(h(X)) will only change if h(i)<\min(h(X)). Again, since any element has an equal chance of having the minimum hash value, the probability of this is \frac{1}{|X|+1}.

For our purposes, this means that \min(h(X)) is useful as a stable description of X.

What is the probability that \min(h(X))=\min(h(Y))?

If an element produces the minimum hash in both sets on their own, it also produces the minimum hash in their union.

\min(h(X))=\min(h(Y)) if and only if \min(h(X\cup Y))=\min(h(X))=\min(h(Y)). Let i be the member of X\cup Y that produces the minimum hash value. The probability that X and Y share the minimum hash is equivalent to the probability that i is in both X and Y. Since any element of X\cup Y has an equal chance of having the minimum hash value, this becomes

\displaystyle \frac{|X\cap Y|}{|X\cup Y|}

Look familiar? Presto, we now have a Locality Sensitive Hash for the Jaccard Index.

\displaystyle H(X) = \min(h(X))

\displaystyle \Pr\left[H(X) = H(Y)\right]  = \frac{|X\cap Y|}{|X\cup Y|}

unsigned long long int MinHash(T X) {
  unsigned long long int min_hash = ULLONG_MAX;
  for (auto x : X) {
    min_hash = min(min_hash, Hash(x));
  return min_hash;

Tuning Precision and Recall with Combinatorics

Now that we have a locality sensitive hash, we can use combinatorics to build something that looks a bit more like our magic function. We can concatenate (or sum) hashes to perform an “AND” operation. Let H^A(X) = H(X, 1) + H(X, 2) + \ldots + H(X, A), i.e. concatenating A independent hashes. The probability that H^A(X) = H^A(Y) is then sim(X,Y)^A.

We can then output multiple hashes to perform an “OR” operation. If we output O independent hashes, then the probability that at least one of those hashes is the same for two items is 1 - (1 - sim(X,Y))^O.

Using these two tools, we can apply a “sigmoid” function to our similarities, outputting O independent copies of A concatenated hashes, the probability that two items will share at least one key is 1 - (1 - sim(X,Y)^A)^O. (You can think of this as the probability that two items will “meet each other” at least once in the course of your computation.)

The probability of at least one match with 6 “Ands” and 5 “ORs” as a function of similarity.

Now we can be pretty sure that things that are similar to each other will share at least one key, and things that aren’t won’t. We can increase the sharpness of the sigmoid as much as we want by spending more storage and CPU to increase A and O.

Great! Clustering with a key-value store! Now let’s talk about ways to improve H(X).

unsigned long long int MinHash(T X, int s) {
  unsigned long long int min_hash = ULLONG_MAX;
  for (auto x : X) {
    // Note that the hash function must now accept a seed.
     min_hash = min(min_hash, Hash(x, s));
  return min_hash;

void EmitWithKeys(T X, int ands, int ors) {
  for (int o = 0; o < ors; o++) {
    unsigned long long int key = 0;
    for (int a = 0; a < ands; a++) {
      // we assume that a large int is enough keyspace that we
      // can get away with adding instead of concatenating.
      key += MinHash(X, a + o * ands);
    Emit(key, X);

Integer Weights

This algorithm works on a set, but the things we’d like to cluster usually aren’t sets. For instance, terms in a document (even long n-grams) can occur multiple times and ideally we don’t just want to just discard counts of how often they occur. How can we fix this?

Easy! Just hash everything multiple times. If an element occurs n times in your set, hash it n independent times, insert each of them, and then take the minimum of this expanded set as your hash just like before.

We expand our hash function to accept both an item and an integer as its argument, h(i, j). And if item i occurs n times, then insert h(i, 1), h(i, 2), \ldots, h(i, n). Now think about what this means for the intersection and the union of these sets. If the count of i in object X is n, and the count of i in object Y is m, then the intersection of the two sets of hashes has \min(n, m) items, and the union has \max(n, m) items. You can imagine stacking these hashes on top of each other to form a histogram.


And then when you perform the intersection and union operations, these translate into performing a min and max across the values of each item.


So now that we’re working with vectors instead of sets, if you interpret the Jaccard Index on this expanded set in terms of a weighted vector, it turns into the “Weighted Jaccard Index.”

\displaystyle \text{J}_\mathcal{W}(x,y)= \frac{\sum_i \min(x_i, y_i)}{\sum_i \max(x_i, y_i)}

unsigned long long int IntegerWeightMinHash(map<T, int> X, int s) {
  unsigned long long int min_hash = ULLONG_MAX;
  for (auto x : X) {
    for (int i = 0; i < x.second; ++i) {
      min_hash = min(min_hash, Hash(x.first, i));
  return min_hash;

If you’d like to have \frac{\sum_i \min(x_i, y_i)}{\sum_i \max(x_i, y_i)} as your match probability but with real weights instead of integer weights, there are two good algorithms to do so, one for dense data, and the other for sparse data (like the original MinHash). They are complicated enough that I’m not going to talk about them more here, but not so complicated that you’ll have trouble implementing them from the algorithm definitions in the papers.

Real Weights and Probability Distributions

Instead of going deep into the algorithms for \text{J}_\mathcal{W} let’s do something simpler. We’ll go back to the first MinHash technique we discussed, and start tweaking it. We started with uniformly distributed hashes, but now we want to bias them in some way. We want items with higher weight to tend to have smaller hashes, so that they’re more likely to be the minimum. Ideally we’d like the probability that they are the minimum to be exactly proportional to their weight. It turns out that this is really easy to do. If h(i) is uniform between 0 and 1, and x_i is the weight of i then we transform each hash as follows. - \frac{\log(h(i))}{x_i} Then we take the item with the minimum hash as our representative just like we did before.

\displaystyle H(x) = \text{argmin}_i \frac{-\log h(i)}{x_i}

The probability that the hashes collide is the “probability jaccard index” which generalizes the Jaccard index to probability distributions. (For a derivation, see our paper on minhashing probability distributions.)

\displaystyle \Pr\left[H(x) = H(y)\right] = \sum_{i} \frac{1}{\sum_{j} \max\left(\frac{x_j}{x_i}, \frac{y_j}{y_i}\right)} = \text{J}_\mathcal{P}(x,y)

This is a strange looking formula, what is it? And why is it the Jaccard Index of probability distributions? We’ll get to that in the next section. First, let’s figure out why this particular transformation works. It has to do with the beautiful things exponential random variables do when you sort them.

A variable X is exponentially distributed if \Pr\left[X<y\right] = 1 - e ^{-\lambda y}. It has one parameter, \lambda called its “rate.”

The distribution of the minimum of two random variables is easy to derive from their distributions. \Pr\left[\min(X_1,X_2) < y \right] = \Pr\left[X_1 > y\right]\Pr\left[X_2 >y \right] If X_1 and X_2 are exponential with rates \lambda_1, \lambda_2 then this simplifies to \Pr\left[\min(X_1,X_2) < y \right] = 1 - e ^{-(\lambda_1 + \lambda_2) y}, another exponential! In addition, the probability that X_1 = \min(X_1, X_2) is \frac{\lambda_1}{\lambda_1 + \lambda_2}. Proof:

\displaystyle\Pr\left[X_1 = \min(X_1, X_2)\right]

\displaystyle\int_{y=0}^\infty \Pr\left[X_1 \in (y, y+dy)\right]\Pr\left[X_2>y\right] dy

\displaystyle\int_{y=0}^\infty\Pr\left[X_1 \in (y, y+dy)\right]\left(1 -\Pr\left[X_2<y\right] \right) dy

\displaystyle\int_{y=0}^\infty \lambda_1 e ^{-\lambda_1 y} e ^{-\lambda_2 y} dy

\displaystyle\lambda_1 \int_{y=0}^\infty e ^{-(\lambda_1 + \lambda_2) y}  dy

\displaystyle\frac{\lambda_1}{\lambda_1 + \lambda_2}

This gives us a new way of sampling from a distribution. If I have a probability distribution with parameters x_1, \ldots, x_n, I can generate a set of independent exponentials with rates x_1, \ldots, x_n, and then the index with the minimum exponential is a sample from the original distribution.

So, back to minhashing. Instead of generating exponential random variables, now we want to generate “exponentially distributed hashes.” To do that, all you have to do is take a uniformly distributed hash from 0 to 1, invert the exponential CDF, and apply it.

Then if we take the minimum of all these exponentially distributed hashes, we have a sample from the distribution that is stable under small changes, just like the original MinHash!

T ProbabilityMinHash(map<T, float> X, int s) {
  pair<double, T> min_hash;
  min_hash.first = std::numeric_limits::infinity;
  double maximum_hash = ULLONG_MAX;
  for (auto x : X) {
    double zero_one_hash = Hash(x.first) / maximum_hash;
    pair<double, T> exponential_hash(-log(zero_one_hash) / x.second, x.first);
    min_hash = min(exponential_hash, min_hash);
  return min_hash.second;

Understanding the Probability Jaccard Index

How do you compute the “union” of two probability distributions? How do you compute an “intersection?” One proposal might be to use the \min and \max as we did for the weighted Jaccard index above. This worked for “weighted sets” where the mapping from set to vector is to give every element a weight of either 0 or 1. What happens if you convert these sets to probability distributions instead? In that case we need it to sum to 1, so rather than giving each element a 0 or a 1, we’ll give each element either a 0 or a 1/|X|. When we do that, the weighted Jaccard index no longer generalizes the Jaccard index, in fact, it’s almost always less. If X and Y are sets, with |X| > |Y|, and x and y the corresponding uniform probability distributions, then \text{J}(X,Y) \neq \text{J}_\mathcal{W}(x,y). Instead, for |X| > |Y|,

\displaystyle\text{J}_\mathcal{W}(x,y) = \frac{|X\cap Y|}{|X\setminus Y| + |X|} < \text{J}(X,Y)

\text{J}_\mathcal{P} can be thought of as rescaling the two distributions before computing each term of \text{J}_\mathcal{W} so that the \min in the numerator has no effect.

How can we fix this? We’d like to normalize the distributions in some way so that this no longer happens. Our goal is to make it so that uniform distributions have the same likelihood of colliding as when we applied MinHash to the set that they’re distributed over. Let’s generalize \text{J}_\mathcal{W}(x,y) to allow the distributions to be rescaled before we compute each term.

\displaystyle\text{J}_\mathcal{W}(x,y,\alpha) = \sum_i \frac{\min(\alpha_i x_i, y_i)}{\sum_j\max(\alpha_i x_j, y_j)}

Now, since \text{J}_\mathcal{W} is too low, let’s choose the vector \alpha to maximize the collision probability we get out. If \alpha_i x_i > y_i, increasing \alpha_i raises only the denominator. If \alpha_i x_i < y_i, increasing \alpha_i raises the numerator proportionally more than the denominator. So the optimal \alpha sets \alpha_i x_i = y_i.

\displaystyle\sum_i \frac{\min(y_i, y_i)}{\sum_j\max(\frac{y_i}{x_i} x_j, y_j)} = \sum_i\frac{1}{\sum_{j} \max\left(\frac{x_j}{x_i}, \frac{y_j}{y_i}\right)} = \text{J}_\mathcal{P}(x,y)

2D-simplexsvgYou can derive another representation using the algorithm definition. A set of exponential random variables, when normalized to sum to 1, is a uniformly distributed sample from the unit-simplex. (A unit-simplex is an equilateral triangle in n-dimensions.) In addition, because a unit-simplex is just the set of non-negative vectors that sum to 1, every point on the unit simplex is a probability distribution.

From this, we can represent \text{J}_\mathcal{P} geometrically.

Take the unit simplex and mark the point on the simplex that corresponds to the distribution you would like to hash. Connect that point to each corner of the simplex with an edge. These edges you’ve drawn split the simplex into smaller simplexes, where each one has area proportional to the weight of one of the elements of your distribution.

With the distribution represented this way, you can imagine sampling from it by throwing darts at the simplex. The id of whichever region the dart lands in is a sample from your distribution. The probability MinHash does exactly this, except that the dart remains in the same spot for each distribution you sample from, and the boundaries of the regions shift around it.

Split a the unit simplex (n-dimensional triangle) into sub-simplices proportional to the weights of each distribution. Fix a random point in it. The sub-region your point lands in is your hash of the original distribution. This lets us visualize \text{J}_\mathcal{P} geometrically as an intersection of simplexes.

So now, we’re able to visualize \text{J}_\mathcal{P} on this simplex. When you overlay two distributions on top of each other, and intersect the simplexes of the same color, the area that remains is equal to \text{J}_\mathcal{P}.

This representation also lets us demonstrate geometrically the most interesting fact about \text{J}_\mathcal{P} and the strongest reason to call it the Jaccard Index of Probability Distributions.

Theorem: \text{J}_\mathcal{P} is maximal.
For any sampling method G, if \Pr\left[G(x) = G(y)\right] > \text{J}_\mathcal{P}(x,y), then for some z where \text{J}_\mathcal{P}(x,z)>\text{J}_\mathcal{P}(x,y) and \text{J}_\mathcal{P}(y,z)>\text{J}_\mathcal{P}(x,y), either \Pr\left[G(x) = G(z)\right] < \text{J}_\mathcal{P}(x,z) or \Pr\left[G(y) = G(z)\right] < \text{J}_\mathcal{P}(y,z).

In other words, no method of sampling from discrete distributions can have collision probabilities that are greater than \text{J}_\mathcal{P} everywhere. If you try to get higher recall on one pair than this algorithm provides, you have to give up recall on at least one “good” pair, a pair that is more similar than the pair you are trying to improve.

This theorem is true of \text{J} and \text{J}_\mathcal{P} but not \text{J}_\mathcal{W}! The analogous z for the original MinHash is Z = X \cup Y. The original MinHash, and the probability MinHash both make X,Y,Z collide as much as possible.

The proof of this in the general case is complicated, but the simplex representation allows us to prove it purely visually on three element distributions.

A visual proof on 3 element distributions. The green regions of x and y could be shifted to overlap more, but doing so would remove collisions between both of them and z^\text{green}.

We form a new distribution out of the intersection of the green simplexes. This makes a distribution that is in-between x and y but collides with both of them as much as possible on every possible outcome. For x and y to collide more on the green element, they have to give up collisions with z^\text{green} on either red or blue, and can’t gain any more on green to compensate.

For a lot more about this see our Probability Minhash paper.