I want to switch gears now, in my series on the topology of data sets, to the the problem of *clustering*. Recall that we are looking for ways to analyze large sets of data, where each point is a vector that we think of as having been sampled from some distribution. If this distribution is a union of a number of disjoint distributions then we would expect the geometry of the data points to reflect this by forming a number of distinguishable, tightly packed “clusters”. The goal of a clustering algorithm (called *unsupervised learning* in the machine learning parlance) is to partition the data set into some number of subsets based only on its geometric structure.

Here’s an approach that many people first think of: Choose a parameter , connect any two points within distance by an edge, then define each connected component of the graph to be a cluster. This, of course, depends on the parameter , so you really need to do something more along the lines of persistent homology to look at all different parameters and choose the best one. The data structure that encodes this process is called a dendrogram. But there are still some problems; for example, different clusters in the same data set often have different densities and thus are recognized by different values of . Carlsson-Memoli have dealt with many of these problems by introducing a multiparameter version (PDF) of this algorithm. They’ve gotten some very impressive results with it, but that’s not what I’m going to focus on in this post.

Based on what I’ve read and heard, the most commonly used clustering algorithm appears to be K-means. This algorithm progressively improves a model for the data as a union of spherical Gaussian distributions. The progressive nature is reminiscent of a neural network, though there is no explicit gradient function. Here’s how it works:

We start out with a positive integer and a set of random, distinct points in the space containing the data. For step 1, we assign each data point to the cluster corresponding to the point that it is closest to. For step 2, we replace each with the centroid (essentially the average) of the points assigned to the cluster . Then we cycle through steps 1 and 2 until either the clusters become stable or we choose an arbitrary cutoff. So, roughly speaking, the algorithm makes clusters from the Vornoi cells defined by the points and attempts to choose the points so that the Vornoi cells are as round as possible. The idea is that if the data comes from a union of spherical Gaussian distributions, the final points will be the centers of these distributions.

K-means is easy to understand and relatively fast, but it also has a number of drawbacks. First, you have to choose the value in advance. In some cases, there will be an a priori reason to fix a certain , but in cases where you don’t know how many clusters to expect, you essentially have to use trial and error. Second, the final result is highly dependent on the initial points . One can get around this problem by running the algorithm repeatedly from different starting points, or there are a number of heuristic algorithms for choosing initial values.

But the major problem with K-means, which was also a problem with the other data mining algorithms I’ve discussed, is that it only works if the data fits the model, i.e. if the natural clusters in the data are “convex” enough. If the data doesn’t, then K-means still gives you an answer, but it’s essentially meaningless. (Data miners like the expression “Garbage in, garbage out” and I suspect K-means is one of the main reasons.) This problem can be fixed in some cases by preprocessing such as applying a kernel. But it seems like what one really needs are more flexible methods, and this is where topology (even low-dimensional topology) comes in. In my next post, I will discuss how ideas from low-dimensional geometry/topology can be used to create better clustering algorithms.

The points you make about K-means are all valid. That is why people use information theoretic clustering these days. Look at Tishby and Still Geometric Clustering using Information Bottleneck. In fact there is a massive ongoing project to parallelize this version. You should look into it.

Comment by Dan Marthaler — July 5, 2012 @ 11:00 pm |

Thanks for the suggestion. I see that Tishby has written a lot of papers on this. I’ll look into it and, if see if it’s appropriate for a future post.

Comment by Jesse Johnson — July 6, 2012 @ 1:47 pm |

Agreed, nice post and all points are valid. But please note that there are a number of K-Means variants that address each of these issues, with different degrees of success. Also, K-Means allows you to use different distance measures giving you different shapes for the clusters.

Comment by Renato Cordeiro de Amorim — July 30, 2012 @ 6:47 am |

Yes, that’s true. A lot of work has been done to address these issues with K-means, and I understand that there are some very effective versions of the algorithm. But if one starts with a more flexible basic algorithm (such as the one I describe in the next post) then I think there’s even more potential to refine it and add to it as has been done with K-means.

Comment by Jesse Johnson — July 30, 2012 @ 9:07 pm |