In the last post in my continuing series on the topology of large data sets, I discussed the problem of clustering and promised that we would finally get to some low dimensional, geometric topology. (I wrote about persistent homology a few posts back, but this approach comes out of K-theory, which is anything but low dimensional.) Recall that a clustering algorithm attempts to partition a data set into subsets that are intrinsic to the geometric structure of the data points. This is not a well-defined problem by mathematical standards, but one common way to improve things slightly is to replace the individual data points with a graph whose vertices are the data points and whose edges have weights measuring the similarity between points (for example, the Gaussian of the distance between them). Define the *boundary* of a vertex set as the set of edges with exactly one endpoint in the set, and the *area* of the boundary as the sum of these edge weights. The *volume* of the vertex set is the sum of the weights of edges with both endpoints in the set. A good cluster will be one with very small boundary area but large volume, and whose complement also has large volume. This is starting to sound like geometric topology.
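These definitions are easy to make concrete. Here is a minimal sketch in Python of a Gaussian similarity graph together with the boundary area and volume of a vertex subset (the function names `similarity_graph`, `boundary_area`, and `volume` are my own illustrative choices, not from any particular library):

```python
import numpy as np

def similarity_graph(points, sigma=1.0):
    """Weighted adjacency matrix with Gaussian (heat-kernel) edge weights:
    w_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)), with zeros on the diagonal."""
    diffs = points[:, None, :] - points[None, :, :]
    d2 = (diffs ** 2).sum(axis=-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    return W

def boundary_area(W, subset):
    """Sum of weights of edges with exactly one endpoint in `subset`."""
    mask = np.zeros(len(W), dtype=bool)
    mask[list(subset)] = True
    return W[mask][:, ~mask].sum()

def volume(W, subset):
    """Sum of weights of edges with both endpoints in `subset`."""
    mask = np.zeros(len(W), dtype=bool)
    mask[list(subset)] = True
    return W[mask][:, mask].sum() / 2  # each internal edge is counted twice
```

For two well-separated point clouds, a cluster's boundary area is tiny compared to its volume, which is exactly what the definitions are after.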

In particular, recall that the *Cheeger constant* of a metric space $M$ is the minimum value of $\frac{A}{V}$ taken over all partitions of $M$ into two sets, where $A$ is the area of the common boundary of the two sets and $V$ is the smaller of the two volumes. There are natural generalizations to larger numbers of subsets, which the reader can work out. But one way to pose the clustering problem is to find a partition that approximates the Cheeger constant of the similarity graph.

This is the motivation behind an algorithm called spectral clustering: If you diagonalize the Laplacian operator on the graph, the eigenvector corresponding to the second-smallest eigenvalue (the smallest eigenvalue is always zero, with a constant eigenvector) is a function whose domain is the vertex set. The vertices on which this function is positive and those on which it is negative form a partition, and basic results of harmonic analysis say that this partition will give you a reasonable approximation of the Cheeger constant. Finding the eigenvectors involves linear algebra, which computers are good at, so this algorithm is fairly fast. It also lacks the rigid geometric constraints that I’ve complained about with respect to certain other algorithms, so it’s a step in the right direction. However, it does rely on an indirect approximation, and there are examples where this approximation turns out to be way off. We can do better.
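The whole procedure fits in a few lines. Here is a bare-bones sketch (dense linear algebra, no normalization, purely to illustrate the idea; `spectral_bipartition` is my own name for it):

```python
import numpy as np

def spectral_bipartition(W):
    """Split a weighted graph in two using the sign pattern of the
    eigenvector of the graph Laplacian L = D - W corresponding to the
    second-smallest eigenvalue (the Fiedler vector)."""
    d = W.sum(axis=1)               # weighted vertex degrees
    L = np.diag(d) - W              # unnormalized graph Laplacian
    vals, vecs = np.linalg.eigh(L)  # eigenvalues in ascending order
    fiedler = vecs[:, 1]            # skip the constant eigenvector
    return fiedler >= 0             # boolean cluster-membership vector
```

On a graph made of two tight clusters joined by a weak edge, the sign pattern of the Fiedler vector recovers the two clusters.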

Note that the Cheeger constant is defined essentially identically for graphs or Riemannian manifolds. In dimension three, moreover, there is a very strong connection between the geometric structure and the topology/combinatorics of the space thanks to the work of Thurston, Mostow, etc. For example, thanks to the Gauss-Bonnet Theorem, for surfaces of bounded curvature there is a linear relation between area and genus. In particular, Marc Lackenby has demonstrated that Heegaard surfaces (completely topological objects) are related to Cheeger constants [1]. The connection between Heegaard surfaces and data analysis comes via thin position.

I’ve written a lot about thin position on this blog, but for any new readers, thin position is essentially a method for rearranging the critical points of a Morse function in order to minimize the genus (or some other type of complexity) of its level sets. In order to translate this into the data setting, we have to decide what a Morse function should be. Any one-to-one function on the vertices determines an ordering of the vertices, so bijections from the vertices to the integers will play the role of Morse functions. For each vertex $v$, the set of vertices ordered before $v$ has a boundary, and these boundaries will play the role of the level sets. The vertices will play the role of critical points, and we’ll try to reorder them to simultaneously minimize the areas of all the level sets. (More precisely, we’ll minimize the tuple of boundary areas with respect to lexicographic order.)
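In code, the "level sets" of an ordering are just the boundaries of its initial segments. Here is a short sketch computing the tuple of boundary areas for a given ordering (`boundary_areas` is my own name; `W` is a symmetric weighted adjacency matrix as above):

```python
import numpy as np

def boundary_areas(W, order):
    """For an ordering of the vertices (a list of vertex indices),
    return the sequence of boundary areas of the initial segments
    {order[0], ..., order[i]} -- the level-set areas of the
    corresponding discrete Morse function."""
    mask = np.zeros(len(order), dtype=bool)
    areas = []
    for v in order:
        mask[v] = True
        areas.append(W[mask][:, ~mask].sum())  # weight crossing the cut
    return areas
```

On the two-clique example, an ordering that lists one clique before the other has a very thin level set (small boundary area) right between the two cliques.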

In thin position for three-manifolds, the procedure is to move an index-one critical point up past an index-two point whenever possible. In the data setting, there isn’t a notion of index, but there are reasonably straightforward criteria that one can work out to decide when permuting one vertex past another will reduce the areas of the boundaries. The details are in a paper I recently posted on the arXiv [2].
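To give the flavor of the reordering step, here is a naive stand-in (not the actual criteria from the paper): repeatedly try adjacent transpositions and keep any swap that lexicographically decreases the tuple of boundary areas. Since there are finitely many orderings and each accepted swap strictly decreases the tuple, this terminates.

```python
import numpy as np

def area_tuple(W, order):
    """Boundary areas of the initial segments {order[0], ..., order[i]}."""
    mask = np.zeros(len(order), dtype=bool)
    areas = []
    for v in order:
        mask[v] = True
        areas.append(W[mask][:, ~mask].sum())
    return areas

def thin_order(W, order):
    """Naive thinning sketch (not the paper's TILO criteria): greedily
    swap adjacent vertices whenever the swap lexicographically
    decreases the tuple of boundary areas."""
    order = list(order)
    improved = True
    while improved:
        improved = False
        for i in range(len(order) - 1):
            trial = order[:]
            trial[i], trial[i + 1] = trial[i + 1], trial[i]
            if area_tuple(W, trial) < area_tuple(W, order):
                order = trial
                improved = True
    return order
```

Recomputing the full area tuple for every trial swap is wasteful; the point of working out the actual criteria is precisely to decide locally, from the edges at the two vertices, whether a swap helps.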

In classical thin position, the level sets where the genus is a local minimum are incompressible surfaces, and it turns out that there is an analogous condition on the vertex set consisting of the first $i$ vertices, for each index $i$ at which the boundary area is a local minimum. Vertices in such a set are more connected to other vertices in the set than to vertices outside, and vertices outside are more connected to each other than to vertices inside. I call sets that meet this condition *pinch clusters*. Note that this is not directly related to the Cheeger constant because it does not take volume into account. But this actually makes the resulting algorithm more effective for finding unbalanced clusters that have much smaller or larger volume than their complements.
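One plausible reading of that condition can be checked directly (this is my illustrative formulation, not a definition from the paper; `is_pinch_cluster` is a hypothetical name):

```python
import numpy as np

def is_pinch_cluster(W, subset):
    """Check a pinch-cluster-style condition: every vertex in `subset`
    has more total edge weight into `subset` than into its complement,
    and every vertex outside has more weight into the complement."""
    mask = np.zeros(len(W), dtype=bool)
    mask[list(subset)] = True
    inside_wt = W[:, mask].sum(axis=1)    # each vertex's weight into the set
    outside_wt = W[:, ~mask].sum(axis=1)  # and into the complement
    return bool(np.all(inside_wt[mask] > outside_wt[mask]) and
                np.all(outside_wt[~mask] > inside_wt[~mask]))
```

On the two-clique graph, each clique passes the test and a mixed subset fails it, regardless of how unbalanced the two sides are.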

The most exciting thing is that finding clusters this way turns out to be effective on real world data sets. Doug Heisterkamp (who is in the Computer Science department here at OSU) has implemented the algorithm and has tested it on a number of different types of data sets [2]. (The code and executables are available here.) In computer science, you have to give every algorithm an acronym, so Doug and I have named the vertex re-ordering algorithm TILO (Topologically Intrinsic Lexicographic Ordering), and we have an algorithm that chooses the best local minima to cut along, called PRC (Pinch Ratio Clustering).

So the upshot is that thin position leads to a very interesting and useful analogy between incompressible surfaces in three-dimensional manifolds and cluster boundaries in data analysis. My hope is that this analogy will make it possible to translate even more ideas from low-dimensional topology into data analysis, and to translate problems from data analysis into topology, thus enriching both fields.
