In the previous post in my continuing series on the topology of big data, I described how a set of techniques called persistent homology can be used to infer the large scale topological structure of a data set. The problem is that while the basic algorithm gives you generators of the homology groups, it doesn’t necessarily give you the nicest/shortest representatives. In this post, I’ll discuss a number of approaches to turning the information from persistent homology into a more concrete representation of the data.
As described last time, the persistent homology algorithm constructs a sequence of simplicial complexes and calculates an algebraic structure called a bar code that describes which homology elements are “persistent” through a long range of these complexes. If you pick a long bar and one of the complexes in which it appears, you’re left with the problem of finding, within the corresponding homology class, a cycle of minimal length/area/volume (depending on the dimension).
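To make the setup concrete, here is a minimal sketch of the bar code computation using the GUDHI library (my choice for illustration; nothing in this post depends on a particular package): sample a noisy circle, build a Vietoris–Rips filtration on it, and read off the long bars in dimension one.

```python
import numpy as np
import gudhi

# Sample a noisy circle: a data set whose bar code should show a single
# long bar in dimension one, corresponding to the loop.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
points = np.column_stack([np.cos(theta), np.sin(theta)])
points += rng.normal(scale=0.1, size=points.shape)

# Build the filtered sequence of simplicial complexes (a Vietoris-Rips
# filtration) and compute its persistence bar code.
rips = gudhi.RipsComplex(points=points, max_edge_length=2.0)
simplex_tree = rips.create_simplex_tree(max_dimension=2)
barcode = simplex_tree.persistence()

# Long bars in dimension one are the persistent loops.
for dim, (birth, death) in barcode:
    if dim == 1 and death - birth > 0.5:
        print(f"persistent 1-cycle: born {birth:.2f}, dies {death:.2f}")
```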
Zomorodian and Carlsson (PDF) devised an algorithm to approximate a given cycle by finding open covers (in the topological sense) of the complex that reflect the given homology class. (Thanks to Mikael Vejdemo-Johansson for pointing me to this paper in the comments on my previous post.) A number of computer scientists have devised algorithms that are guaranteed to find the minimal cycle in a complex of a certain type (usually a surface). The most general result like this that I’ve found is by Dey, Hirani and Krishnamoorthy (PDF), who solve the problem for any complex in which all the subcomplexes of a certain type have torsion-free homology. They translate the problem into a linear program: the torsion-free condition makes the boundary matrix totally unimodular, so the linear program has an integral optimum and the minimal cycle can be found in polynomial time.
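To show what that translation looks like, here is a sketch on a toy complex of my own choosing: a square with one of its two triangles filled in, so the four-edge boundary cycle is homologous to a three-edge cycle. Minimizing the 1-norm of a chain over a homology class becomes a linear program once you split the chain into its positive and negative parts, and SciPy’s stock solver handles it.

```python
import numpy as np
from scipy.optimize import linprog

# Toy complex: vertices 0-3, edges e0=(0,1), e1=(1,2), e2=(0,2),
# e3=(2,3), e4=(0,3), and one filled triangle t0=(0,1,2).
# Boundary of t0 in the edge basis: d(0,1,2) = e0 + e1 - e2.
B = np.array([[1.0],    # e0
              [1.0],    # e1
              [-1.0],   # e2
              [0.0],    # e3
              [0.0]])   # e4
n_edges, n_tris = B.shape

# Initial cycle z: the four-edge square 0 -> 1 -> 2 -> 3 -> 0.
z = np.array([1.0, 1.0, 0.0, 1.0, -1.0])

# Minimize ||x||_1 subject to x = z + B y, i.e. over the homology class
# of z. Splitting x = xp - xm with xp, xm >= 0 makes this a linear
# program; the variable vector is ordered [xp, xm, y].
c = np.concatenate([np.ones(n_edges), np.ones(n_edges), np.zeros(n_tris)])
A_eq = np.hstack([np.eye(n_edges), -np.eye(n_edges), -B])
bounds = [(0, None)] * (2 * n_edges) + [(None, None)] * n_tris
res = linprog(c, A_eq=A_eq, b_eq=z, bounds=bounds)

x = res.x[:n_edges] - res.x[n_edges:2 * n_edges]
print("minimal homologous cycle:", np.round(x, 3))
```

On this example the solver trades the four-edge square for the three-edge triangle on the unfilled side, and total unimodularity is what guarantees the answer comes out integral rather than fractional.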
Of course, it’s probably more useful in general to have a simple representation for the whole space rather than just the homology generators. Adams, Atanasov and Carlsson describe an algorithm for creating a simplified graph (a one-dimensional skeleton) representing the data based on a persistence complex and a density function. Their one-dimensional skeleton is, roughly, the one defined by the handle decomposition induced by the density function, thought of as a Morse function. (This is probably more closely related to Carlsson and Memoli’s multi-parameter clustering algorithm, which I’ll discuss in a future post.)
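Their precise construction is more delicate than I want to reproduce here, but a closely related construction, the Mapper algorithm of Singh, Memoli and Carlsson, gives the flavor: cover the range of a filter function with overlapping intervals, cluster each preimage, and connect clusters that share points. To be clear, the sketch below is Mapper, not the Adams-Atanasov-Carlsson algorithm, and all the parameter choices are mine.

```python
import numpy as np
from collections import defaultdict
from itertools import combinations
from sklearn.cluster import DBSCAN

def mapper_skeleton(points, filter_values, n_intervals=10, overlap=0.3, eps=0.3):
    """One-dimensional skeleton in the style of Mapper: cover the range
    of the filter with overlapping intervals, cluster each preimage, and
    join clusters that share data points."""
    lo, hi = filter_values.min(), filter_values.max()
    length = (hi - lo) / n_intervals
    vertices = []                      # each vertex is a set of point indices
    point_to_vertices = defaultdict(set)
    for i in range(n_intervals):
        a = lo + (i - overlap) * length
        b = lo + (i + 1 + overlap) * length
        idx = np.where((filter_values >= a) & (filter_values <= b))[0]
        if len(idx) == 0:
            continue
        # Cluster the preimage of this interval; each cluster is a vertex.
        labels = DBSCAN(eps=eps, min_samples=1).fit_predict(points[idx])
        for lab in np.unique(labels):
            members = idx[labels == lab]
            vertices.append(set(members))
            for p in members:
                point_to_vertices[p].add(len(vertices) - 1)
    # Two vertices are joined by an edge when their clusters overlap.
    edges = {e for vs in point_to_vertices.values()
             for e in combinations(sorted(vs), 2)}
    return vertices, edges

# Example: a noisy circle with the x-coordinate as the filter (a density
# estimate is another common choice); the output graph should itself
# contain a loop, reflecting the topology of the data.
rng = np.random.default_rng(1)
theta = rng.uniform(0, 2 * np.pi, 300)
pts = np.column_stack([np.cos(theta), np.sin(theta)])
pts += rng.normal(scale=0.05, size=pts.shape)
skeleton_vertices, skeleton_edges = mapper_skeleton(pts, pts[:, 0])
```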
But perhaps you want even more than a one-dimensional skeleton. If you can use topological tools such as persistent homology to build a really thorough understanding of the topology of your data (or are willing to guess), you can also use an algorithm called a self-organizing map (sometimes called a Kohonen network). This approach gets its inspiration from neural networks and has a very similar feel. It works as follows: choose a simplicial complex whose topology matches the expected topology of the data and map it into the space in which the data is embedded. Then cycle through the data points and have each data point “pull” any nearby vertices of the complex towards it. When any vertex of the complex is moved, the adjacent vertices in the complex are moved in the same direction, but not as far. (This is how the algorithm makes use of the topology of the complex.) In the same way that a neural network (hopefully) converges by flowing along gradients defined by cycling through the data, the self-organizing map (hopefully) pulls the complex toward a reasonable approximation of the data. If the complex happens to be a manifold, then this is an example of manifold learning.
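Here is a minimal sketch of that update rule for the simplest interesting topological model, a circle of vertices; the function and its parameters are mine rather than any particular library’s. The key line is the neighborhood pull: the force on each vertex decays with its distance from the winning vertex measured along the ring, not in the ambient space.

```python
import numpy as np

def self_organizing_map(data, n_vertices=20, epochs=50,
                        lr=0.5, radius=3.0, seed=0):
    """Fit a circular one-dimensional complex (a ring of vertices) to the
    data: each data point pulls its nearest vertex toward it, and that
    vertex's neighbors in the ring follow, with a force that decays with
    their distance in the complex."""
    rng = np.random.default_rng(seed)
    verts = data[rng.choice(len(data), n_vertices, replace=False)].copy()
    ring = np.arange(n_vertices)
    for epoch in range(epochs):
        # Shrink the learning rate and neighborhood over time so the
        # complex settles down instead of chasing individual points.
        t = epoch / epochs
        lr_t = lr * (1 - t) + 0.01
        radius_t = radius * (1 - t) + 0.5
        for x in data[rng.permutation(len(data))]:
            bmu = np.argmin(np.linalg.norm(verts - x, axis=1))
            # Distance measured along the ring, i.e. in the complex,
            # not in the space the data lives in.
            ring_dist = np.minimum(np.abs(ring - bmu),
                                   n_vertices - np.abs(ring - bmu))
            pull = lr_t * np.exp(-(ring_dist / radius_t) ** 2)
            verts += pull[:, None] * (x - verts)
    return verts
```

The shrinking radius is what lets the ring first find the overall shape of the data and then tighten onto it; with the radius held at zero the vertices stop feeling the complex at all and the procedure degenerates into plain vector quantization.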
The self-organizing map predates persistent homology, but it seems like the two ideas would be very compatible when used together: persistent homology to determine which topological model to use, and the self-organizing map to turn that topological model into a geometric model of the data. I don’t know whether this has been done anywhere before, but if you do, let me know in the comments.