A few weeks ago, I started a new blog called The Shape of Data, which will focus on explaining the geometry behind modern data analysis, along the lines of the series of posts I wrote on this blog about a year ago. This involves very basic geometry/topology, so I didn’t think it would appropriate for LDTopology. I will continue to post to LDTopology about pure topology, but today I wanted to write a few words about why I started the new blog and what I hope to accomplish with it.

## March 21, 2013

## November 14, 2012

### Digital spheres and Reeb graphs

This post is about a slightly different direction in the topology of data sets than what I’ve written about previously, but one that is much more explicitly low-dimensional. When a doctor takes an MRI scan of a patient’s brain, they get a three dimensional lattice of points labeled foreground (brain) and background (not brain). The boundary between these two sets should be a sphere, and this is important for things like mapping regions of the brain to biological functions. However, because the MRI is not 100% accurate, even if your brain is a sphere, the MRI data may not be. Two electrical engineers (David Shattuck and Richard Leahy [1]) proposed a method to determine if the boundary of a brain scan is a sphere by building a certain graph from the cross sections of the scan. This seems to be one of the rare instances in which a cleanly stated math problem just drops out of science and engineering. Shortly after this, three mathematicians (Lowell Abrams, Donniell Fishkind and Carey Priebe [2]) picked the conjecture up, proved it, then proved a number of follow-up results. Below the fold, I’ll describe the conjecture and how it’s related to a classical topological tool – the Reeb graph of a Morse function.

## July 13, 2012

### Topological clustering

In the last post in my continuing series on the topology of large data sets, I discussed the problem of clustering and promised that we would finally get to some low dimensional, geometric topology. (I wrote about persistent homology a few posts back, but this approach comes out of K-theory, which is anything but low dimensional.) Recall that a clustering algorithm attempts to partition a data set into subsets that are intrinsic to the geometric structure of the data points. This is not a well defined problem by mathematical standards, but one common way to improve things slightly is by replacing the individual data points with a graph whose vertices are the data points and whose edges have weights measuring the similarity between points (for example, the Gaussian of the distance between them). Define the *boundary* of a vertex set as the number of edges with exactly one endpoint in the set and the *area* of the boundary as the sum of these edge weights. The *volume* of the vertex set is the sum of the weights of edges with both endpoints in the set. A good cluster will be one with very small boundary area but large volume, and whose complement also has large volume. This is starting to sound like geometric topology.

## July 5, 2012

### Finding clusters in data

I want to switch gears now, in my series on the topology of data sets, to the the problem of *clustering*. Recall that we are looking for ways to analyze large sets of data, where each point is a vector that we think of as having been sampled from some distribution. If this distribution is a union of a number of disjoint distributions then we would expect the geometry of the data points to reflect this by forming a number of distinguishable, tightly packed “clusters”. The goal of a clustering algorithm (called *unsupervised learning* in the machine learning parlance) is to partition the data set into some number of subsets based only on its geometric structure.

## June 23, 2012

### Refining information from persistent homology

In the previous post in my continuing series on the topology of big data, I described how a set of techniques called *persistent homology* can be used to infer the large scale topological structure of a data set. The problem is that while the basic algorithm gives you generators of the homology groups, it doesn’t necessarily give you the nicest/shortest representatives. In this post, I’ll discuss a number of approaches to turning the information from persistent homology into a more concrete representation of the data.

## June 12, 2012

### Topological exploration of data sets: Persistent homology

In this post, I continue my series on the topology of large data sets and, in particular, I will finally get to some topology. In my last few posts, I have tried to make the point that many of the widely used approaches to analyzing large sets of high dimensional data are only equipped to handle data that adheres to relatively simple, essentially linear models. While there are some methods for generalizing the algorithms to non-linear data sets (I mentioned kernels as one example) choosing the right method requires that you know something about the large scale structure of the data. In this post, I will discuss one approach to discovering the large scale topological structure, namely *persistent homology*. (For any readers who who are coming from the data mining side of things: I will assume below that the reader knows what homology groups are. If you don’t, you should still read on, but I’ll refer you to one of the books here or here for the details.)

(more…)

## May 27, 2012

### Making linear data algorithms less linear: kernels

In my last two posts, I described two approaches to high dimensional data analysis that are very linear: neural networks implicitly use planes to divide data sets into subsets, while principle component analysis looks for the best linear projection of a data set. (Both were part of my ongoing series on the topology of big data.) In practice, data sets often don’t conform to linear models, but there is a common method for trying to work around this: a *kernel* is a non-linear embedding of the original vector space into a higher dimensional vector space, with the goal of making the image set more compatible with linear algorithms. (This is unrelated to the term kernel used in algebra.)

## May 8, 2012

### Principal Component Analysis

I will continue my series of posts on the geometry and topology of big data with a description of *principle component analysis* (PCA), a technique from the statistics side of data analysis. Like my last post, there will be nothing explicitly related to or drawn from low dimensional topology, but bear with me. After a few more posts about the standard approaches to these problems, I will address how ideas drawn from topology can be applied to these sorts of problems. As mentioned before, I believe these applications have the potential to enrich the pure field with new problems and new techniques.

## April 23, 2012

### The geometry of neural networks

For the next few parts in my continuing series on the geometry/topology of high dimensional data, I plan to write about some of the classical approaches and algorithms for analyzing large data sets, with an emphasis on the explicit and implicit geometric/topological aspects. I’ve chosen topics based on what I’ve heard/read most about, so I hope that these will be relatively representative topics, but I make no guarantees. Later, I’ll switch to recent approaches that explicitly employ ideas from topology. I’ll start, in this post, with artificial neural netorks, which were one of the earliest approaches to machine learning/artificial intelligence. This algorithm is a good representative of a *gradient method*, i.e. an algorithm that iteratively improves a model for the data by following the gradient of some function.

## April 11, 2012

### Big Data and the Topologist

As I mentioned in my previous post, I plan to write a series of posts about study of large data sets, both the ways that high dimensional data has traditionally been studied and the topology that has recently been applied to this area. For anyone who has experience thinking about abstract geometric objects (as I assume most of the readers of this blog do) the concepts should seem pretty straightforward, and the difficulty is mostly in translation. So I will start with a post that focusses on defining terms. (Update: I’ve started a second blog The Shape of Data to look into these topics in more detail.)