For the next few parts in my continuing series on the geometry/topology of high dimensional data, I plan to write about some of the classical approaches and algorithms for analyzing large data sets, with an emphasis on the explicit and implicit geometric/topological aspects. I’ve chosen topics based on what I’ve heard/read most about, so I hope that these will be relatively representative topics, but I make no guarantees. Later, I’ll switch to recent approaches that explicitly employ ideas from topology. I’ll start, in this post, with artificial neural netorks, which were one of the earliest approaches to machine learning/artificial intelligence. This algorithm is a good representative of a gradient method, i.e. an algorithm that iteratively improves a model for the data by following the gradient of some function.
The artificial neural network algorithm is modeled on the way neurons in the brain work. Neurons send impulses back and forth, with each neuron deciding whether or not to fire based on the impulses it has received from nearby neurons. An artificial neural network (ANN) is encoded as a finite graph/network with directed edges/links, in which each vertex/node represents a neuron. The edges pointing into each vertex are its inputs and the edges pointing away are its outputs. Each vertex calculates a weighted sum of its inputs, and transmits along its output edges the value of some function of this weighted sum. For the sake of simplicity, we will assume that the graph does not have any directed loops, but this is not a completely necessary assumption.
A neural network can answer questions as follows: We assume one set of nodes doesn’t have any inward pointing edges, and we have to assign to them values to transmit. The assigned values come from the input vector. We will also designate some nodes as output nodes. We can then traverse the network, calculating the value that each node transmits based on the weighted sum of values transmitted by its inputs. After we make it through the whole graph, we read off the values transmitted by the output nodes, and this vector is our answer.
As mentioned above, a node doesn’t just output the weighted sum of its inputs but a function of that sum. This is commonly a step function that, say, returns 1 if the weighted sum of inputs is greater than 1 and returns 0 if the sum is less than or equal to 1. However, in a minute we’ll want this function to be differentiable, so we’ll actually choose a smooth approximation of this step function. (There are a number of different formulas one could use, and for now, it doesn’t matter which one you prefer, as long as its derivative is always positive.)
So once we pick the weights on the edges, the neural network will be able to answer questions, i.e. take input in the form of vectors and output a set of values, each very close to 0 or 1. To get the network to produce the values that we want, we have to choose the correct weights. We’ll assume that we are given a collection of vectors, and each vector is labeled with the desired output. For simplicity, we will assume there is a single output value. For example, each vector could be statistics measured from a tumor, and the desired output is 0 or 1 depending on whether the tumor turned out to be malignant or benign.
We’ll start by choosing a random set of weights on the edges and considering the first labeled vector . If we choose different edge weights, the different outputs produced when we input defines a function on the the space of all possible edge weights. We could pick a completely new set of weights where the output is exactly what we want, but this would ignore the fact that we want the output to be correct for all the data points that come after as well. In particular, could be an outlier and have nothing to do with the structure of the set.
Instead, we want to just nudge the weights in the right direction, then nudge them again for the next data point and so on until we start getting reasonably correct results. Since we chose a differentiable approximation of a step function at each vertex, this function on the weight space will be differentiable, and its gradient will tell us the most efficient direction in which to push the weights. (This is the gradient in the term gradient method.) If we run through the labeled data a number of times and modify the weights for each point, we would hope to get a reasonably good set of weights that would also give us correct answers for new data. (Practitioners generally split the data into one set for determining the weights and a second set for verifying that these weights return reasonable results, in order to avoid overfitting.)
OK, but what does this have to do with geometry/topology? Well, a difficult problem for neural networks is choosing a good graph – It has to be complex enough to differentiate between the given sets (benign/malignant, etc.) but still simple enough to be computationally reasonable. So we have to figure out what geometric/topological possibilities are allowed by different networks.
We’ll start by looking at the simplest possible neural network – input vertices, each sending output to one output vertex. What sets of input vectors can it differentiate between? Let be the input values and be the weights on the input edges. The vertex outputs (a value close to) 1 precisely when , so it splits the space of input vectors along a (hyper)plane. In other words, no matter what weights are chosen, it can only differentiate between two sets if they are separated by a hyperplane. Two such sets are called linearly separable.
A slightly more complicated neural network has the input nodes, all feeding into more nodes, with the nodes feeding into a final output node. Each of the first nodes outputs either 0 or 1 (or rather, a value close to 0 or 1), so the final node outputs 1 whenever a certain number of the middle nodes send it a 1. Since the nodes pick out hyperplanes, this means that the final node picks out a (possibly concave, possibly infinite) polyhedron. (This is where the step function is important. For example, if we chose a linear function on each of the middle nodes, then we would just get another hyperplane for the whole network.)
More complicated neural networks will produce more complicated polyhedral structures, and determining how the graph structure of the network corresponds to different polyhedral structures is a potentially very interesting problem. Conversely, determining the smallest neural network that can differentiate between sets with a given topological type or homotopy type could also be interesting.
One can generalize the network in a number of different ways, particularly once the geometry is understood. For example, make the value transmitted by each of the middle nodes be the Gaussians () of the distance from the input vector to the vector defined by the weights on the input edges. This would replace the half-hyperplane defined by a step function with a ball centered at the point defined by the weights. With a few additional modifications, this is the basis for the Self-Organizing Map commonly used in manifold learning. (The SOM algorithm was first defined in terms of a modified neural network, called a Kohonen network after its inventor, but has since been translated into a context independent of neural networks.) With Gaussian functions on the nodes, following the gradient corresponds to moving these balls around in order to optimally cover the data.