In my last two posts, I described two approaches to high dimensional data analysis that are fundamentally linear: neural networks implicitly use planes to divide data sets into subsets, while principal component analysis looks for the best linear projection of a data set. (Both were part of my ongoing series on the topology of big data.) In practice, data sets often don’t conform to linear models, but there is a common method for trying to work around this: a kernel is a non-linear embedding of the original vector space into a higher dimensional vector space, chosen with the goal of making the image of the data set more compatible with linear algorithms. (This is unrelated to the term kernel used in algebra.)
As an example, assume we have a set of points in the plane with coordinates $(x, y)$ and we would like to distinguish between two subsets that happen to be separated by the curve $y = x^2$. This is not linear, so for example a neural network would need a large number of nodes to even approximate the two sets. However, once we saw that the neural network wasn’t working, we might try embedding the points in a five dimensional vector space by the kernel $(x, y) \mapsto (x, y, x^2, xy, y^2)$. Under this embedding, the curve becomes the hyperplane on which the second and third coordinates are equal, which can be described by a very simple neural network, or easily handled by a host of other algorithms. This is a polynomial kernel, but there are many other types of kernels that have been defined and studied.
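To make this concrete, here is a minimal sketch of the embedding in Python with NumPy (my choice of tools, not anything specified above). It labels random points by which side of the curve $y = x^2$ they fall on, and checks that a single linear function separates the two classes after the embedding.

```python
import numpy as np

def embed(points):
    # Degree-2 polynomial feature map (x, y) -> (x, y, x^2, x*y, y^2).
    # The function name is illustrative, not from any particular library.
    x, y = points[:, 0], points[:, 1]
    return np.column_stack([x, y, x**2, x * y, y**2])

# Sample random points and label them by which side of y = x^2 they lie on.
rng = np.random.default_rng(0)
pts = rng.uniform(-1, 1, size=(200, 2))
labels = pts[:, 1] > pts[:, 0] ** 2

# In the embedded space the separating curve becomes the hyperplane where
# the second and third coordinates agree, so the linear function z2 - z3
# separates the two subsets exactly.
z = embed(pts)
assert np.all((z[:, 1] - z[:, 2] > 0) == labels)
```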
In practice, most algorithms don’t actually calculate the images of the points under the kernel, but rather use the kernel to modify the metric or inner product used by the algorithm (this is often called the kernel trick). So in some sense a kernel is effectively just a non-linear metric on the vector space. However, it is often important for the algorithms that this metric comes from an embedding into a vector space, and thinking of a kernel as an embedding can be conceptually useful in its own right.
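Here is a sketch of that idea, under one assumption of mine: if the cross term in the embedding above is rescaled to $\sqrt{2}\,xy$, then the inner product of two embedded points can be computed directly in the original plane, without ever constructing the five dimensional vectors.

```python
import numpy as np

def phi(p):
    # Feature map (x, y) -> (x, y, x^2, sqrt(2)*x*y, y^2): the embedding
    # from the example above, with the cross term rescaled so that the
    # inner products below match exactly.
    x, y = p
    return np.array([x, y, x**2, np.sqrt(2) * x * y, y**2])

def k(u, v):
    # The same inner product, computed directly from the 2-d coordinates.
    return u @ v + (u @ v) ** 2

u, v = np.array([0.3, -1.2]), np.array([2.0, 0.7])
assert np.isclose(phi(u) @ phi(v), k(u, v))
```

Any algorithm that touches the data only through inner products (or the distances derived from them) can therefore swap in $k$ and work in the embedded space implicitly.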
Being able to use a kernel doesn’t immediately solve the problem of understanding high dimensional data; it only redefines it. A kernel only works if we find the right one, which means that we still need to understand something about the structure of the data (or make a really lucky guess). An algorithm called a support vector machine measures the separation between two linearly separable sets, and can be used to compare different kernels or to tune the parameters of a kernel, as in the sketch at the end of this post. But there are far too many kernels and parameters to try every single one. So using kernels turns data mining into an iterative process in which an analyst builds a progressively more accurate picture of the data set by many different means. Even if the final goal is a kernel on top of a relatively simple linear algorithm, a large set of tools is needed, and any information about the data set, including intuitive topological information, can come in handy. In my next few posts, I’ll write about some of the topological methods that can help us understand the structure of a data set.
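For the support vector machine comparison mentioned above, here is the promised sketch, written with scikit-learn (my choice of library): cross-validated accuracy on the earlier example gives a rough way to compare a linear kernel against polynomial kernels of different degrees.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic data separated by the curve y = x^2, as in the earlier example.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 2))
labels = (X[:, 1] > X[:, 0] ** 2).astype(int)

# Compare kernels by cross-validated accuracy. Setting coef0=1 includes
# lower-degree terms, so the degree-2 kernel can represent y - x^2 exactly
# and should come out ahead.
for name, clf in [
    ("linear", SVC(kernel="linear")),
    ("poly, degree 2", SVC(kernel="poly", degree=2, coef0=1.0)),
    ("poly, degree 3", SVC(kernel="poly", degree=3, coef0=1.0)),
]:
    scores = cross_val_score(clf, X, labels, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```

Of course, this only compares the handful of kernels we thought to try, which is exactly the limitation described above.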