Low Dimensional Topology

May 27, 2012

Making linear data algorithms less linear: kernels

Filed under: Data topology — Jesse Johnson @ 7:06 am

In my last two posts, I described two approaches to high dimensional data analysis that are very linear: neural networks implicitly use planes to divide data sets into subsets, while principle component analysis looks for the best linear projection of a data set. (Both were part of my ongoing series on the topology of big data.) In practice, data sets often don’t conform to linear models, but there is a common method for trying to work around this: a kernel is a non-linear embedding of the original vector space into a higher dimensional vector space, with the goal of making the image set more compatible with linear algorithms. (This is unrelated to the term kernel used in algebra.)

As an example, assume we have a set of points in the plane with coordinates $(x,y)$ and we would like to distinguish between two subsets that happen to be separated by the curve $y = x^2$. This is not linear, so for example a neural network would need a large number of nodes to even approximate the two sets. However, once we saw that the neural network wasn’t working, we might try embedding the points in a five dimensional vector space by the kernel $(x,y) \rightarrow (x, y, x^2, y^2, xy) = (a, b, c, d, e)$. Under this embedding, the curve $y = x^2$ becomes the hyperplane $b = c$, which can be described by a very simple neural network, or easily handled by a host of other algorithms. This is a polynomial kernel, but there are many other types of kernels that have been defined and studied.

In practice, most algorithms don’t actually calculate the images of the points under the kernel, but rather use the kernel to modify the metric or inner product, etc. used by the algorithm. So in some sense a kernel is effectively just a non-linear metric on the vector space. However, it is often important for the algorithms that this metric comes from an embedding into a vector space, plus it can be conceptually useful.

Being able to use a kernel doesn’t immediately solve the problem of understanding high dimensional data – it only redefines it. A kernel only works if we find the right one, which means that we still need to understand something about the structure of the data (or make a really lucky guess.) An algorithm called a support vector machine measures the separation between two linearly separable sets and can be used to compare different kernels or to tune the parameters of a kernel. But there are far too many kernels and parameters to try every single one. So using kernels turns data mining into an iterative process in which an analyst builds a progressively more accurate picture of the data set by many different means. Even if the final goal is a kernel on top of a relatively simple linear algorithm, a large set of tools is needed and any information about the set, including intuitive topological information, can come in handy. In my next few posts, I’ll write about some of the topological methods that can help to understand the structure of a data set.

1. Hi Jesse, I am curious if you could give us a sampler of how topology comes into the picture with these tools.

Cheers, and keep up the nice work!

Comment by Jonas — June 1, 2012 @ 2:07 pm

• Argh, that’s what happens when you don’t read till the very end. Sorry, saw the bit where you mentioned that you’ll discuss the topological methods in the upcoming posts. Looking forward to it!

Comment by Jonas — June 1, 2012 @ 2:08 pm

• No worries. Thanks for the comments!

Comment by Jesse Johnson — June 2, 2012 @ 9:13 am

2. Aww I was hoping you’d go into more detail with an intuition about the geometry of infinite-D kernels (like you did with how multiple hidden layers in an ANN form a polytope), my current level of familiarity is (for example) Gaussian kernel==exp of a quadratic==infinite expansion==???==arbitrary nonlinear function representability.

Comment by Sanjeev Bhaskar — March 30, 2013 @ 12:04 pm

• The Gaussian kernel is a rather tough one, but here’s the rough idea. (I’ll go into more detail in a future post on Shape of Data): This kernel maps the space of points in our original feature space to the space of all functions on the feature space. (The space of all functions is an infinite dimensional vector space called a Banach space.) The kernel maps each point to a function that looks like a Gaussian blob around the point. To use the kernel, we need to know what a dot product looks like in the new space, and in this case it’s essentially the same as a dot product in a normal vector space (multiply and add) but since we’re in an infinite dimensional space, rather than adding we integrate. So the dot product of two functions $f$ and $g$ is the integral of the product $f \cdot g$.

So, for example if we want to know what a hyperplane looks like in this space, we just recall that a plane is defined by choosing a fixed vector and taking all the vectors whose dot product with this fixed vector is a given value. In the kernel space, a vector is a function, and the set of Gaussian blobs whose dot product is a given value will be (roughly) the fixed height contours of the function, i.e. all the point at which the function takes a given value. So if the function is a sum of weighted Gaussian blobs, you should be able to picture roughly what it looks like. (I don’t know how to describe it!)

Comment by Jesse Johnson — March 31, 2013 @ 9:49 am

• Based on your suggestion, I just wrote a post about Gaussian kernels on my Shape of Data blog: http://shapeofdata.wordpress.com/2013/07/23/gaussian-kernels/

Comment by Jesse Johnson — July 23, 2013 @ 5:32 am

Blog at WordPress.com.