Low Dimensional Topology

October 21, 2018

The Topology of Neural Networks, Part 1: The Generalization Problem

Filed under: Uncategorized — Jesse Johnson @ 10:21 am

I gave a talk a few months ago at the Thompson-Scharlemann-Kirby conference about a theorem I recently proved about topological limitations on certain families of neural networks. Most of the talk was a description of how neural networks work in terms of more abstract mathematics than they’re typically described in, and I thought this was probably a good thing to write up in a blog post. I decided to split the post into two parts because it was getting quite long. So the first post will describe the general approach to defining Machine Learning models, and the second post will cover Neural Networks in particular.

I’ve found it surprisingly hard to learn about machine learning, including neural networks, from a mathematical perspective because most of the literature on the subject is (understandably) targeted towards future practitioners coming from an undergraduate computer science background, rather than towards mathematicians. It spends a lot of time carefully introducing basic mathematical concepts, goes very deep into subtle technical details and relies heavily on computer science notation/terminology with a bit of statistics terminology and notation thrown in. So even though many of the concepts are very closely related to ideas that we pure mathematicians know well, you still have to read through all the details to get to those concepts. It turns out it’s very hard to learn the notation and terminology of a subject without also starting from scratch (re)learning the concepts.

So I’ve been doing that in fits and starts for the last few years. When I started writing the Shape of Data blog, I focussed on extracting interesting undergraduate mathematics ideas from the very introductory ideas of the field because that’s as far as I had gotten into it. But after spending a few years writing that blog, then spending a few more years not writing but continuing to learn, I think I have a deep enough understanding to be able to extract some abstract ideas that are mathematically complex enough to be interesting to the audience of this blog. So that’s what this post is about.

At its core, machine learning is about making predictions, so we have to start by defining what that means. We’ll start with a problem called memorization that isn’t exactly what we’re looking for, but is pedagogically important.

The memorization problem: Given a finite set of points $D \subset X \times Y = \mathbf{R}^n \times \mathbf{R}^m$, find a function $f: X \rightarrow Y$ such that for each $(x, y) \in D$, $f(x) = y$.

The function $f$ is called a model and the interpretation of this problem is that each $(x, y)$ is a vector extracted from data about an instance of something that you want the model to make a prediction about. The $x$ is from data that you know before-hand. The $y$ is from the data that you want to predict. So for example, $x$ may be information about a person, and $y$ may be a way of encoding where they’ll click on a web page or how they’ll respond to a particular medical treatment.

The memorization problem is not a very interesting math problem because it’s either trivial and underspecified (if all the $x$s are distinct) or impossible (if there are two points with the same $x$ and different $y$s.) It’s also not a very good practical problem because it only allows you to make predictions about the data that you used to construct the function $f$. So the real statement of the prediction problem is more subtle:

The generalization problem: Given a finite set of points $D_T \subset X \times Y$ called the training set, find a function $f: X \rightarrow Y$ such that given a second set $D_E \subset X \times Y$ called the evaluation set, for each $(x, y) \in D_E$, $f(x)$ is “as close as possible” to $y$.

This is a much better practical problem because you’re constructing the model based on the training set – examples of the data that you’ve seen in the past – but evaluating it based on a second set of points – data that you haven’t seen before, but want to make predictions about.

However, as a mathematical problem it’s even worse than the first one. In fact, it isn’t even really a mathematical problem because there’s no logical connection between what’s given and what you have to find.

So to make it a mathematical problem, you have to add in assumptions about how $D_T$ and $D_E$ are related. There are many ways to do this, and any algorithm or technique in machine learning/statistics that addresses this problem makes such assumptions, either implicitly or explicitly.

The approach that I’m going to describe below, which leads to the definition of neural networks, makes this assumption implicitly by restricting the family of functions from which $f$ can be chosen. You solve the memorization problem for $D_T$ on this restricted family and the resulting model is your candidate solution to the generalization problem.

Before we go into more detail about what this mean, lets return to the notion of “as close as possible”. We’ll make this precise by defining what’s called a cost function $c: X \times Y \times Y \to \mathbf{R}$ such that $c(x, y, f(x))$ defines the “cost” of the difference between $f(x)$ and $y$. A common example is $c(x, y, f(x)) = (y - f(x))^2$. Given a dataset $D$, the cost of a given model is $c_D(f) = \sum_{(x, y) \in D} c(x, y, f(x))$.

With this terminology, the memorization problem is to minimize $c_{D_T}(f)$, while the generalization problem is to minimize $c_{D_E}(f)$.

Note that the cost function is a function on the space $C^0(X, Y)$ of continuous functions $X \to Y$. We will be interested in subspaces of $C^0(X, Y)$, and we can define one by choosing a map $\mu: \mathbf{R}^k \to C^0(X, Y)$. This $\mathbf{R}^k$ is often called parameter space, and we will use the symbol $\pi = \mathbf{R}^k$.

The canonical example of this is linear regression. In the case where $X = \mathbf{R}$ and $Y = \mathbf{R}$, we define $k = 2$, and we’ll let $m, b$ be the coordinates of $\pi = \mathbf{R}^2$. Define $\mu$ to be the function that takes $(m, b)$ to the function $f(x) = mx + b$, and use the difference-squared cost $c(x, y, f(x)) = (y - f(x))^2$ I mentioned above.

Now, given $D_T$ and $D_E$, we get a cost function $c_{D_T}$ on $C^0(\mathbf{R}, \mathbf{R})$ which we can pull back to a function on $\pi$ and choose a point $(m, b)$ that minimizes this cost function. To determine how close this particular family $\mu$ of models comes to solving the generalization problem compared to other families of models, we evaluate $c_{D_E}(f)$.

The nice thing about this setup is that it gives us a relatively objective way to evaluate families of models. You can often find a family with a lower minimum for $c_{D_T}$ by increasing the dimension of $\pi$ and making the set of possible models more flexible. However, if you take this too far this will eventually increase $c_{D_E}$ for the model that minimizes $c_{D_T}$. This is called the bias/variance tradeoff, and when you go too far it’s called overfitting.

There’s also a question of how you find the minimum in practice. For linear regression the cost function turns out to have a unique critical point which is the global minimum and there’s a closed form for the solution. However, for most model families you have to use an algorithm called gradient descent that discretely follows the gradient of the cost function until it finds a local minimum which may or may not be global.

So rather than just adding flexibility to a model family, the trick is to add the right kind of flexibility for a given dataset, i.e. in a way that minimizes the bias/variance tradeoff and reduces the number of spurious local minima. And this is where things get interesting from the perspective of geometry/topology since it becomes a question of how to characterize the different ways that a model family can be flexible, and how to connect these to properties of a dataset.

For example, the simplest way to make linear regression more flexible is to replace the line function with a polynomial of a fixed degree. However, this doesn’t turn out to be very practical in many cases because for higher-dimensional $X$, the number of parameters goes up exponentially with the degree. So you end up with a lot of flexibility that is either redundant, or isn’t useful for your given dataset. One reason neural networks have become so popular is that they manage to be extremely flexible with relatively few parameters, at least compared to polynomials.

In the follow-up to this post, I’ll describe how a neural network defines a family of models, and I’ll outline my recent result about topological constraints on certain of these families.

1. Thank you – I enjoyed this post. I look forward to the follow-up.

Comment by George Constantinides — January 11, 2019 @ 9:54 am

2. I would be very interested in seeing part 2. Alternatively, do you have slides from the conference talk?

Comment by isotropy — January 22, 2019 @ 8:08 pm

• The conference talk was a blackboard talk so no slides :(

Comment by Jesse Johnson — March 2, 2019 @ 7:41 am

3. Is the follow-up still happening?

Comment by Bootstamp — February 16, 2019 @ 9:42 am

• Yes! It’s been on my todo list, and I didn’t realize how long it’s been since I wrote part 1. I should be able to post part 2 some time next week.

Comment by Jesse Johnson — March 2, 2019 @ 7:40 am

Blog at WordPress.com.