In Part 1 of this series, I gave an abstract description of one of the main problems in Machine Learning, the Generalization Problem, in which one uses the values of a function at a finite number of points to infer the entire function. The typical approach to this problem is to choose a finite-dimensional subset of the space of all possible functions, then choose the function from this family that minimizes something called *cost function*, defined by how accurate each function is on the sampled points. In this post, I will describe how the regression example from the last post generalizes to a family of models called Neural Networks, then describe how I recently used some fairly basic topology to demonstrate restrictions on the types of functions certain neural networks can produce.

First lets recall the setup: We have a finite set of points and we want to find a function such that for each , the distance from to in is “as small as possible”. We do this by choosing a map , then picking the point in whose image minimizes a previously chosen cost function. In the last post, we also had a second set that we used to evaluate how well we did, but we won’t need that in this post.

The basic example of this is linear regression where , so and are both one-dimensional, and is spanned by the variables . This can be generalized to higher-dimensional and by replacing the scalars by a matrix and a vector, respectively. This is higher-dimensional linear regression.

Neural networks define a very general framework for defining other families of functions. One goal of this post is to describe a large (but not complete) chunk of this framework. But before we do that, we need to go from linear regression to logistic regression.

Logistic regression is used for prediction problems where rather than predicting a value associated with a data point, we want to predict the probability that a data point is in a given class or meets some condition. So we want a function whose range is the interval rather than all real numbers.

Logistic regression accomplishes this by composing the family of functions from linear regression with the *logistic function* a scaled and translated version of hyperbolic tangent that maps the line to the interval . So if is one-dimensional, then we will still have but a point will map to the function .

In the training set for a logistic regression problem, the -value of each datapoint will be either 0 or 1. The cost function for each data point is the negative log of the difference between the predicted and actual values. This logarithm is not arbitrary – it’s the result of calculating the likelihood (in the statistical sense) of the function (interpreted as a probability distribution) and applying a logarithm to turn the multiplication in the definition of likelihood into the addition required for the cost function. Details left to the reader.

As with linear regression, we can increase the dimension of or in logistic regression. For , it’s simply a matter of changing to a vector, since we can still apply the logistic function to the one-dimensional output. To increase the dimension of , we further promote to a matrix and to a vector, then apply the logistic function independently to each output dimension. We can interpret this as each dimension of predicting a different class/condition. So logistic regression is not basis-independent in the way that linear regression is.

In the realm of machine learning, the logistic function in logistic regression is called an *activation function*. The idea is that each dimension represents a neuron and the output value represents whether or not the neuron is firing. So the activation function determines how the linear combination of values feeding into the neuron determines whether or not it should fire. (In theory, the output of the activation function should be a boolean, but then we couldn’t do gradient descent.) Another popular activation function is the Rectified Linear Unit (ReLU) .

Now that we know what logistic regression is and what activation functions are, we can define a large family of neural networks by simply composing a chain of (higher-dimensional) logistic regressions. Each regression is called a *layer*. The outputs dimensions have to match up to the input dimensions, and the overall parameter space of the composition is the direct product of the parameter spaces of the layers. We can use the same activation functions between layers, or different ones.

That’s it.

The reason this is called an (artificial) neural network is because we can think of the output dimensions of each layer as being neurons that are connected to neurons in the next layer. The linear part of each logistic regression defines the input to each neuron as a weighted sum of the previous neurons’ outputs. For this reason, the matrix is often called a *weight matrix* and the values are called *weights*. Every layer except the last one is called a *hidden layer*.

Note that the weights, which make up the parameters of the function family, only modify the linear steps in the neural network, while the activation functions remain fixed. However, without the non-linear activation functions the neural network would just be a composition of linear functions which would just produce a single linear function.

As I noted, this is a large family of neural networks, but the concept is much more general. There are neural networks where there are connections between non-consecutive layers, or where the neurons aren’t in layers at all. There are networks where certain weights are constrained to be the same as each other (such as convolutional neural networks) and networks that take a sequence of inputs (Recurrent Neural Networks and Long Short-Term Memory Machines).

But the family I defined above is where one typically starts, and that’s what I’m going to focus on for the last few paragraphs of this blog post.

As with linear or logistic regression, a neural network defines a finite-dimensional family of functions. But unlike with regression, there isn’t an easy way to characterize these functions. It was proved in 1989, by three independent groups, that if you have one hidden layer, but allow it to have arbitrarily high dimension, you can approximate any continuous function to an arbitrarily small epsilon on any compact subset.

However, there’s an open question of whether something similar is true if you restrict the dimension but allow arbitrarily many hidden layers. The preprint I mentioned in the last post proves that for neural networks with a one-dimensional output, if the hidden layers are only allowed to have dimension less than or equal to the input dimension, there are certain functions that can’t be approximated, regardless of the number of layers. In particular, every component of every level set in such a function must be unbounded.

The main observation in the proof is that for a neural network with a one-dimensional output, if the hidden layers are all the same dimension, the weight matrices are all non-singular and the activation function is one-to-one, then the composition of all the functions up to right before the last linear function will be one-to-one too. That last linear function is just a projection onto the line, so the preimage of any point into the last hidden layer is a hyperplane. Since the activation functions may not be onto, the preimage of this hyperplane in the one-to-one map may not be a topological hyperplane, but it will be an unbounded subset of the domain.

But the Theorem also applies to networks with lower-dimensional hidden layers, and doesn’t make assumptions about the weight matrix. That’s because these functions are all in the closure of the previous set so there’s a limit argument to show that their level sets also have to be unbounded. In fact you can also apply the limit argument if the activation function is a uniform limit of one-to-one functions, like the ReLU. The proof is the type of argument you might see in an undergraduate topology class.

So as with many examples of applied math, the mathematics of this result isn’t particularly complex. What makes it interesting is the connection to ideas that are studied elsewhere. It also suggests a new set of problems that could lead to more interesting math, namely the question of how to characterize finite-dimensional families of functions such as those defined by neural networks.

## Leave a Reply