I’ve found it surprisingly hard to learn about machine learning, including neural networks, from a mathematical perspective because most of the literature on the subject is (understandably) targeted towards future practitioners coming from an undergraduate computer science background, rather than towards mathematicians. It spends a lot of time carefully introducing basic mathematical concepts, goes very deep into subtle technical details and relies heavily on computer science notation/terminology with a bit of statistics terminology and notation thrown in. So even though many of the concepts are very closely related to ideas that we pure mathematicians know well, you still have to read through all the details to get to those concepts. It turns out it’s very hard to learn the notation and terminology of a subject without also starting from scratch (re)learning the concepts.

So I’ve been doing that in fits and starts for the last few years. When I started writing the Shape of Data blog, I focussed on extracting interesting undergraduate mathematics ideas from the very introductory ideas of the field because that’s as far as I had gotten into it. But after spending a few years writing that blog, then spending a few more years not writing but continuing to learn, I think I have a deep enough understanding to be able to extract some abstract ideas that are mathematically complex enough to be interesting to the audience of this blog. So that’s what this post is about.

At its core, machine learning is about making predictions, so we have to start by defining what that means. We’ll start with a problem called *memorization* that isn’t exactly what we’re looking for, but is pedagogically important.

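**The memorization problem:** Given a finite set of points $D_T \subset \mathbb{R}^n \times \mathbb{R}^m$, find a function $f : \mathbb{R}^n \to \mathbb{R}^m$ such that for each $(x, y) \in D_T$, $f(x) = y$.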

The function $f$ is called a *model* and the interpretation of this problem is that each $(x, y)$ is a vector extracted from data about an instance of something that you want the model to make a prediction about. The $x$ is from data that you know beforehand. The $y$ is from the data that you want to predict. So for example, $x$ may be information about a person, and $y$ may be a way of encoding where they’ll click on a web page or how they’ll respond to a particular medical treatment.

The memorization problem is not a very interesting math problem because it’s either trivial and underspecified (if all the $x$s are distinct) or impossible (if there are two points with the same $x$ and different $y$s). It’s also not a very good practical problem because it only allows you to make predictions about the data that you used to construct the function $f$. So the real statement of the prediction problem is more subtle:

**The generalization problem:** Given a finite set of points $D_T$ called the *training set*, find a function $f$ such that given a second set $D_E$ called the *evaluation set*, for each $(x, y) \in D_E$, $f(x)$ is “as close as possible” to $y$.

This is a much better practical problem because you’re constructing the model based on the training set – examples of the data that you’ve seen in the past – but evaluating it based on a second set of points – data that you haven’t seen before, but want to make predictions about.

However, as a mathematical problem it’s even worse than the first one. In fact, it isn’t even really a mathematical problem because there’s no logical connection between what’s given and what you have to find.

So to make it a mathematical problem, you have to add in assumptions about how the training set $D_T$ and the evaluation set $D_E$ are related. There are many ways to do this, and any algorithm or technique in machine learning/statistics that addresses this problem makes such assumptions, either implicitly or explicitly.

The approach that I’m going to describe below, which leads to the definition of neural networks, makes this assumption implicitly by restricting the family of functions from which the model $f$ can be chosen. You solve the memorization problem for the training set $D_T$ on this restricted family and the resulting model is your candidate solution to the generalization problem.

Before we go into more detail about what this means, let’s return to the notion of “as close as possible”. We’ll make this precise by defining what’s called a *cost function* $c : \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}$ such that $c(f(x), y)$ defines the “cost” of the difference between $f(x)$ and $y$. A common example is $c(a, b) = \|a - b\|^2$. Given a dataset $D$, the *cost* of a given model $f$ is $c_D(f) = \sum_{(x, y) \in D} c(f(x), y)$.

With this terminology, the memorization problem is to minimize $c_{D_T}$, the cost on the training set, while the generalization problem is to minimize $c_{D_E}$, the cost on the evaluation set.
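As a rough illustration of the two costs, here is a minimal Python sketch. The names `squared_cost` and `dataset_cost`, the toy datasets and the candidate model are all assumptions made up for this example; it uses the difference-squared cost on one-dimensional data.

```python
def squared_cost(prediction, target):
    """Difference-squared cost between a prediction and the true value."""
    return (prediction - target) ** 2


def dataset_cost(model, dataset):
    """Total cost of a model on a dataset of (x, y) pairs."""
    return sum(squared_cost(model(x), y) for x, y in dataset)


# Toy data, invented for illustration only.
training_set = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
evaluation_set = [(3.0, 7.5), (4.0, 8.5)]


def model(x):
    """A candidate model that happens to memorize the training set."""
    return 2.0 * x + 1.0


training_cost = dataset_cost(model, training_set)      # memorization objective
evaluation_cost = dataset_cost(model, evaluation_set)  # generalization objective
print(training_cost, evaluation_cost)
```

Here the model has zero cost on the training set but a nonzero cost on the evaluation set, which is exactly the gap between memorization and generalization.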

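Note that the cost function $c_D$ is a function on the space $C^0(\mathbb{R}^n, \mathbb{R}^m)$ of continuous functions $\mathbb{R}^n \to \mathbb{R}^m$. We will be interested in subspaces of $C^0(\mathbb{R}^n, \mathbb{R}^m)$, and we can define one by choosing a map $\mu : \mathbb{R}^k \to C^0(\mathbb{R}^n, \mathbb{R}^m)$. This $\mathbb{R}^k$ is often called *parameter space*, and we will use the symbol $P$ for it.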

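The canonical example of this is linear regression. In the case where the inputs and outputs are one-dimensional ($n = 1$ and $m = 1$), we define the parameter space $P = \mathbb{R}^2$, and we’ll let $(a, b)$ be the coordinates of $P$. Define the map $\mu$ to be the function that takes $(a, b)$ to the function $f(x) = ax + b$, and use the difference-squared cost I mentioned above.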

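Now, given a training set $D_T$ and an evaluation set $D_E$, we get a cost function $c_{D_T}$ on the space of continuous functions, which we can pull back to a function $c_{D_T} \circ \mu$ on the parameter space $P$ and choose a point $p \in P$ that minimizes this cost function. To determine how close this particular family of models comes to solving the generalization problem compared to other families of models, we evaluate $c_{D_E}(\mu(p))$.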
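The pull-back construction can be sketched in a few lines of Python for the linear regression family. This is only an illustrative sketch: the function names (`mu`, `pulled_back_cost`, `least_squares_line`) and the toy data are assumptions, and the minimizer is found with the standard closed-form least-squares formula for a line.

```python
def mu(slope, intercept):
    """Map a point of parameter space to the model x -> slope * x + intercept."""
    return lambda x: slope * x + intercept


def dataset_cost(model, dataset):
    """Difference-squared cost of a model on a dataset of (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in dataset)


def pulled_back_cost(params, dataset):
    """The dataset cost viewed as a function on parameter space."""
    return dataset_cost(mu(*params), dataset)


def least_squares_line(dataset):
    """Closed-form minimizer of the pulled-back cost for the line family."""
    n = len(dataset)
    mean_x = sum(x for x, _ in dataset) / n
    mean_y = sum(y for _, y in dataset) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in dataset)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in dataset)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Toy data, invented for illustration only.
training_set = [(0.0, 1.1), (1.0, 2.9), (2.0, 5.1), (3.0, 6.9)]
evaluation_set = [(4.0, 9.0), (5.0, 11.0)]

best = least_squares_line(training_set)                     # minimizing point p
training_cost = pulled_back_cost(best, training_set)        # cost minimized over P
evaluation_cost = pulled_back_cost(best, evaluation_set)    # score for this family
print(best, training_cost, evaluation_cost)
```

The last number is the quantity one would compare across different model families.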

The nice thing about this setup is that it gives us a relatively objective way to evaluate families of models. You can often find a family with a lower minimum for the training cost $c_{D_T}$ by increasing the dimension of the parameter space $P$ and making the set of possible models more flexible. However, if you take this too far this will eventually increase the evaluation cost $c_{D_E}$ for the model that minimizes $c_{D_T}$. This is called the bias/variance tradeoff, and when you go too far it’s called overfitting.
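To make overfitting concrete, here is a small Python sketch comparing a two-parameter line fit with a five-parameter polynomial that passes through every training point. Everything here is an assumption chosen for illustration: the data are a line plus a hand-picked alternating "noise" term, and the flexible model is a Lagrange interpolating polynomial rather than anything the discussion above prescribes.

```python
def dataset_cost(model, dataset):
    """Difference-squared cost of a model on a dataset of (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in dataset)


def line_model(dataset):
    """Least-squares line: a rigid family with two parameters."""
    n = len(dataset)
    mean_x = sum(x for x, _ in dataset) / n
    mean_y = sum(y for _, y in dataset) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in dataset)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in dataset)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def interpolating_model(dataset):
    """Lagrange polynomial through every point: one parameter per data point."""
    def f(x):
        total = 0.0
        for i, (xi, yi) in enumerate(dataset):
            term = yi
            for j, (xj, _) in enumerate(dataset):
                if i != j:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return f


# Training data: y = 2x + 1 plus alternating noise of +/- 0.5 (made up).
training_set = [(0.0, 1.5), (1.0, 2.5), (2.0, 5.5), (3.0, 6.5), (4.0, 9.5)]
# Evaluation data: noise-free points from the same underlying line.
evaluation_set = [(5.0, 11.0), (6.0, 13.0)]

line = line_model(training_set)
poly = interpolating_model(training_set)

line_train = dataset_cost(line, training_set)
poly_train = dataset_cost(poly, training_set)
line_eval = dataset_cost(line, evaluation_set)
poly_eval = dataset_cost(poly, evaluation_set)
print(line_train, poly_train, line_eval, poly_eval)
```

The interpolating polynomial wins on the training set, since it memorizes it, but does far worse than the line on the evaluation set.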

There’s also a question of how you find the minimum in practice. For linear regression the cost function turns out to have a unique critical point which is the global minimum and there’s a closed form for the solution. However, for most model families you have to use an algorithm called *gradient descent* that takes discrete steps in the direction of the negative gradient of the cost function until it reaches a local minimum, which may or may not be global.
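Here is a minimal sketch of gradient descent for the linear regression cost, where the answer can be checked against the exact solution. The step size, the number of steps, the starting point and the toy data are all assumptions chosen so that this small example converges; they are not recommendations.

```python
def cost_gradient(slope, intercept, dataset):
    """Gradient of sum((slope * x + intercept - y)^2) in (slope, intercept)."""
    d_slope = sum(2.0 * (slope * x + intercept - y) * x for x, y in dataset)
    d_intercept = sum(2.0 * (slope * x + intercept - y) for x, y in dataset)
    return d_slope, d_intercept


def gradient_descent(dataset, step_size=0.02, steps=2000):
    """Repeatedly step against the gradient, starting from the origin."""
    slope, intercept = 0.0, 0.0
    for _ in range(steps):
        d_slope, d_intercept = cost_gradient(slope, intercept, dataset)
        slope -= step_size * d_slope
        intercept -= step_size * d_intercept
    return slope, intercept


# Toy data lying exactly on y = 2x + 1, so the global minimum is (2, 1).
training_set = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
slope, intercept = gradient_descent(training_set)
print(slope, intercept)
```

For this convex cost the iteration finds the global minimum. For a more complicated model family the same loop could stall at a local minimum instead.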

So rather than just adding flexibility to a model family, the trick is to add the right kind of flexibility for a given dataset, i.e. in a way that minimizes the bias/variance tradeoff and reduces the number of spurious local minima. And this is where things get interesting from the perspective of geometry/topology since it becomes a question of how to characterize the different ways that a model family can be flexible, and how to connect these to properties of a dataset.

For example, the simplest way to make linear regression more flexible is to replace the line function with a polynomial of a fixed degree. However, this doesn’t turn out to be very practical in many cases because for higher-dimensional $x$, the number of parameters grows very quickly with the degree (a polynomial of degree $d$ in $n$ variables has $\binom{n+d}{d}$ coefficients). So you end up with a lot of flexibility that is either redundant, or isn’t useful for your given dataset. One reason neural networks have become so popular is that they manage to be extremely flexible with relatively few parameters, at least compared to polynomials.
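To get a feel for how fast the polynomial family grows, the following sketch counts coefficients using the standard fact that there are $\binom{n+d}{d}$ monomials of degree at most $d$ in $n$ variables. The function name and the sample dimensions are assumptions for illustration.

```python
from math import comb


def polynomial_parameter_count(num_variables, degree):
    """Number of coefficients of a polynomial of degree <= `degree`
    in `num_variables` variables (one coefficient per monomial)."""
    return comb(num_variables + degree, degree)


# Tabulate a few input dimensions and degrees.
for num_variables in (1, 10, 100):
    counts = [polynomial_parameter_count(num_variables, d) for d in (1, 2, 3)]
    print(num_variables, counts)
```

With 100 input coordinates, even a cubic polynomial already has well over a hundred thousand parameters.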

In the follow-up to this post, I’ll describe how a neural network defines a family of models, and I’ll outline my recent result about topological constraints on certain of these families.

Today I’d like to tell you about a preprint by Malyutin that shows that two widely believed knot theory conjectures are mutually exclusive!

A. Malyutin,

On the Question of Genericity of Hyperbolic Knots, https://arxiv.org/abs/1612.03368.

**Conjecture 1:** Almost all prime knots are hyperbolic. More precisely, the proportion of hyperbolic knots amongst all prime knots of $n$ or fewer crossings approaches 1 as $n$ approaches $\infty$.

**Conjecture 2:** The crossing number (the minimal number of crossings of a knot diagram of that knot) of a composite knot is not less than that of each of its factors.

Both conjectures have a lot of numerical evidence to support them.

For conjecture 1, just look at the following table, cited by Malyutin from Sloane’s encyclopedia of integer sequences:

Also, many topological objects that are related to knots are generically hyperbolic- e.g. compact surfaces, various classes of 3-manifolds, closures of random braids…

For conjecture 2, two stronger conjectures are widely believed. First, that crossing number should in fact be **additive** with respect to connected sum (so the crossing number of a composite knot should in fact be the **sum** of the crossing numbers of its factors). This has been proven for various classes of knots- alternating knots, adequate knots, torus knots, etc. Secondly, that the crossing number of a satellite knot is not less than that of its companion.
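In symbols, the statements about crossing number can be summarized as follows. The notation is an assumption introduced here for convenience: $c(K)$ is the crossing number of a knot $K$ and $K_1 \# K_2$ is the connected sum.

$$
\begin{aligned}
\text{Conjecture 2:} \quad & c(K_1 \# K_2) \ge \max\{c(K_1),\, c(K_2)\}, \\
\text{Additivity (stronger):} \quad & c(K_1 \# K_2) = c(K_1) + c(K_2), \\
\text{Satellites (stronger):} \quad & c(S) \ge c(K) \ \text{ for a satellite } S \text{ with companion } K.
\end{aligned}
$$

Only the inequality $c(K_1 \# K_2) \le c(K_1) + c(K_2)$ is immediate, since placing minimal diagrams of the two factors side by side and joining them gives a diagram of the connected sum.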

Which of these seemingly-obvious, and widely-believed, conjectures is false?? This is high drama in the making!

I’m excited by a number of new and semi-new papers by Greg Kuperberg and collaborators. From my point of view, the most interesting of all is:

G. Kuperberg,

Algorithmic homeomorphism of 3-manifolds as a corollary of geometrization, http://front.math.ucdavis.edu/1508.06720

This paper contains two results:

1) That Geometrization implies that there exists a recursive algorithm to determine whether two closed oriented 3-manifolds are homeomorphic.

2) Result (1), except with the words “elementary recursive” replacing the words “recursive”.

Result (1) is sort-of a well-known folklore theorem, and is essentially due to Riley and Thurston (with lots of subsections of it obtaining newer fancier proofs in the interim), but no full self-contained proof had appeared for it in one place until now. It’s great to have one- moreover, a proof which uses only the tools that were available in the 1970’s.

Knowing that we have a recursive algorithm, the immediate and important question is the complexity class of the best algorithm. Kuperberg has provided a worst-case bound, but “elementary recursive” is a generous computational class. The real question I think, and one that is asked at the end of the paper, is where exactly the homeomorphism problem falls on the hierarchy of complexity classes.

And whether the corresponding result holds for compact 3-manifolds with boundary, and for non-orientable 3-manifolds.

In “A Mathematician’s Apology”, published in 1940, G. H. Hardy argued that the study of pure mathematics could be justified entirely by its aesthetic value, independent of any applications. (He used the word “apology” in the sense of Plato’s Apology, i.e. a defense.) Of course, Hardy never had to apply for an NSF grant and his relatives probably never asked him why someone would pay him to solve problems without applications.

In the following decades, mathematics helped win the Second World War and send astronauts to the moon. Many mathematicians began to justify their work in abstract research by pointing to examples such as number theory in cryptography, where ideas from abstract mathematics that were developed based on aesthetics proved to be unexpectedly useful for real world problems. In the 1960s, as baby boomers headed off to college and PhD programs struggled to keep up with the need for new faculty members, some mathematicians began to argue that teachers who were involved in active research would be better equipped to teach students how to think mathematically.

But today, now that graduate programs produce more PhDs than can fill the available research and teaching positions, the reality has set in that most mathematics PhD students will not go on to careers that involve teaching, let alone abstract research. Moreover, the economic slowdown that followed the post-war boom has made it harder for governments to justify investments, whether in the form of grants or tenure lines, for research whose value won’t be apparent for decades or even centuries.

So the mathematics community faces a choice: either accept the new reality by cutting back PhD programs or rethink the way that abstract mathematics should fit into society.

In this post, I will argue that by changing the way we justify mathematics research and the ways we think about the role of the research community in the wider world, we can sustain or even increase graduate programs and research funding without changing our core values or the fundamentals of graduate education. I won’t attempt to distinguish between “pure” and “applied” mathematics. The term abstract research is intended to imply both. I will argue three points:

- The background one gets from a graduate degree in abstract mathematics is extremely, and increasingly, valuable in a wide range of non-academic careers, beyond the stereotypical security/military and financial sectors.
- The value of this background comes from time spent working within a large, active, academic community engaged in abstract research and is much greater than the external value of the research itself.
- Embracing this perspective will not cause a massive exodus of mathematicians from academia, but will instead cause an increase in the number and diversity of students entering graduate programs.

This perspective argues that students leaving academia for industry are the most valuable contribution that math PhD programs make to the rest of society. The changes the community would need to make in order to embrace this new perspective are not simple or easy, but they are mostly peripheral. The value of a background in mathematics comes from the way that students currently learn the ideas, research practices and thought processes. The required changes have to do with the way we recognize this value: The ways we talk to students about potential careers, the ways that we approach professional development and the ways that we talk to each other and to non-mathematicians about how our research fits into the rest of the world.

In particular, when we discuss the external value of mathematics research, we should de-emphasize the theorems we prove, and focus on the diversity of perspective that members of the research community bring to non-academic organizations. A great deal of research in the past few years has demonstrated the value of diversity in teams, and while most of the discussion has focused on ethnic and gender diversity, the same principle applies to intellectual diversity. In fact, a major benefit of ethnic and gender diversity is that it’s a proxy for diversity of perspective. Similarly, the perspective that one forms from engaging in mathematics research can be invaluable to a team of mostly non-mathematicians, not because it’s objectively better than any other perspective, but because it’s different.

While it can be hard to pin down exactly what makes a mathematical perspective different, here’s a partial list. Mathematicians are not the only people who can do these things, but engaging in abstract mathematics research trains students to do them well:

- Thinking at and between different levels of abstraction: Understanding how axioms fit together to form lemmas, then theorems, is good practice for understanding other complex systems that are too large to see all at once.
- Boiling systems down to their essentials: Abstracting systems into definitions and axioms requires determining what’s fundamental and what’s peripheral.
- Discovering parallels between unrelated systems: Solving a problem by transforming it into a previously-solved problem works in the real world too.

When combined with a bit of domain knowledge, these skills can be used to translate vague intuition into precise and usable statements, incorporate ideas from a range of perspectives and terminologies into a cohesive system and create a scaffolding that allows a team to reason about a complex system. Acquiring the domain knowledge that makes this possible is non-trivial – at the very least, it requires a number of years of working outside academia – but the mathematical perspective makes it an order of magnitude more powerful.

One can’t form a mathematical perspective from books and lectures alone. It can only come from working on abstract problems within an active research community. For a graduate student, the research problem is a lens that brings all the tools and problems of mathematics into sharp focus. Every conversation with another mathematician becomes a chance to learn how they would approach the problem. Every new idea must be understood well enough to determine whether it can be applied to help solve the problem.

These ideas, from across mathematics, are abstracted from problems in hundreds of other fields, and bring with them artifacts of the thought processes that spawned them. And while many ideas get written into papers and books, the folklore and meta-ideas that surround them are, arguably, much more important. They make the research community a living entity, and while individual mathematicians may turn coffee into theorems, it is the community that turns students into mathematicians.

Meanwhile, students are increasingly aware of the problems with the academic job market. Many promising undergraduates who love the subject never apply to graduate school because they don’t want to become professors or don’t think they have what it takes. Moreover, women and members of underrepresented groups are much more likely to make such a decision because they’re more likely to perceive that the cards are stacked against them. If they never enter graduate school, we never get the chance to convince them that they can make valuable contributions to the mathematics community.

If attitudes change so that an academic career is seen as one of many acceptable career paths that a math PhD can lead to, the research community might lose a few would-be professors, but more importantly, it will gain graduate students who love the subject more than the career path. These students will bring a much broader diversity of background and interests, which will enrich the community with new ideas. Already, many PhD graduates choose non-academic careers late in the process, after they discover how limited their academic career options are. If they can make these decisions sooner, it will benefit them individually and the math community as a whole.

Changing the way the research community thinks about career paths and its relationship with the outside world will not be simple or easy, and a prescription for such change is far beyond the scope of this post. However, these changes won’t require changing the fundamentals of how we create and teach mathematics, since these are the things that make a math research background so valuable. We should help students to learn about non-academic careers and how mathematicians can fit into them. We should not push students into applied math and statistics courses or make them into “data scientists”. The value that a mathematician brings to a non-academic career comes from engaging with the mathematics community on an abstract dissertation problem. Today, the math PhDs who follow non-academic career paths are individually demonstrating that value. All we, as a community, need to do is find better ways to recognize it.

A new book has just come out, and it’s very good.

Office Hours with a Geometric Group Theorist, Edited by Matt Clay & Dan Margalit, 2017.

An undergraduate student walks into the office of a geometric group theorist, curious about the subject and perhaps looking for a senior thesis topic. The researcher pitches their favourite sub-topic to the student in a single “office hour”.

The book collects together 16 independent such “office hours”, plus two introductory office hours by the editors (Matt Clay and Dan Margalit) to get the student off the ground.

Given the number of authors and the variety of concepts that are presented, trying to assemble such a book would seem a recipe for disaster, but the actual result is a resounding success! The level never flies off into the stratosphere and never becomes patronizingly oversimplified – each office hour is at the right level, and the tone remains informal without being wishy-washy. As the researcher is aiming to hook students on their topic, each office hour provides a nice entry point into its topic, with “next steps” mapped out to help the student on their way.

The voice of the researcher is preserved, which is also nice. Aaron Abrams informs us that Thalia’s hair (presumably his daughter) and challah are both braided, and Johanna Mangahas explains the Ping-pong Lemma using ping-pong.

The greatest highlight of the book is perhaps the exercises, which are pitched at a good introductory level and help the student wrap their brains around the topic.

I think that the book isn’t only a collection of good lead-ins for undergraduates- these “office hours” are equally useful for graduate students and for mathematicians who don’t happen to specialize in those fields, but who want to sightsee some key ideas quickly.

I very highly recommend it!!

Mirror symmetry is a physical idea that relates two classes of problems:

- **A-Model:** Measurement of a “volume” of a moduli space. In particular, counting the number of points of a moduli space that is a finite set of points.
- **B-Model:** Computation of matrix integrals.

We may think of the A-model as “combinatorics and geometry” and of the B-model as “complex analysis”. Why might relating these classes of problems be important?

- Mirror symmetry might help us to compute a quantity of interest that we would not otherwise know how to compute. Sometimes enumeration may be simpler (e.g. the Argument Principle) and sometimes complex analysis may be simpler (when integrating by parts is easier than counting bijections).
- An object in one model may readily admit an interpretation, whereas its mirror dual’s meaning may be a mystery. This is the case in quantum topology- quantum invariants, which live on the B-model side, are powerful, but their topological meaning is a mystery. On the other hand, the A-model invariants (hyperbolic volume, A-polynomial) have readily understood geometric/topological meaning.

Mirror symmetry (as currently understood) doesn’t in fact directly solve either problem, but it does provide heuristics. There is no known formula to compute the mirror dual problem to a given problem- mirror duals in mathematics have tended to be noticed post facto. Mirror symmetry is also not mathematically rigorous, so each prediction of mirror symmetry must be carefully analyzed and proven. In addition, the mathematical meaning of mirror symmetry is unclear.

Despite this, quantum topology has received a number of Fields medals for work in and around mirror symmetry, including Jones (1990), Witten (1990), Kontsevich (1998), and Mirzakhani (2014). Several of our most celebrated conjectures, such as the AJ conjecture relating a quantum invariant to a classical invariant, stem from it.

Topological recursion observes that all known B-model duals of A-model problems can be framed in a common way (a holomorphic Lagrangian immersion of an open Riemann surface in the cotangent bundle with some extra structure). This was observed first in special cases, and then it was noticed that the picture generalizes. Topological recursion thus reveals a common framework to all known mathematical examples of mirror symmetry. This simplifies B-model duals to A-model problems and places them in a common framework (a priori they are complex integrals with a lot of variables without much else in common). It also provides tools to prove mirror duality in special cases. Explicitly, all of the information of an a priori complicated mirror dual can be recovered from an embedded open Riemann surface (plus some extra structure), whose information is again encapsulated via an explicit formula in information in lower genus surfaces. Together with Mariño’s Remodeling Conjecture, we can say that topological recursion “tidies up” the B-model side of mirror symmetry, and elucidates what it means for something to be a “B-model dual” to an “A-model problem”.

One insight which topological recursion provides is that many of the simplest cases of mirror symmetry are Laplace transforms. Perhaps this is a window to understanding mirror symmetry itself? A vague conjecture along the lines of “in some contexts, mirror symmetry and the Laplace transform are the same thing in disguise” is given by Dumitrescu, Mulase, Safnuk, and Sorkin. For quantum topologists, another insight provided by topological recursion is that it suggests ways of reframing our favourite quantum invariants, such as the Jones polynomial, as objects which have more ready topological meaning, such as tau functions of integrable systems.

So, in conclusion, topological recursion provides a common framework for B-model objects such as quantum invariants. The hope is that this will elucidate their meaning and facilitate proving their mirror duality to better-understood mathematical objects. It does this by tidying up the B-model side into something structured which begins to look tractable.

Topological recursion has already led to several breakthroughs, including the simplest known proof of Witten’s conjecture and of Mirzakhani’s recurrence, and the subject is still in its infancy. It fits well with what we know by recovering all the “right” invariants at low orders (hyperbolic volume, analytic torsion) and hitting some heuristically expected keywords (e.g. ). Topological recursion is white-hot at the moment.

Disclaimer: I’m not an expert and some things I said might be wrong- please correct mistakes, inaccuracies, and omissions in the comments!

Step 1: Diff(S^2) has the homotopy-type of O_3 x Diff(D^2,S^1). The latter object here is the group of diffeomorphisms of the 2-disc which are the identity on the boundary.

Step 2: Show Diff(D^2,S^1) is contractible.

Step 1 is a general argument, that Diff(S^n) has the homotopy-type of O_{n+1} times Diff(D^n, S^{n-1}), the proof of which is very much in the spirit of the isotopy extension theorem, and the classification of tubular neighbourhood theorem, but `with parameters’.

Step 2 is a rather specific argument which, at its core, involves the meatiest theorem in our understanding of first-order ODEs in the plane: the Poincaré-Bendixson theorem. His clever application of the Poincaré-Bendixson theorem allows him to reduce the proof to the theorem that Diff(D^1,S^0) is contractible, which has many simple and elegant proofs.
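Written out as a chain of homotopy equivalences, the two steps combine as follows. This is just a restatement of the steps above in LaTeX, with $\simeq$ denoting homotopy equivalence and $\ast$ a point.

$$
\mathrm{Diff}(S^2) \;\simeq\; O_3 \times \mathrm{Diff}(D^2, S^1)
\quad\text{and}\quad
\mathrm{Diff}(D^2, S^1) \;\simeq\; \ast
\quad\Longrightarrow\quad
\mathrm{Diff}(S^2) \;\simeq\; O_3 .
$$

The general form of Step 1 reads $\mathrm{Diff}(S^n) \simeq O_{n+1} \times \mathrm{Diff}(D^n, S^{n-1})$, so the whole difficulty in each dimension sits in the second factor.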

Smale’s proof has a bit of the spirit of an inductive proof. It leads one to the question, what about the homotopy-type of Diff(S^3)? Perhaps because we can’t imagine anything different, it would make sense for Diff(S^3) to have the homotopy-type of O_4. At the level of path-components this was proven by Cerf in 1968, in one of the first applications of the subject now called Cerf Theory. The full proof by Allen Hatcher was given in 1983. Around this time the problem of showing Diff(S^3) has the homotopy-type of O_4 began to be called “The Smale Conjecture”.

I think it’s fair to say that most major theorems in 3-manifold theory at present have several different proofs (the classification of Seifert-fibred manifolds with 3 singular fibres over S^2 might be one of the few cases where there is only one proof), or at least, several variations on one proof. But the Smale Conjecture has found no alternate proofs. People have hoped that perhaps a ‘geometrization with parameters’ theorem could be used on the space of all metrics on S^3, but the metric collapses along families of spheres. This is much like the difficulties Hatcher encounters in his original proof, but Hatcher was just dealing with families of manifolds, while a geometrization proof would be in a category of Riemannian manifolds.

Hatcher suggested a possible alternative framework to prove the Smale Conjecture. The idea is to show that the component of the trivial knot, in the space of smooth embeddings Emb(S^1, S^3), has the homotopy-type of the subspace of great circles. Hatcher gave a few other equivalent formulations of the Smale Conjecture; the one he used makes the Smale Conjecture look like ‘the Alexander Theorem with parameters’, i.e. that the space of smooth embeddings Emb(S^2, R^3) has the homotopy-type of the subspace of (parametrized) round spheres. Hatcher’s proof is essentially a souped-up version of Alexander’s proof; roughly speaking it involves a rather careful cutting of families of spheres into simpler families.

The embedding space Emb(S^1, S^3) has been studied in many ways over the years. Jun O’Hara had the idea of putting a “potential function” on this space, much in the spirit of Morse theory. He used a function derived (in spirit) from electrostatics. Imagine the knot as carrying a uniform electric charge along its length and write down the integral for the potential energy of the system. Technically O’Hara allowed for less physically inspired “energies” but this is the basic idea. In the 80’s and 90’s it was proven that for O’Hara’s potential function flow in the negative gradient direction makes sense, and that there are local minimizers in the space. Recently it was proven that a C^1 embedding which is a critical point of this energy functional, is necessarily a C^\infty smooth embedding. So there has been plenty of progress.
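For concreteness, the energy usually meant here is the Möbius energy studied by Freedman, He and Wang. For a parametrized knot $\gamma$ it is commonly written as below; the notation is an assumption introduced for this display, with $d_\gamma(s,t)$ denoting the shorter arc length along the knot between $\gamma(s)$ and $\gamma(t)$.

$$
E(\gamma) \;=\; \iint \left( \frac{1}{|\gamma(s) - \gamma(t)|^{2}} \;-\; \frac{1}{d_\gamma(s,t)^{2}} \right) |\gamma'(s)|\,|\gamma'(t)| \; ds\, dt .
$$

The first term is the inverse-square repulsion between pairs of points on the knot. The second term subtracts the divergence that would otherwise come from nearby points along the curve, so that the integral is finite for smooth embeddings.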

Of course, what one would really want to prove is that the only critical points of this functional on the component of the trivial knot are the great circles themselves. That would allow for a Morse-theoretic argument that the unknot component of Emb(S^1, S^3) has the homotopy-type of the great-circle subspace, and give a new, rather appealing proof of the Smale conjecture.

References:

O’Hara. Energy of a knot, Topology, 30 (2): 241–247

Freedman, He, Wang. Möbius energy of knots and unknots, Annals of Mathematics, Second Series, 139 (1): 1–50

He. The Euler-Lagrange equation and heat flow for the Möbius energy. Communications in Pure and Applied Mathematics.

Blatt, Reiter, Schikorra. Harmonic Analysis Meets Critical Knots. TAMS Vol 368, no 9, Sept 2016, pg 6391–6438


We’re looking toward developing applications, so we’re primarily searching for people who can program and maybe who have some signal processing knowledge. So primarily for computer science postdocs, I suppose.

An official announcement will be posted at relevant places in due time- but you heard it here first (^_^)
