# Low Dimensional Topology

## February 28, 2016

### Nonlocality and statistical inference

Filed under: Uncategorized — dmoskovich @ 2:02 pm

It doesn’t have much to do with topology, but I’d like to share with you something Avishy Carmi and I have been thinking about quite a bit lately, namely the EPR paradox and the meaning of (non)locality. Avishy and I have a preprint about this:

A.Y. Carmi and D.M., Statistics Limits Nonlocality, arXiv:1507.07514.

It offers a statistical explanation for a Physics inequality called Tsirelson’s bound (perhaps to be compared to a known explanation called Information Causality). Behind the fold I will sketch how it works.

## Information about a parameter through a binary channel

A binary channel is a pair of Bernoulli ($0$ and $1$ valued) random variables $X$ and $Y$ representing input and output together with a conditional probability function $\mathrm{Pr}\left(\left.\rule{0pt}{13pt}X=Y\, \right| X\right)$ representing noise. A channel is typically described by telling a story about how $Y$ is constructed from $X$ and some additional random resources; but mathematically it’s really just the conditional probability function.

Usually it is a realization of $X$, i.e. a zero or a one, that is the message we would like to send through the channel; the random variables of a channel usually represent distributions of a single realization. But I’d like to consider a different setting, in which the message is all of $X$. In other words, the message is the real number $\theta_X= E[X]$. The parameter $\theta_X$ contains an infinite amount of information (all the digits of its binary expansion, for instance), as opposed to a single sample, whose content is one bit. So Bob’s “task” is to estimate the parameter $\theta_X$ to the best of his ability. To do this, he is allowed to sample $Y$ a predetermined number of times.
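To make Bob’s task concrete, here is a minimal Python sketch (the function names are mine, not from the preprint): $X$ is $\{0,1\}$-valued, the channel preserves each bit with a known probability, and Bob inverts the channel’s mean to estimate $\theta_X$.

```python
import random

def channel_sample(theta_x, p_agree):
    """Draw X ~ Bernoulli(theta_x) and pass it through a binary symmetric
    channel that preserves the bit with probability p_agree = Pr(X = Y | X)."""
    x = 1 if random.random() < theta_x else 0
    flipped = random.random() >= p_agree
    return x ^ flipped

def estimate_theta(theta_x, p_agree, n_samples):
    """Bob's estimator: average the Y samples, then invert the known channel
    noise using E[Y] = (1 - p_agree) + theta_x * (2*p_agree - 1).
    Breaks down exactly when p_agree = 1/2 (Case 1: no channel)."""
    mean_y = sum(channel_sample(theta_x, p_agree)
                 for _ in range(n_samples)) / n_samples
    return (mean_y - (1 - p_agree)) / (2 * p_agree - 1)

random.seed(0)
print(estimate_theta(0.3, 0.9, 100000))  # close to the true theta_x = 0.3
```

Note that the estimator divides by $2p-1$, so it is undefined at $p=1/2$: when there is no channel, no amount of sampling $Y$ helps.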

I would like to partition what may happen into three (realistic) cases:

1. There is no channel between $X$ and $Y$ because $\mathrm{Pr}\left(\left.\rule{0pt}{13pt}X=Y\, \right| X\right)= 1/2$. A fortiori, a finite number of samples of $Y$ tell us nothing about $X$.
2. There is a channel between $X$ and $Y$, i.e. $\mathrm{Pr}\left(\left.\rule{0pt}{13pt}X=Y\, \right| X\right)\neq 1/2$, but what is being broadcast through the channel cannot be distinguished from noise. More precisely, consider the Fisher information, a quantity measuring how much samples of a random variable tell us about a parameter. It measures this via the Cramér-Rao Theorem, which tells us that the variance of any estimate of $\theta_X$ which Bob can construct from the information at his disposal is bounded from below by one over the Fisher information. Our $\{0,1\}$-valued random variables have variance bounded above by $1/4$ (the variance of a Bernoulli random variable is $E[X]-E[X]^2$, whose maximum is at $E[X]=0.5$), therefore a Fisher information of under $4$ is “no information”: Bob would learn just as much about Alice’s variable by tossing a fair coin as he would by listening to the output of the channel.
3. Alice and Bob are communicating!
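The Fisher information argument in Case 2 can be checked numerically. Below is a small sketch (names are mine; $Y$ is the channel output, a Bernoulli variable whose mean depends on $\theta_X$):

```python
def fisher_info_channel(theta, p_agree):
    """Fisher information about theta carried by one output sample Y of a
    binary symmetric channel: Y ~ Bernoulli(q) with
    q = (1 - p_agree) + theta * (2*p_agree - 1), so
    I(theta) = (dq/dtheta)^2 / (q * (1 - q))."""
    q = (1 - p_agree) + theta * (2 * p_agree - 1)
    return (2 * p_agree - 1) ** 2 / (q * (1 - q))

# By Cramer-Rao, any unbiased estimate from one sample has variance >= 1/I.
# The information vanishes as the channel approaches pure noise (p = 1/2):
for p in (0.5, 0.6, 0.9, 1.0):
    print(p, fisher_info_channel(0.5, p))
```

At $p=1/2$ the information is exactly zero (Case 1), and it grows as the channel gets cleaner.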

I would like to draw your attention to the second case, in which there is a channel but the information broadcast through the channel is indistinguishable from noise. The situation is analogous to a long game of Chinese whispers, in which one person whispers a message to another until the final person announces the message to the entire group. A massive such game played in 2012 resulted in “Alice’s” message “Life must be lived as play” (a paraphrase of a quote from Plato) being relayed to “Bob” as “He bites snails”. In a long enough game, with probability one, Bob will receive only noise despite a channel undeniably existing.
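The whispers analogy can be made quantitative: cascading $n$ binary symmetric channels, each preserving its input bit with probability $(1+c)/2$, gives a composite channel with agreement probability $(1+c^n)/2$, which tends to pure noise whenever $\left\vert c\right\vert<1$. A quick simulation sketch (names are mine):

```python
import random

def noisy_relay(x, c, n):
    """Whisper bit x down a chain of n binary symmetric channels, each of
    which preserves its input with probability (1 + c)/2. The composite
    agreement probability works out to (1 + c**n)/2."""
    y = x
    for _ in range(n):
        if random.random() >= (1 + c) / 2:
            y ^= 1
    return y

random.seed(2)
c, n, trials = 0.8, 5, 200000
agree = sum(noisy_relay(0, c, n) == 0 for _ in range(trials)) / trials
print(agree)  # close to (1 + 0.8**5)/2, about 0.664
```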

In a certain context, Physicists refer to Case I (nonexistence of a channel) as “Locality”, in that Alice and Bob are effectively isolated from one another. But I think that Case II is also “Locality” according to my intuitive understanding of the term. If a tree falls in a forest and no one is around to hear, does it make a sound? If a sample of $Y$ cannot be used to analyze $X$, in what sense is it paradoxical that $X$ and $Y$ are dependent?

But the word “Locality” is taken to refer to Case I, therefore I’ll refer to Cases I and II together (in the physical context I’m just about to describe) as “Information Locality”.

## Physical context

In Newtonian Mechanics an object can only be in one place at one time. An arresting feature of Quantum Mechanics is that there is a sense in which an object can be located in two places at once. More precisely:

Nonlocality: A pair of quantum systems which are shown not to be physically interacting may be impossible to describe as independent entities.

Such a pair of inseparable quantum systems must perforce be described as a single system which is in two places at once. The archetype of nonlocality is a pair of distant agents, Alice and Bob, each of whom holds one half of a singlet. A measurement performed on Alice’s particle appears to have an instantaneous effect on Bob’s particle and vice versa. The strength of this perceived effect is quantified by a real number $-1\leq c \leq 1$ called the Bell-CHSH correlation. If $\left \vert c\right\vert\leq 0.5$ (“Bell’s Inequality”) then we are in a local setting: Alice’s system may be fully described independently of Bob’s system, and these two systems fully describe the joint system. Bell’s Theorem tells us that Alice and Bob’s halves can no longer be described as independent entities governed exclusively by local influences when $\left\vert c\right\vert$ exceeds $0.5$.

Bell’s Theorem is proved using only Probability Theory, and as such it is independent of the functional analysis formalism of Quantum Mechanics. Why is this important? Besides aesthetic considerations, reliance on Probability Theory alone is valuable in the context of the search for a Grand Unified Theory to unify Quantum Mechanics with General Relativity. The mathematical formalism of Quantum Mechanics (functional analysis) is different from the mathematical formalism of General Relativity (differential geometry), so we would expect a grand unified theory to be described by a mathematical formalism which envelopes both of these formalisms and more; in particular, we would not expect it to be based on functional analysis.

Bell’s Inequality is indeed violated experimentally. Nonlocality is real. Newtonian mechanics alone cannot describe the quantum world.

How large can $c$ be?

Within the Hilbert-space formalism of quantum mechanics, Tsirelson showed that $\left \vert c\right\vert\leq 1/\sqrt{2}$. Tsirelson’s bound is supported experimentally.

We would love to understand Tsirelson’s bound in a broader context (e.g. probability or statistics), so that the same upper bound on $\left \vert c\right\vert$ continues to hold if and when the functional analytic formalism of Quantum Mechanics is replaced by a more abstract language.

## Channels from Bell experiments

The basic building block of the so-called “context-free approach” to nonlocality is a pair of boxes, one held by Alice and one by Bob. These boxes abstract the notion of entangled particles. Into each box you can insert either a zero or a one, and the box responds by instantaneously spitting out either a zero or a one. Call Alice’s box input $a$ and her box output $A$, and call Bob’s box input $b$ and his box output $B$. We assume various marginals such as $A|a$ and $(A+B)|ab$ to be random variables.

The Bell-CHSH correlation $c$ is now defined as the conditional probability

$(1+c)/2 \stackrel{\textup{\tiny def}}{=} \mathrm{Pr}\left(\left.\rule{0pt}{13pt}(ab=A+B)\, \right| ab\right) \enspace .$

Thus, $c$ defines a binary channel from $ab$ to $A+B$. Addition and multiplication are modulo $2$. This is kind of weird but also kind of cool: the channel in the Bell-CHSH setting isn’t between its “Alice” and its “Bob”, but rather between the product of Alice and Bob’s inputs and the sum of their outputs.
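Here is a toy sampler for a single pair of boxes, assuming only the defining property above: the outputs satisfy $A+B=ab \pmod 2$ with probability $(1+c)/2$ independently of the inputs, with locally uniform marginals. This is a mathematical toy, not a claim about any physical realization, and the names are mine:

```python
import random

def box_pair(a, b, c):
    """Sample outputs (A, B) whose sum mod 2 equals a*b mod 2 with
    probability (1 + c)/2, independently of the inputs; each output
    separately looks like a uniform bit, so neither box signals."""
    target = (a * b) % 2
    s = target if random.random() < (1 + c) / 2 else 1 - target
    A = random.randint(0, 1)  # Alice's output: locally uniform
    B = (s - A) % 2           # Bob's output fixes the sum to s
    return A, B

random.seed(1)
trials = 200000
hits = 0
for _ in range(trials):
    a, b = random.randint(0, 1), random.randint(0, 1)
    A, B = box_pair(a, b, 0.7)
    hits += ((A + B) % 2 == (a * b) % 2)
print(hits / trials)  # close to (1 + 0.7)/2 = 0.85
```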

Having noted that $c$ can be used to define a channel, we can generalize to the case of multiple boxes. Each one of Alice’s boxes has a corresponding box on Bob’s side, and the coordination strength between the pair of boxes is quantified by $c$.

The classical protocol for multiple boxes is called an “oblivious transfer” and is detailed in a paper by van Dam. Alice and Bob each hold in front of them an infinite family of boxes, such that each box of Alice’s is correlated with a box of Bob’s with Bell-CHSH parameter $c$ (the same $c$ for each matching pair of boxes). Alice holds an information source which is a Bernoulli ($\pm 1$-valued) random variable $X$ with mean $\theta_X$. We imagine $\theta_X\in [-1,1]$ as encoding a message, perhaps in the digits of its binary expansion (because it’s a real number, it contains infinite information). Alice independently samples $m$ values $\mathcal{A}\stackrel{\textup{\tiny def}}{=}\left\{x_0,x_1,\ldots,x_{m-1}\right\}$ from $X$ (the interesting case is in the limit $m\rightarrow \infty$).

We specialize to the case $m=2^n$. Using the oblivious transfer protocol, which takes advantage of the full power of Alice and Bob’s boxes, we compress $\mathcal{A}$ into a single bit $x^{(n)}$ which Alice sends through a channel to Bob, who receives it as $y^{(n)}$. Using his boxes, Bob decompresses the bit he receives into $\mathcal{B}\stackrel{\textup{\tiny def}}{=} \left\{y_0,y_1,\ldots,y_{m-1}\right\}$, which are also independent identically distributed (iid) and which we may consider as realizations of a Bernoulli random variable $Y$ whose mean is their sample average (the variable $Y$ depends on $n$, but we suppress this from the notation). We now have a noisy channel with input $X$ and output $Y$.

## Results

Almost all of our actual work was figuring out the reformulation above: with everything well-defined and phrased in terms of channels, the computations are routine.

A quick computation shows that $\mathrm{Pr}\left(\left.\rule{0pt}{13pt}X=Y\, \right| X\right)= (1+c^n)/2$. Thus the channel between $X$ and $Y$ disconnects in the $n\rightarrow \infty$ limit. On the other hand, the Fisher information about $\theta_X$ in $\mathcal{B}$ is computed to be $\mathcal{I}_\mathcal{B}(\theta_X)=\frac{\left(2c^2\right)^n}{1 - c^{2n}\theta_X^2}$. This term stays between zero and one (one instead of $4$ as above, because our random variables are $\{\pm 1\}$-valued and so have variance bounded above by $1$) for all $n\in \mathbb{N}$ only when Tsirelson’s bound holds.

[Note that we’re assuming that $\theta_X\neq \pm 1$ in the above formulae, which is essentially a technicality.]
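The boundedness claim can be checked numerically straight from the formula above (a sketch; names are mine). Below Tsirelson’s bound we have $2c^2<1$ and the numerator decays; above it, $2c^2>1$ and the expression blows up:

```python
import math

def fisher_info(c, theta, n):
    """The post's expression (2c^2)^n / (1 - c^(2n) * theta^2) for the
    Fisher information about theta_X in Bob's decompressed samples."""
    return (2 * c**2) ** n / (1 - c ** (2 * n) * theta**2)

theta = 0.3
for c in (0.5, 1 / math.sqrt(2), 0.75):
    vals = [fisher_info(c, theta, n) for n in range(1, 40)]
    print(round(c, 4), "max over n:", round(max(vals), 3))
```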

In other words, within the context of the oblivious transfer protocol, Tsirelson’s bound is interpreted as a necessary and sufficient condition for information locality. This interprets Tsirelson’s bound entirely in terms of statistics.

Note that if Tsirelson’s bound were violated, then in the $n\rightarrow \infty$ limit our 3-way division would acquire a strange “Case IV” (which is actually Case I and Case III at once). Namely, in this limit the channel would disconnect, but Bob would nevertheless receive full information about $\theta_X$. We suggest that such a case ought not to occur in the real world.

## A concrete thought experiment

Informally, our result states that, in the $n\rightarrow\infty$ limit, Bob may infer nontrivial information about $\theta_X$ if and only if Tsirelson’s bound is violated. A thought experiment which sharpens this point (but which isn’t in the present version of our preprint) is presented below.

Let’s consider the special case in which either $\theta_X=0$ (Alice samples $X$ by flipping a fair coin) or $\theta_X=1$ (Alice’s samples are all $1$), and Bob’s task is to determine which of these is the case. Is Alice sending random bits, or are her samples all the same? Say that the reality is $\theta_X=1$. Bob’s null-hypothesis $H_0$ is that $\theta_X=0$ and his alternative hypothesis $H_1$ is that $\theta_X=1$. He conducts the likelihood ratio test in an attempt to rule out the null-hypothesis. So he computes the likelihood of the null hypothesis divided by the likelihood of the alternative hypothesis: if the ratio is zero then the test succeeds, and if it is one then the test fails.

A computation shows that the likelihood ratio in this case is asymptotically $2^{-(2c^2)^n}$, so that the test succeeds in the $n\rightarrow\infty$ limit (and Bob can infer the value of $\theta_X$) if and only if Tsirelson’s bound is violated (and remember that the channel is disconnected in this case).
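A sketch of the asymptotics (names mine), working on a $\log_2$ scale to avoid underflow: the ratio $2^{-(2c^2)^n}$ tends to $1$ (test fails) when $2c^2<1$ and to $0$ (test succeeds) when $2c^2>1$.

```python
import math

def log2_likelihood_ratio(c, n):
    """log2 of the asymptotic likelihood ratio 2**(-(2*c**2)**n)."""
    return -((2 * c**2) ** n)

for c in (0.6, 1 / math.sqrt(2), 0.8):
    print(round(c, 4),
          [round(log2_likelihood_ratio(c, n), 3) for n in (1, 5, 20)])
```

For $c=0.6$ the log-ratio drifts back toward $0$ (ratio $\to 1$), while for $c=0.8$ it plunges (ratio $\to 0$).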

## Relationship to Information Causality

Our result is related to a known criterion called Information Causality. Is it sufficiently novel compared to this criterion? I can’t express an opinion… “sufficiency of novelty” isn’t well-defined. Below I describe the relationship of our work with Information Causality.

The original paper which formulated Tsirelson’s bound in terms of information was Information Causality as a Physical Principle by Pawlowski, Paterek, Kaszlikowski, Scarani, Winter, and Zukowski. In the same context as the one we work in, that paper formulates a principle called Information Causality, which roughly states that the maximum information Bob can have about Alice’s bits is $1$ because he was only sent one bit by Alice (the bit $x^{(n)}$). So Bob can infer at most the amount of information in bits that Alice actually classically sent him. Nonlocality cannot be used to construct a superluminal telegraph.

Here is a rough formulation of Information Causality:

Information Causality: The amount of information potentially available to Bob about Alice’s bits is bounded above by the number of bits Alice sends to Bob through a classical channel.

And here, beside it, is a formulation of the principle we would like to suggest instead.

Statistical No-Signaling: No information can pass through a channel whose output is independent of its input (Case I).

This is equivalent to Tsirelson’s bound via the $n\rightarrow\infty$ case of our experiment described above. Namely, the channel correlation $c^n$ converges to zero (so the channel disconnects at infinity) while the Fisher information stays bounded (in fact stays between zero and one) if and only if Tsirelson’s bound holds.

The “Information Causality quantity” is the mutual information of $\mathcal{A}$ and $\mathcal{B}$. The result of that paper is that violation of Tsirelson’s bound allows violation of Information Causality. But not the converse: they cannot prove that violation of Information Causality implies violation of Tsirelson’s bound.

Concretely, the information causality quantity is less than or equal to a term which we interpret as Fisher information, which is less than or equal to $1$ if and only if Tsirelson’s bound holds.

(This chain of inequalities appears in the appendix to their paper.)

In what sense, then, is our result more than a trivial restatement of Information Causality? Well, first of all, it’s technically a different mathematical result. It implies information causality (in the context of the protocols we both consider, anyway) but is not implied by it in any obvious way.

Perhaps I can argue that Statistical No-Signaling is more fundamental. Information Causality involves a whole Alice-and-Bob story (and I’m not sure how to formulate it rigorously mathematically), whereas Statistical No-Signaling is a general statistical statement: if $X$ and $Y$ are independent random variables then you cannot learn $X$ by sampling $Y$ a countable number of times (in oblivious transfer, $Y$ is constructed via a limiting construction). Mathematically you could (via oblivious transfer, if Tsirelson’s bound were violated), but the statement is that Physically you can’t. It suggests that although we use words like Locality and No-Signaling, perhaps they shouldn’t mean what we think they mean.

Our “story” is different also. We’ve interpreted the intermediate term as Fisher information, so that the task we are discussing is a statistical inference task as opposed to measuring mutual information between two strings. So, for us, Tsirelson’s bound is related to the Central Limit Theorem, by which we can characterize the convergence of the sample mean of $\mathcal{B}$ to $\theta_X$ as $m$ grows. Fundamental physics relates to fundamental statistics. Aesthetically, I like that.

Because the CLT is so mathematically fundamental, many criteria can be formulated that will follow from it. Actually, I think that Statistical No-Signaling might philosophically be closer to Macroscopic Locality than to Information Causality, because we’re saying that a physical system becomes “local” as the number of boxes grows to infinity. I don’t know how to rigorously derive Macroscopic Locality from our work, though.

Still on the “story” front, Fisher information gives us the 3-way division discussed at the beginning of this post, interpreting the three relevant ranges of the Bell-CHSH parameter $c$. If $\left\vert c\right\vert\leq 1/2$ there is no channel. If $1/2< \left\vert c\right\vert\leq 1/\sqrt{2}$ there is a channel, but its output is indistinguishable from noise. If $1/\sqrt{2}< \left\vert c\right\vert$ then there is communication through a channel. Information Causality doesn’t interpret the different ranges in any meaningful way, to the best of my understanding.
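As a trivial but concrete summary (names mine), the three ranges of $c$:

```python
import math

def regime(c):
    """Classify the Bell-CHSH parameter c into the post's three ranges."""
    if abs(c) <= 0.5:
        return "no channel (local)"
    if abs(c) <= 1 / math.sqrt(2):
        return "channel, but output indistinguishable from noise"
    return "communication through a channel"

for c in (0.3, 0.6, 0.9):
    print(c, regime(c))
```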

In addition, functionally, we have a thought experiment with a binary outcome which succeeds if and only if Tsirelson’s bound is violated, and I’m not sure how you could do that with Information Causality.

## Why I care

So why would I (D.M.) be looking at physics-y things like this?

Well, personally I think that the fundamental laws of nature ought to be distributive and nonassociative (this is an irrational bias, I know). One thing that this implies to me is that we should attempt to work as much as possible at the level of measures of information (e.g. Fisher information) rather than working in terms of vectors, functions, strings, and so on. We’ve worked on understanding information flow in a low dimensional topological context in previous work, such as arXiv:1409.5505.

I’d like to suggest the philosophical idea that joint distributions are largely a fiction. We can’t meaningfully speak about joint distributions of distant objects we can’t instantaneously compare. But marginals, i.e. conditional probabilities, are real. Nonlocality is a setting in which that is how things work. Estimators based on conditional probabilities behave like quandle elements (that’s in arXiv:1409.5505), so my dream is to link this all up to low-dimensional topology. But I’m not there yet.

## Final word

If you want to know more, please read the preprint. Feedback is welcome!!