These are notes by John Baez, Tobias Fritz and Tom Leinster. The idea is to develop a deeper understanding of entropy, free energy, the partition function and related ideas in probability theory and statistical mechanics using the tools of modern algebra: categories, monads, algebraic theories, operads and the like.
Some of these notes were turned into a paper:
This paper was developed in a series of blog conversations, in this order:
John Baez, Category-theoretic characterizations of entropy.
John Baez, Category-theoretic characterizations of entropy II.
Tom Leinster, Entropies vs. means.
Tom Leinster, An operadic introduction to entropy.
John Baez, A characterization of entropy.
The paper was written up on the nLab here:
and there is also a draft of a paper that never got finished, here:
In our paper A characterization of entropy in terms of information loss we gave a characterization of Shannon and more generally Tsallis entropy, not just for probability measures on finite sets, but for arbitrary (nonnegative) measures on these sets. This characterization uses a certain category:
Let $\Fin\Meas$ be the category whose objects are finite sets equipped with measures and whose morphisms are measure-preserving functions.
We noted that one can take arbitrary nonnegative linear combinations of objects and morphisms in $\Fin\Meas$. These can be built up from direct sums and scalar multiplication by nonnegative real numbers, defined as follows.
For ‘direct sums’, first note that the disjoint union of two finite sets equipped with measures is another thing of the same sort. We write the disjoint union of $p, q \in \Fin\Meas$ as $p \oplus q$. Then, given morphisms $f : p \to p'$, $g : q \to q'$ there is a unique morphism $f \oplus g : p \oplus q \to p' \oplus q'$ that restricts to $f$ on the measure space $p$ and $g$ on the measure space $q$.
For ‘scalar multiplication’, first note that we can multiply a measure by a nonnegative real number and get a new measure. So, given an object $p \in \Fin\Meas$ and a number $\lambda \ge 0$ we obtain an object $\lambda p \in \Fin\Meas$ with the same underlying set and with $(\lambda p)_i = \lambda p_i$. Then, given a morphism $f : p \to q$, there is a unique morphism $\lambda f : \lambda p \to \lambda q$ that has the same underlying function as $f$.
Tsallis entropy can be seen as the unique invariant of morphisms in $\Fin\Meas$ that is additive under composition, additive under direct sums, homogeneous with respect to scalar multiplication and also continuous:
Let $\alpha \in (0, \infty)$. Suppose $F$ is any map sending morphisms in $\Fin\Meas$ to numbers in $[0,\infty)$, and obeying these four properties:
Functoriality:
whenever $f,g$ are composable morphisms.
Additivity:
for all morphisms $f,g$.
Homogeneity of degree $\alpha$:
for all morphisms $f$ and all $\lambda \in [0,\infty)$.
Continuity: $F$ is continuous.
Then there exists a constant $c \ge 0$ such that for any morphism $f : p \to q$ in $\Fin\Meas$,
where $H_\alpha$ is the Tsallis entropy of order $\alpha$. Conversely, for any constant $c \ge 0$, this formula determines a map $F$ obeying conditions 1-4.
In the simplest case, namely homogeneity of degree $\alpha = 1$, the Tsallis entropy $H_\alpha$ reduces to the Shannon entropy.
In fact, such an $F$ really defines a functor from $\Fin\Meas$ to a category with one object and with nonnegative real numbers as morphisms. So we now go ahead and formulate our results in the language of category theory. For this we need to pick a way to view the nonnegative real numbers as a category:
Let $\mathbb{R}_+$ be the category with a single object $\ast$ and $[0,\infty)$ as the set of morphisms from this object to itself, with addition as composition.
The categories $\mathbb{R}_+$ and $\Fin\Meas$ have much in common, which is why it is reasonable to assign an entropy change, namely a morphism in $\mathbb{R}_+$, to each morphism in $\Fin\Meas$. Here we focus on three similarities between these categories.
First, both $\mathbb{R}_+$ and $Fin\Meas$ are topological categories. Recall that a topological category is a category where the set of objects and the set of morphisms form topological spaces, and all the category operations are continuous. A continuous functor is a functor between topological categories that maps objects to objects and morphisms to morphisms in a continuous way.
To make $\mathbb{R}_+$ into a topological category, we use the only possible topology on the set of objects, and the standard topology on the set of morphisms $[0,\infty)$. To make $\Fin\Meas$ into a topological category, we should say not only when a sequence of objects or morphisms converges, but also when any net converges. A net of objects $p_\beta \in \Fin\Meas$ converges to $p \in \Fin\Meas$ if eventually all the $p_\beta$ have the same underlying finite set $X$, and then $(p_\beta)_i \to p_i$ in the usual topology on $[0,\infty)$ for all $i \in X$. We say a net of morphisms $f_\beta: \p_\beta \to q_\beta$ converges if $p_\beta \to p$ and $q_\beta \to q$ for some $p,q\in \Fin\Meas$ and for large enough $\beta$ the underlying function of $f_\beta$ equals a fixed function $f: X \to Y$. The observant reader will note that $\Fin\Meas$ is a ‘large’ topological category, meaning that its set of objects lives in a larger Grothendieck universe than that containing the finite sets we use here. This will not present a problem.
Second, both $\mathbb{R}_+$ and $Fin\Meas$ come equipped with a notion of ‘addition’ making them into symmetric monoidal categories. For $\mathbb{R}_+$ this addition is trivial on objects, and the usual addition of obvious nonnegative real numbers on morphisms. The resulting functor
makes $\mathbb{R}_+$ into a symmetric monoidal category. On the other hand, we have already seen how disjoint union of finite sets equipped with measures defines a ‘direct sum’ operation on $\Fin\Meas$. This gives a functor
which makes $\Fin\Meas$ into a symmetric monoidal category.
Third, both $\mathbb{R}_+$ and $Fin\Meas$ can be equipped with a notion of ‘scalar multiplication’ by any number $\lambda \in [0,\infty)$. This gives an action of the multiplicative monoid $[0,\infty)$ on these categories, by which we mean monoid homomorphisms
where for a category $C$, the monoid $\End(C)$ consists of all functors $C\rightarrow C$ with functor composition as monoid multiplication.
Now let us think a bit about how the topological, additive and multiplicative structures on $\Fin\Meas$ and $\mathbb{R}_+$ interact with each other. Note that while in general on $\Fin\Meas$,
and
we do have canonical isomorphisms
and
So $[0,\infty)$ acts not just as endofunctors on $\Fin\Meas$, but as symmetric monoidal endofunctors. Writing $End_{\Sym\Mon\Cat}$ for the monoid of symmetric monoidal endofunctors of a symmetric monoidal category $C$, we hence have a monoid homomorphism
So, given a topological monoid $M$, let us call a topological symmetric monoidal category $(X, \oplus, 0)$ equipped with a monoidal functor
providing a continuous action of $M$ on $X$ an $M$-module category. We can summarize the previous two paragraphs by saying that $\Fin\Meas$ is a $[0,\infty)$-module category, and for each $\alpha \gt 0$ there is an $[0,\infty)$-module category $\mathbb{R}_+^\alpha$.
By a morphism of $M$-module categories, we mean a symmetric monoidal functor which is compatible with the action of $M$. In this way, we obtain a category of $M$-module categories; this is the coslice category over $M$ in the category of symmetric monoidal categories with strict symmetric monoidal functors.
Using these concepts, we see that Theorem 2 implies a nice characterization of Shannon entropy:
Suppose $F : \Fin\Meas \to \mathbb{R}_+$ is a morphism of $[0,\infty)$-module categories. Then for any morphism $f : p \to q$ in $\Fin\Meas$,
for some $c \ge 0$, where $H$ is Shannon entropy. Conversely, for any constant $c \ge 0$ this equation defines a morphism of $[0,\infty)$-module categories.
What about the Tsallis entropy for $\alpha \ne 1$? In fact, for each number $\alpha \gt 0$ there is a way to make $\mathbb{R}_+$ into a $[0,\infty)$-module category, by letting $\lambda \in [0,\infty)$ act trivially on objects and as multiplication by $\lambda^\alpha$ on morphisms. We write $\mathbb{R}_+^\alpha$ for $\mathbb{R}_+$ equipped with this action of $[0,\infty)$.
The action of $[0,\infty)$ on the category $\mathbb{R}_+^\alpha$ has properties already mentioned in the case $\alpha = 1$. Namely, there are equations
and
and so the action actually also operates in terms of symmetric monoidal endofunctors,
This homomorphism $A_\alpha$ is also continuous. So, for each $\alpha \gt 0$ we see that $\mathbb{R}_+^\alpha$ is a $[0,\infty)$-module. Theorem Theorem 2 implies:
Let $\alpha\in(0,\infty)$. Suppose $F : \Fin\Meas \to \mathbb{R}_+^\alpha$ is a morphism of $[0,\infty)$-module categories. Then for any morphism $f : p \to q$ in $\Fin\Meas$,
for some $c \ge 0$, where $H_\alpha$ is the Tsallis entropy of order $\alpha$. Conversely, for any constant $c \ge 0$ this equation defines a morphism of $[0,\infty)$-module categories.
Let $\Fin\Meas_{\gt 0}$ be the category of strictly positive finite measure spaces, where:
• an object is a finite set $S$ equipped with a strictly positive measure: that is, a measure $\mu$ for which every set has nonnegative measure, and the only set of measure zero is the empty set.
• a morphism $f: (S,\mu) \to (S',\mu')$ is a function such that the pushforward of $\mu$ is $\mu'$.
We define a topological category is a category internal to $Top$, and a continuous functor is a functor internal to $Top$. There is a category $Cat(Top)$ with topological categories as objects, and continuous functors as morphisms.
We shall make $\Fin\Meas_{\gt 0}$ into a topological category as follows. The topology on objects and morphisms is the obvious one except that I’m tempted to do the following: when a sequence of measures $\mu_j$ on a set $S$ is trying to converge (in the obvious sense!) to a measure $\mu$ for which only points in some subset $T \subseteq S$ have nonzero measure, we say that
In simple heuristic terms: call the points in a finite set ‘states’. Then when the probability of certain possible states goes to zero, these states effectively ‘go away’.
Note that one interesting feature of this definition is that the space of objects of $\Fin\Meas_{\gt 0}$ is connected.
Let $\mathbb{R}$ be the topological category where objects are real numbers and there is a unique morphism $f: x \to y$ whenever $x \ge y$ (and none otherwise).
Let us classify all continuous functors
with the property that
Here $\oplus$ at left denotes the ‘direct sum’ of measure spaces, while at right it denotes the usual sum of real numbers.
Every functor
restricts to a continuous functor
Here $core(\Fin\Meas_{\gt 0})$ is the core of the category $\Fin\Meas_{\gt 0}$, meaning that we keep the same objects but include only the isomorphisms. Note that this restriction loses no information. So, if we classify all continuous functors
obeying the equation $G(X \oplus Y) = G(X) + G(Y)$, the answer to our problem will be easy.
Continuous functors
obeying the equation $G(X \oplus Y) = G(X) + G(Y)$ correspond to continuous functions
with the property that $g(0) = 0$.
Proof Sketch - Since every object in $\Fin\Meas_{\gt 0}$ is a direc sum of one-element measure spaces, the functor $g$ is determined by its value on those. A one-element measure space is just a nonnegative real number, so $G$ is determined by a function
Continuity of the functor $G$ implies that $g$ is continuous. Note that $G$ applied to the empty set must give 0. Thanks to the funny topology on $\Fin\Meas$ and the continuity of $G$, this forces $g(0) = 0$. Conversely, given any such function $g$ we can define $G$ on the finite measure space $(S,\mu)$ by
where $\mu_i$ is $\mu$ of the point $i \in S$.
Continuous functors
obeying the equation $F(X \oplus Y) = F(X) + F(Y)$ correspond to continuous functions
such that $f(0) = 0$ and and $f(x+y) \le f(x) + f(y)$.
Proof Sketch - Use the previous proposition and show that for $G : \Fin\Meas_0 \to \mathbb{R}$ to extend to some (necessarily unique) $F: \Fin\Meas_0 \to \mathbb{R}$, it is necessary and sufficient that the corresponding function $g$ obeys $g(x+y) \ge g(x) + g(y)$. We need this condition so that any map from the 2-point measure space to a 1-point measure space gets sent to a morphism in $\mathbb{R}$. Every map in $\Fin\Meas$ is surjective and every surjection is a composite of coproducts of copies of the identity and these 2-points-to-1 maps, so this is all we need.
Continuous functors
obeying the equations $G(X \oplus Y) = G(X) + G(Y)$ and $G(X \times Y) = G(X) \times G(Y)$ correspond to continuous functions
of the form $g(x) = x^\alpha$ with $\alpha \ge 0$.
Proof - This should follow from our earlier results.
Continuous functors
obeying the equation $F(X \oplus Y) = F(X) + F(Y)$ and $F(X \times Y) = F(X) \times F(Y)$ correspond to continuous functions
of the form $f(x) = x^\alpha$ with $\alpha \ge 1$.
Proof - This should follow from our earlier results.
Suppose $(S,\mu)$ is an object in $\Fin\Meas$. The continuous functor
corresponding to the function $g(x) = x^\alpha$ is then equal to
We may fix $(S,\mu)$ and think of $G(S,\mu)$ as a function of $\alpha \in [0,\infty)$. This function plays an important role in physics, where it is called the partition function of $X = (S,\mu)$. We denote this by
so that
The following proposition says the partition function is a ‘complete invariant’ of positive finite measure spaces:
Two objects of $\Fin\Meas_{\gt 0}$ have the same partition function if and only if they are isomorphic.
Proof - It is obvious that isomorphic objects of $\Fin\Meas_{\gt 0}$ have the same partition function. So, we need to show that we can recover $X = (S,\mu)$ from its partition function. To see this, note that $Z(X)(\alpha)$ is the Laplace transform of this measure on the real line:
Namely:
So, we can recover the measure $\nu$ from the partition function $Z(X)$ by taking its inverse Laplace transform. Furthermore, we can recover the multiset of numbers $\mu_i$ from the measure $\nu$.
We can restate this proposition in an even more complicated way. For any (essentially small) category $C$ let $\underline{C}$ be the decategorification of $C$: that is, the set of isomorphism classes of objects of $C$. If $C$ is a symmetric rig category then $\underline{C}$ will naturally become a commutative rig.
Let $\mathcal{R}$, the rig of partition functions, be the set of functions $f : [0,\infty) \to [0,\infty)$ of the form
where $S$ is an arbitrary finite set and $p_i$ are arbitrary positive real numbers. $\mathcal{R}$ is a closed under pointwise addition and multiplication, making it into commutative rig.
The map $Z: \underline{\Fin\Meas_{\gt 0}} \to \mathcal{R}$ sending each each isomorphism class of strictly positive finite measure space to its partition function is an isomorphism of commutative rigs.
Proof - When you penetrate the jargon, the only nonobvious part here is that $Z$ is a bijection, which was shown in the previous proposition.
The monad for convex sets is a monad on $Set$ sending any set $X$ to the set of finitely-supported probability distributions on $X$.
For example, this monad sends $\{1, \ldots, n\}$ to a set $\mathbf{P}_n$, which can be identified with the $(n - 1)$-simplex. This monad is a finitary monad, so can be thought about in a few different ways.
A finitary monad is the same thing as a finitary algebraic theory. This one can be presented by a family $(*_t)_{t \in [0, 1]}$ of binary operations, subject to some equations. They’re probably exactly those equations that hold in a convex subset of $\mathbb{R}^n$ if we put $x *_t y = (1 - t)x + t y$.
(I suspect there’s a theorem here: that for $n \ge 1$, $\mathbb{R}^n$ “satisfies no laws”. This means there are no equations between the various operations $(x, y) \mapsto (1 - t)x + t y$ beyond those forced by its being an algebra for this theory.)
A finitary algebraic theory can also be thought of as a kind of souped-up operad. In a symmetric operad $P$, one has for each bijection $\sigma: \{1, \ldots, n\} \to \{1, \ldots, n\}$ an induced map $\sigma_*: P_n \to P_n$. In a finitary algebraic theory, one has the same thing for arbitrary functions between finite sets, not just bijections. In other words, a finitary algebraic theory amounts to a non-symmetric operad $P$ together with, for each function $\theta: \{1, \ldots, m\} \to \{1, \ldots, n\}$ between finite sets, an induced map $\theta_*: P_m \to P_n$, satisfying suitable axioms.
The underlying symmetric operad for the convex monad is called the operad for convex algebras, and denote $\mathbf{P}$. An algebra of $\mathbf{P}$ is called a convex algebra.
The space of $n$-ary operations for this operad is $\mathbf{P}_n$, the space of probability distributions on $\{1, \ldots, n\}$. The composition of operations is supposed to be obvious, but we should probably write down a formula for it. The maps “$\theta_*: \mathbf{P}_n \to \mathbf{P}_m$” can be defined by pushforward of measures. An algebra for the algebraic theory of convex algebras is an algebra $X$ for the operad with the further property that the square
commutes for all $\theta: \{1, \ldots, m\} \to \{1, \ldots, n\}$.
Note that $\mathbf{P}$ is naturally a topological operad, where the topology on $\mathbf{P}_n$ is the usual topology on the $(n-1)$-simplex. In our applications, we focus on algebras of $\mathbf{P}$ in topological categories with finite products. We call these convex algebras in $\mathcal{E}$. Here are some examples:
Any convex subset of $\mathbb{R}^n$ is naturally a convex algebra in $Top$. In particular, $[0,\infty)$ is such.
Moreover, if we regard the topological monoid $([0,\infty), +, 0)$ as a one-object topological category, then it is a convex algebra in $Cat(\Top)$.
$\Fin\Meas$ is also a convex algebra in $Cat(\Top)$.
$\Fin\Prob$ is also a convex algebra in $Cat(\Top)$.
Rényi presents a nice characterization of Shannon entropy due to Fadeev here:
It is the first result mentioned in the paper. To translate into our framework, it helps to write a probability measure on the finite set $\{1,\dots,n\}$ as an $n$-tuple of numbers $p_i \in [0,1]$ with $\sum_{i = 1}^n p_i = 1$.
(Fadeev) Suppose $F : \Fin\Prob_0 \to [0,\infty)$ is a continuous functor obeying the magical property:
$F(t p_1, (1-t)p_1, p_2, \dots, p_n) = F(p_1, \dots, p_n) + p_1 F(t, 1-t)$
for all $t \in [0,1]$. Then $F$ is a nonnegative multiple of the Shannon entropy $H_1$.
One may reformulate Fadeev’s theorem in terms of operads. The details are written out here:
What follows is a sketch.
Let $P$ be an operad in a finite product category $\mathcal{E}$. We may speak of (strict) $P$-algebras in $Cat(\mathcal{E})$, the category of internal categories in $\mathcal{E}$. There is a notion of lax map between $P$-algebras in $Cat(\mathcal{E})$.
Briefly put: if $X$ and $Y$ are $P$-algebras, a lax map from $X$ to $Y$ is a functor $X \to Y$ together with, for each $n \in \mathbb{N}$, a natural transformation
satisfying one axiom involving composition in $P$, one involving the unit of $P$, and, if $P$ is symmetric, one involving the symmetric group action on $P$.
Now take $X = 1$. Let us call a lax map $1 \to Y$ a lax point of $Y$. Batanin has called them “internal $P$-algebras” in $Y$, and maybe we’ll end up preferring that name.
Examples
If $\mathcal{E} = Set$ and $P$ is the terminal non-symmetric operad then $Y$ is a monoidal category and a lax point of $Y$ is a monoid in $Y$.
If $\mathcal{E} = Set$ and $P$ is the non-symmetric operad for $M$-sets, where $M$ is a monoid, then $Y$ is a category with an $M$-action and a lax point of $Y$ is an object $y \in Y$ together a map $m \cdot y \to y$ for each $m \in M$, satisfying the obvious action-type axioms.
In what follows we apply this idea to the case where $\mathcal{E} = Top$ and $\mathbf{P}$ is the symmetric, topological operad in which $\mathbf{P}_n$ is the space of probability distributions on $\{1, \ldots, n\}$ (otherwise known as the $(n - 1)$-simplex).
The lax points of the $\mathbf{P}$-algebra $(\mathbb{R}, +, 0)$ are precisely the scalar multiples of Shannon entropy.
Sketch of proof - Write out the axioms in the definition of lax map and unwind them in this particular case. One concerns composition, and this becomes the most substantial of the Fadeev axioms on Shannon entropy: the so-called ‘magical property’ mentioned above. One concerns units, and is redundant. One concerns symmetry. Then because we’re entirely in $Top$, you’ve got one concerning continuity. So we’ve got all the Fadeev axioms except for normalization, and the result follows from (Rényi’s statement of) Fadeev’s theorem.
There’s a very similar characterization of Shannon extropy, or rather the scalar (real) powers of Shannon extropy. There we simply use the $\mathbf{P}$-algebra $((0, \infty), \cdot, 1)$, which is isomorphic to $(\mathbb{R}, +, 0)$ by taking exponentials.
We might also want to get rid of the “scalar multiples/powers” part and hit it on the nose. This seems more feasible with extropy rather than entropy, since there’s no choice of base. Still, I don’t see a nice way to do it.
This section is in a more informal style…
We begin with a sadly familiar problem:
Suppose you live in a town with a limited number of tolerable restaurants. Every Friday you go out for dinner. You randomly choose a restaurant according to a certain probability distribution $P$. If you go to the $i$th restaurant, you then choose a dish from the menu according to some probability distribution $Q_i$. How surprising will your choice be, on average?
Note: I’m only talking about how surprised you are by the choice itself—not about any additional surprise due to the salad being exceptionally tasty, or the cook having dropped her wedding ring into your soup, or something like that. So if there’s only one restaurant in town and they only serve one dish, you won’t be at all surprised by your choice. But if there were thousands of restaurants serving hundreds of dishes each, there’d be room for a lot of surprise.
This still sounds like an impossibly vague puzzle. So, I should admit that while ‘surprise’ sounds psychological and very complicated, I’m really just using it as a cute synonym for Shannon entropy. Shannon entropy can be thought of as a measure of ‘expected surprise’.
Now, ‘expected surprise’ sounds like an oxymoron: how can something expected be a surprise? But in this context ‘expected’ means ‘average’. The idea is that your ‘surprise’ at an outcome with probability $p$ is defined to be $- \ln(p)$. Then, if there are a bunch of possible outcomes distributed according to some probability distribution $P$, the ‘expected’ surprise or Shannon entropy is:
where $p_i$ is the probability of the $i$th outcome. This is a weighted average where each surprise is weighted by its probability of happening.
So, we really do have a well-defined math puzzle, and it’s not even very hard.
But the interesting thing about this problem is that it involves an operation which I’ll call ‘glomming together’ probability distributions. First you choose a restaurant according to some probability distribution $P$ on the set of restaurants. Then you choose a meal according to some probability distribution $Q_i$. If there are $n$ restaurants in town, you wind up eating meals in a way described by some probability distribution we’ll call
A bit more formally:
Suppose $P$ is a probability distribution on the set $\{1,\dots, n\}$ and $Q_i$ are probability distributions on finite sets $X_i$, where $i = 1, \dots, n$. Suppose the probability distribution $P$ assigns a probability $p_i$ to each element $i \in \{1,\dots, n\}$, and suppose the distribution $Q_i$ assigns a probability $q_{i j}$ to each element $j \in X_i$.
Then we can glom together the probability distributions $Q_i$ using $P$ and get a new probability distribution $P \circ (Q_1, \dots, Q_n)$ on the disjoint union of all the sets $X_i$. The resulting probability for each element $(i,j)$ in the disjoint union $\sqcup_{i = 1}^n X_i$ is just $p_i q_{i j}$.
These probabilities sum to 1 as they should:
Note: what $j$ ranges over depends on $i$! But it’s all kosher.
There’s something even simpler than glomming together probability measures: we can glom together numbers!
Suppose we have a probability measure $P$ on the set $\{1, \dots, n\}$ and for each element $i \in S$ we have a number $x_i$. Then we can define a new number $P(x_1, \dots, x_n)$ by
This is just the ‘weighted average’ of the numbers $x_i$. We have already seen a weighted average in the definition of Shannon entropy, equation (2).
Now we’re ready to answer the math puzzle:
On the left we’re glomming together a bunch of probability distributions using $P$ and then taking the entropy. On the right we’re taking the entropies of those probability distributions and then glomming the resulting numbers together using $P$. But there’s also an extra term: the entropy of $P$!
In other words: your expected surprise won’t be just the weighted average of your expected surprises at the various different restaurants, namely $P(S(Q_1), \ldots, S(Q_n))$. It’ll be that plus the expected surprise of your choice of restaurant, namely $S(P)$.
You might not have expected such a simple formula. But it’s easy to prove:
The formula for glomming together entropies is more important than you might think: it comes very close to characterizing Shannon entropy! We can restate Fadeev’s theorem (mentioned above) as follows:
(Fadeev) Suppose $F$ is a function sending probability measures on finite sets to real numbers. Suppose that:
$F$ is invariant under bijections.
$F$ is continuous.
$F(P\circ(Q_1, \ldots, Q_n)) = F(P) + P(F(Q_1), \ldots, F(Q_n))$.
Then $F$ is a real multiple of Shannon entropy.
In item 1 we’re using the fact that a bijection $f: X \to X'$ between finite sets together with a probability distribution on $X$ gives a probability distribution on $X'$; we want these to have the same entropy. In item 2 we’re using the standard topology on $\mathbb{R}^n$ to put a topology on the set of probability distributions on any $n$-element set. For a proof of this theorem, see the beginning of Rényi's 1961 paper.
While this equation is cute:
it’s a bit tricky to find its proper place in the world of abstract algebra. A probability distribution $P$ can act either on a list of probability distributions or a list of numbers. Shannon entropy gives a map $S$ from probability distributions to numbers. So, if you’re algebraically inclined, you would want $S$ to be a ‘homomorphism’, obeying a law like this:
We see laws of this sort all over math. But the true law has an extra term. What’s going on?
The ‘glomming’ business can be formalized using operads, and Tom’s answer is roughly: Shannon entropy is not a ‘homomorphism’ of operad algebras, but only a ‘lax homomorphism’. For an explanation of what this means, go here.
Right now I want to propose an alternative answer. I hope we can combine it with Tom’s answer and reach a deeper understanding of what’s going on.
For starters, consider another law that has an ‘extra term’:
In math jargon, the product rule says that taking the derivative of a function at a point is not a homomorphism from smooth functions to real numbers: it’s a ‘derivation’. We can get a derivation on an algebra by differentiating a family of algebra homomorphisms. Similarly, we can get the funny rule obeyed by entropy by differentiating a family of operad algebra homomorphisms.
Let’s see how.
There’s something more fundamental than the Shannon entropy of a probability distribution: namely, it’s partition function. At least that’s how physicists think. Let me explain.
Suppose $P$ is a probability measure on a finite set $X$. Let $p_i$ be the probability of the element $i \in X$. We can always write
where
is a function called the energy or Hamiltonian. The idea is that the probability of a system being in some state decreases exponentially with the energy of that state; we allow the energy $+\infty$ to account for zero probabilities.
But physicists always go further and introduce a parameter $\beta \in (0,\infty)$ which stands for the reciprocal of temperature, to capture the fact that states of high energy are even less likely to be occupied when it’s cold. Now we make $p_i$ into a function of $\beta$:
or if you prefer, simply
When $\beta = 1$, these numbers are just the probabilities $p_i$. But when $\beta \ne 1$, there’s no reason for them to sum to 1. To get a probability measure, we’d have to divide by a fudge factor called the partition function:
But I won’t do that just now: I’ll let $P(\beta)$ be the measure on the set $X$ that assigns the measure $p_i(\beta)$ to the point $i \in X$.
Now everything is a function of $\beta$! But everything reduces to something familar at $\beta = 1$, so we can stretch our old notation without breaking it. We now have functions $p_i$ sending numbers $\beta \in (0,\infty)$ to numbers, and a function $P$ sending numbers $\beta$ to measures on $X$. These reduce to our original $p_i$ and $P$ at $\beta = 1$. The partition function is also a function of $\beta$, and this equals 1 at $\beta = 1$.
But here’s the important thing: the partition function obeys a simpler law than entropy when we glom probability measures together! Suppose $P$ is a probability distribution on the set $\{1,\dots, n\}$ and $Q_i$ are probability distributions on finite sets $X_i$, where $i = 1, \dots, n$. Then
So, in fancy jargon, $Z$ is an operad algebra homomorphism!
But I need to explain what this equation means. First of all, everything is a function of $\beta$. Second, while $P$ and $Q_i$ are only probability measures at $\beta = 1$, they’re perfectly fine measures at other values of $\beta$, so all our ‘glomming’ formulas still make sense.
Let’s check to make sure we know what’s going on. An expression like $P \circ (Q_1, \dots, Q_n)$ could be ambiguous. We have a recipe for taking a probability measure on a finite set and extending it to a function from numbers $\beta \in (0,\infty)$ to measures on that set. So, we can extend $P$ and $Q_i$ this way and then ‘glom’ them to get a measure-valued function of $\beta$. On the other hand, we can glom them and then extend them. Luckily, the results agree.
Let’s see why. Suppose $P$ assigns the point $i \in X$ the probability $p_i$. When we extend this to a function of $\beta$ we get $p_i^\beta$. Similarly, suppose $Q_i$ assigns the point $j \in X_i$ the probability $q_{ i j}$. When we extend this to a function of $\beta$ we get $q_{i j}^\beta$. When we glom these, the measure of the point $(i,j) \in \sqcup X_i$ will be this function of $\beta$:
But this equals
which is the result of glomming and then extending.
The right-hand side of equation (3) is also unambiguous… but I’m wasting time: if I were a physicist I’d be done proving this equation by now, instead of stewing around explaining what it means. It’s incredibly easy to prove. From what I’ve said,
How is the Shannon entropy of a probability distribution related to its partition function? Simple:
Here I must emphasize that I’m only talking about the Shannon entropy $S(P)$ of the original probability distribution, ‘at $\beta = 1$’. Physicists extend $S(P)$ to a function of $\beta$, along with everything else. That would be very good to do, but it goes beyond our goal now, and it would make the formula relating $S$ and $Z$ a bit more complicated, so let’s not.
Our goal now is simply to get the rule for glomming entropies by differentiating the rule for glomming partition functions. So, let’s do that using equation (4). Later I’ll show you why (4) is true.
We start with the rule for glomming partition functions:
We differentiate with respect to $\beta$ and use the product rule, taking advantage of the fact that $P (Z(Q_1), \dots, Z(Q_n))$ is linear in $P$ but also linear in each of the functions $Z(Q_i)$:
Now set $\beta = 1$. We can use equation (4) and the fact that all our partition functions equal 1 at $\beta = 1$:
Hey, it looks like we’re almost done! As you can see, the product rule did most of the work, so we’re really saying that $S$ is like a ‘derivation’. We just to work out that funny-looking first term on the right-hand side. It amounts to taking a weighted sum of 1’s, which is just a sum:
and we have
so
so
So, we’ve got it!
The funny-looking rule for glomming entropies is just the derivative of the rule for glomming partition functions!
But before we go further, let’s check rule (4). For starters,
But we’ve just seen that
so
As we’ve seen already, there is deeper reason for being interested in the partition function: it’s actually a form of ‘decategorification’!
Let $\Fin\Prob_0$ be the core of $\Fin\Prob$, as defined above. Then
can be thought of as a functor from $Fin\Prob_0$ to its set of isomorphism classes, viewed as a category with only identity morphisms. In other words, $Z$ assigns to each object $P \in Fin\Prob_0$ its partition function $Z(P)$… but we can recover $P$ up to isomorphism from $Z(P)$.
Now, $\Fin\Prob_0$ is an algebra of a certain operad $\mathbf{P}$ whose $n$-ary operations are the probability measures on the set $\{1, \dots, n\}$. This is just a fancy way of talking about ‘glomming probability measures’. As a kind of consequence, the set of partition functions also becomes a $\mathbf{P}$-algebra. Furthermore, $Z$ becomes a homomorphism of $\mathbf{P}$-algebras. This last statement mostly just says that
But then we can take $Z$, differentiate it with respect to $\beta$, and set $\beta = 1$. By abstract nonsense, the resulting functor should be some sort of ‘derivation’ of the $Conv$-algebra $\Fin\Prob_0$. Concretely, we’ve seen this mostly says that
But there should also be an abstract-nonsense theory of derivations of operad algebras. (This was discussed this a bit back in week299, but only a little). So, an interesting question presents itself:
How does the ‘derivation’ way of thinking about the law $S(P\circ(Q_1, \ldots, Q_n)) =$ $S(P) + P(S(Q_1), \ldots, S(Q_n))$ relate to Tom Leinster’s interpretation of it in terms of lax operad homomorphisms, or more precisely ‘lax points’?
If we get this straightened out, it will also be tempting to extend the story to Rényi entropy, using the fact that Rényi entropy is a kind of $q$-derivative of minus the logarithm of the partition function. The $q$-derivative is a kind of ‘twisted derivation’, but a bunch of the same calculations may work in some modified form.
There is a categorical characterization of the power means (of order $\ge 1$), similar in spirit to some of the other results on this page. It follows from the characterization in
Write $\mathbb{R}_+ = [0, \infty)$. For each $t \ge 0$, there is a convex algebra structure on the set $\mathbb{R}_+$, given by taking the power mean of order $t$. The action
is given by $M_t(p, x) = (\sum_i p_i x_i^t)^{1/t}$ if $0 \lt t \lt \infty$, or $\prod x_i^{p_i}$ if $t = 0$, or $\max_{i: p_i \gt 0} x_i$ if $t = \infty$.
Now put the order $\ge$ on $\mathbb{R}_+$ to make it into a category. Since $M_t(p, -)$ is increasing (for each $t$ and $p$), power mean of any order $t \geq 0$ makes $\mathbb{R}_+$ into a convex algebra in $Cat$.
Now put the additive structure $(+, 0)$ on $\mathbb{R}_+$, to make it into a monoidal category. For $t \geq 1$ we have
and $M_t(p, 0) = 0$, so $M_t(p, -)$ is a lax monoidal functor. So power mean of any order $t \geq 1$ makes $\mathbb{R}_+$ into a convex algebra in $\Mon\Cat_\ell$, the category of monoidal categories and lax monoidal functors.
This structure is, moreover, homogeneous:
whenever $c \in \mathbb{R}_+$ and $(p, x) \in \mathbf{P}_n \times \mathbb{R}_+^n$.
The convex algebra structures on $\mathbb{R}_+ \in \Mon\Cat_\ell$ satisfying the homogeneity axiom are precisely the power means $M_t$ ($1 \leq t \leq \infty$).
Sketch proof - We show that any homogeneous convex algebra structure on $\mathbb{R}_+$ satisfies the hypotheses of the characterization theorem in arXiv:1103.2574, and is therefore one of the power means. This is elementary manipulation. The axiom called “functoriality” in that paper is exactly the commutativity of the square drawn above. “Consistency” is the unit axiom for an operad action. “Multiplicativity” follows from the composition axiom for an operad action, together with homogeneity.
I wouldn’t call this result a rephrasing of the result in my paper; it’s probably rather weaker (i.e. easier to prove), because the all-important multiplicativity axiom involves only a very restricted kind of operadic composition.
The proposition would be nicer if the homogeneity axiom could be either dropped or turned into something more categorical. I don’t know what to do about this.
One possibility is to observe that the monoid $(\mathbb{R}_+, \cdot, 1)$ acts on the monoidal category $(\mathbb{R}_+, \geq, +, 0)$, and that homogeneity amounts to preservation of that action. So we could think about convex algebras in $\mathbb{R}_{+}- \Mon\Cat_\ell$, the category of monoidal categories equipped with an action of the multiplicative monoid $\mathbb{R}_+$. That should work, but seems clumsy.
Another possibility is to use the following strange kind of monoidal functor. Let’s say that a lax monoidal functor $F: A \to B$ is homogeneous if for all $a \in A$ and $n \in \mathbb{N}$, the coherence map
is invertible. The composite of homogeneous functors is again homogeneous. Let $\Mon\Cat_h$ be the category of monoidal categories and homogeneous lax functors. Then the convex algebra structures on $\mathbb{R}_+ \in \Mon\Cat_h$ are exactly the power means of order $t \in [1, \infty]$. (Homogeneity of this kind only tells us that $M(p, c x) = cM(p, x)$ when $c$ is a natural number, but we can deduce it for all positive reals.) That’s a reasonably clean statement, but the concept of homogeneous lax functor seems strange, and was invented only for this purpose. I’ve never seen it before.
I don’t know whether the homogeneity axiom is simply redundant. I’ve tried to prove that it is, and I’ve tried to prove that it isn’t, without success.
This is Tom, still.
To study quadratic forms, you really need to study bilinear forms. Similarly, I wonder whether here, it would help to look not just at expressions such as $\sum_i p_i^\alpha$, but also at the associated expressions $\sum_i p_i x_i^{\alpha - 1}$. (You see such expressions in the definition of power mean; I have in the back of my mind the kind of thing in arXiv:1103.2574, and closely related things in my work on diversity.)
In any case, here’s a theorem in the same spirit as some of those above.
Let $\Fin\Meas\Fun$ be the topological category in which:
an object is a finite set $S$ equipped with a (nonnegative) measure $\mu$ and a function $\phi: S \to [0, \infty)$
a map $(S, \mu, \phi) \to (S', \mu', \phi')$ is a map $f: S \to S'$ of finite sets such that $f_*(\mu) = \mu'$ and $f^*(\phi') = \phi$ (i.e. $\phi'\circ f = \phi$).
Since the base space is a mere finite set, $\mu$ and $\phi$ are actually of the same type. The fancy terminology hides this, and maybe we’re embarrassed to admit it – we’re probably dreaming of higher things. Nevertheless, we’ll take advantage of it later.
$\Fin\Meas\Fun$ is a topological rig category, in an obvious way. So too is $\mathbb{R}_0$, the topological rig of all real numbers regarded as a rig category with no maps other than identities. (The $0$ subscript is supposed to mean “the only maps are isos”, as in what John wrote.)
A rig functor
is determined by its effect on objects, since $\mathbb{R}_0$ is categorically discrete. It’s the same thing as a homomorphism from $\Pi_0(\Fin\Meas\Fun)$, the rig of connected-components of $\Fin\Meas\Fun$, to $\mathbb{R}$. For each $t \geq 0$, there’s a continuous rig functor $J_t: \Fin\Meas\Fun \to \mathbb{R}_0$ given by
(Or more fancily: $J_t(S, \mu, \phi) = \int_S \phi^t d\mu$. In the case $t = 0$ and $\phi(s) = 0$, continuity forces us to put $0^0 = 1$.)
The continuous rig functors $\Fin\Meas\Fun \to \mathbb{R}_0$ are precisely the functors $J_t$ ($t \geq 0$).
Remark: I know $2 \times 2 = 4$ versions of this proposition, including the one just stated. The first choice is whether to “allow zero” or not. By “allowing zero” I mean that in the definition of $\Fin\Meas\Fun$, $\mu_s \geq 0$ and $\phi(s) \geq 0$ (as above); by “not allowing zero” I mean that we demand $\mu_s \gt 0$ and $\phi(s) \gt 0$.
The second choice is whether to work internally to $Top$ or $Poset$. (The point here is that in order to exclude the non-obvious solutions to the Cauchy functional equation $\psi(a + b) = \psi(a) + \psi(b)$, you need some extra hypothesis such as continuity or order-preservation.) The order on the objects of $\Fin\Meas\Fun$ is: $(S, \mu, \phi) \leq (S', \mu', \phi')$ iff $S = S'$, $\mu_s \leq \mu'_s$ for all $s \in S$, and $\phi(s) \leq \phi'(s)$ for all $s \in S$. The order on the maps of $\Fin\Meas\Fun$ is: $f \leq g$ iff $dom(f) \leq dom(g)$, $cod(f) \leq cod(g)$, and $f = g$ as functions.
Here are the four versions:
Allowing 0 and working in $Top$. This is the proposition above, where the continuous rig functors are $J_t$ for $t \geq 0$.
Not allowing 0 and working in $Top$. Then we get $J_t$ for all $t \in \mathbb{R}$.
Allowing 0 and working in $Poset$. Then we get $J_t$ for $t \geq 0$, together with another rig functor $J_\varepsilon$. This new functor $J_\varepsilon$ is almost the same as $J_0$, but whereas in the definition of $J_0$ we take $0^0 = 1$, in the definition of $J_\varepsilon$ we take $0^0 = 0$. Explicitly, $J_\varepsilon(S, \mu, \phi) = \lim_{t \to 0+} J_t(S, \mu, \phi) = \mu(supp \phi)$ (whereas $J_0(S, \mu, \phi) = \mu(S)$).
Not allowing 0 and working in $Poset$. Then we get $J_t$ for $t \geq 0$ (but not $J_\varepsilon$; or rather, we don’t see the difference between $J_\varepsilon$ and $J_0$).
: I’ll continue to use the same setup as in the Proposition (allowing $0$ and working in $Top$), for no other reason than that it’s the one I first thought of.
Digression: Actually, what I’m really taking advantage of is the fact that any measure on a finite set gives rise to a function. This is, if you like, its Radon-Nikodym derivative with respect to counting measure. So counting measure is playing a silent role as the canonical reference measure on a finite set.
I’ll invent some terminology. A map $f: (S, \mu) \to (S', \mu')$ in $\Fin\Meas$ is an equivalence if there exist subsets $A \subseteq S$ and $A' \subseteq S'$ such that $f^{-1} A' = A$, the induced map $A \to A'$ is a bijection, and $\mu(S \setminus A) = 0 = \mu'(S' \setminus A')$. I think this is a standard concept in measure theory, but I forget the standard name; maybe just “measure-isomorphism”?
The maps $(S, \mu, \mu) \to (S', \mu', \mu')$ in $\Fin\Meas\Fun$ are the equivalences $(S, \mu) \to (S', \mu')$.
Proof - straightforward manipulation of the definitions.
Let $\Fin\Meas_1$ be the category whose objects are those of $\Fin\Meas$ and whose maps are the equivalences. Thus, $\Fin\Meas_1$ embeds as a full subcategory of $\Fin\Meas\Fun$:
via $(S, \mu) \mapsto (S, \mu, \mu)$. Also, $\Fin\Meas_1$ is closed under the rig category operations on $\Fin\Meas\Fun$, so it’s a sub-rig-category.
John’s remarks on convergence in $\Fin\Meas$ suggest that perhaps he really wants to be using $\Fin\Meas_1$ instead of $\Fin\Meas_0$…? In any case, $\Fin\Meas_0$ is a sub-rig-category of $\Fin\Meas_1$, and it’s easy to see that the continuous rig functors $\Fin\Meas_1 \to \mathbb{R}_0$ are exactly those of the form $G_\alpha$ ($\alpha \in \mathbb{R}$), where
This is basically Corollary 1 above.
We now have a commutative triangle, or we would if I knew how to draw one. The composite
is equal to
for every $\alpha \in \mathbb{R}$. Algebraically, this just says that
But it seems to me to be noteworthy, because we naturally see the appearance of $\alpha - 1$ (as opposed to $\alpha$), which is a recurrent player in this drama.
For random variables $X$, $Y$, let us write $(X,Y)$ for the ‘joint random variable’ whose distribution is the joint distribution of $X$ and $Y$. Its entropy is the joint entropy $H((X,Y))$, which we also write as $H(X,Y)$.
The basic inequalities of information theory are the following inequalities for Shannon entropy:
Some of these inequalities can be generalized to relative Shannon entropy, Rényi entropy, or (combining both generalizations) relative Rényi entropy, also known as ‘Rényi divergence’. For a fairly thorough introduction to inequalities obeyed by Rényi divergence, see:
• T. A. L. van Even, When Data Compression and Statistics Disagree: Two Frequentist Challenges For the Minimum Description Length Principle, Chap. 6: Rényi Divergence, Ph.D. thesis, Leiden University, 2010.
Here however we will start by studying these inequalities for the Shannon entropy of probability measures on finite sets.
The first three types of inequalities can be seen as degenerate cases of the fourth, when taking one of the variables being deterministic.
We have already seen how to formulate the first two properties categorically: for the Shannon entropy functor
the non-negativity of entropy is clear by the definition of the target category, while the non-negativity of conditional entropy follows by considering $H$ on the projection map $(X,Y)\rightarrow Y$.
But what about the other two types of basic inequalities?
Here’s how to do it for the non-negativity of mutual information. First of all, we need to make sense of the concept of joint distribution. Regarding random variables $X$ and $Y$ as objects of $\Fin\Prob$ does not yet specify a joint distribution; in general, there will be many joint ddistributions $(X,Y)$ compatible with the given distributions of $X$ and $Y$. What we need is to assume the existence of a sample space on which $X$ and $Y$ live.
So consider an object $\Omega\in \Fin\Prob$; we think of it as the sample space conditional to which all other random variables will be considered: $X$ is a random variable on $\Omega$ if it comes equipped with a morphism $\Omega\rightarrow X$. In other words, a random variable on $\Omega$ is an object in the coslice category $\Omega/\Fin\Prob$.
If $X$ and $Y$ are random variables over $\Omega$, then their joint distribution $(X,Y)$ is equal to the categorical product $X\times_{\Omega}Y$ in $\Omega/\Fin\Prob$.
Proof - Straightforward.(?)
As a side remark, this also shows the following:
The category $\Fin\Prob$ does not have products.
Proof - If $\Fin\Prob$ would have products, then any coslice category $\Omega/\Fin\Prob$ would inherit these products. However since the joint distribution of two random variables with given distribution does in general depend on the sample space on which these random variables live, the product in $\Omega/\Fin\Prob$ does depend on $\Omega$.
We resume the main line of thought. By composing the functor $H$ with the natural functor $\Omega/\Fin\Prob\rightarrow\Fin\Prob$, we can also consider Shannon entropy as a functor on $\Omega/\Fin\Prob$, which is, by abuse of notation, also denoted by $H$.
Regarding the categorical product as the monoidal structure on $\Omega/\Fin\Prob$, Shannon entropy becomes a lax monoidal functor
Proof - The monoidal unit of $\Fin\Prob$ is given by the terminal object, which is any single-outcome random variable $\ast$, while in $\mathbb{R}_{\geq 0}$ the monoidal unit is $0$. Thus the compatibility of $H$ with the monoidal unit states that
which is equivalent to the condition that the one-outcome random variable $\ast$ has vanishing entropy, $H(\ast)=0$.
The compatibility of $H$ with the monoidal product requires
which is precisely the non-negativity of mutual information since $H(X\times_{\Omega}Y)=H(X,Y)$.
We have just seen that coslice categories under objects in $\Fin\Prob$ are relevant for defining joint distributions of random variables. Surprisingly, also slice categories over objects in $\Fin\Prob$ play an important role. Let us fix an object $Z\in \Fin\Prob$ and consider the slice category $\Fin\Prob/Z$. An object $(X,f_X)\in \Fin\Prob/Z$ is a finite probability space $X$ together with a measure-preserving function $f_X:X\rightarrow Z$. We may think of $f_X$ as defining a partitioning of the domain of $X$, or as a fibration with base $Z$.
Given an object $(X,f_X)\in\Fin\Prob/Z$, we define the conditional entropy by
(Since this is not exactly the usual definition of conditional entropy, we use a slightly different notation.) This quantity measures the information content of $X$ conditional to $Z$: given that we know the value of $Z$, how much more information do we expect to get when learning the value of $X$?
This reduces to the usual concept of conditional Shannon entropy as follows: given random variables $A$ and $B$, the conditional entropy is given by
which is of the above form if we take $Z=B$, $X=(A,B)$ and $f:X\rightarrow Z$ to be the projection onto the second component.
On the category $\Fin\Prob/Z$, there is a monoidal product defined as follows: for objects $(X,f_X)$ and $(Y,f_Y)$ the product ‘over $Z$’ is given by the pullback
together with the measure defined as
which just states that $X$ and $Y$ are taken to be ‘fiberwise independent’ random variables. (When the denominator of this expression vanishes, its numerator also vanishes automatically, and we take the whole expression to be $0$.)
It should be clear how this monoidal product operates on morphisms. Its unit is given by the object $(Z,\mathrm{id}_Z)$.
With these definitions, the conditional Shannon entropy $H(\cdot/Z)$ is a monoidal functor
Proof - Functoriality is clear by the previous results. That $H(\cdot/Z)$ preserves the monoidal unit means that
So it remains to check that $H(\cdot/Z)$ preserves the monoidal product. This follows from a straightforward calculation:
Now it is time to combine these observations in order to find a categorical formulation of the non-negativity of conditional mutual information, which subsumes all the other basic information inequalities as degenerate cases.
So let us fix a sample space $\Omega\in\Fin\Prob$ and a ‘cosample space’ $Z\in\Fin\Prob$. We also need to fix a measure-preserving function $g:\Omega\rightarrow Z$ which anchors $Z$ as a random variable on $\Omega$. We now consider the category of probability spaces ‘between’ $\Omega$ and $Z$. More precisely, we take $\Omega/\Fin\Prob/Z$ to be the category where an object is a triple $(X,i_X,f_X)$ consisting of a probability space $X\in\Fin\Prob$ and measure-preserving functions $i_X:\Omega\rightarrow X$ and $f_X:X\rightarrow Z$ such that $f_X\circ i_X=g$. As morphisms of $\Omega/\Fin\Prob/Z$, we take those measure-preserving functions which are compatible with the given data $i_{\cdot}$ and $f_{\cdot}$.
This is not a comma category, is it? Should it still be called ‘bislice category’?
If $(X,i_X,f_X), (Y,i_Y,f_Y)\in\Omega/\Fin\Prob/Z$, then the categorical product, denoted by $X\times^Z_\Omega Y$, is the joint distribution of $X$ and $Y$ on the sample space $\Omega$.
Proof - Straightforward.(?)
As above, this gives the same consequence:
The category $\Omega/\Fin\Prob$ does not have products.
The object $(Z,g,\mathrm{id}_Z)$ is the unit of this monoidal structure.
With respect to this monoidal structure, the conditional entropy $H(\cdot/Z)$ is a lax monoidal functor
The compatibility with the monoidal product is equivalent to the non-negativity of conditional mutual information.
Proof - Again functoriality is clear. For the unit, being lax monoidal means that
which holds true by $H(Z/Z)=H(Z)-H(Z)=0$. The compatibility of the monoidal products states that
That this inequality holds for Shannon entropy follows from second the assertion, stating that it is equivalent to the non-negativity of conditional mutual information. This equivalence still remains to be proven. To see this, note that $X$ is isomorphic to the joint variable $(X,Z)$, and similarly $Y$ is isomorphic to $(Y,Z)$. With this, (5) reads
Upon expanding the conditional entropies in terms of joint entropies, this becomes
Therefore (5) follows from the non-negativity of conditional mutual information.
The converse direction works similarly: given a triple of random variables $X$, $Y$, $Z$ on $\Omega$, we can consider $(X,Z)$ and $(Y,Z)$, which are also random variables on $\Omega$ and come equipped with the projection to $Z$, as objects in $\Omega/\Fin\Prob/Z$. Then again, (5) is equivalent to (6).
The logarithm of $Z$ has the physical meaning of free energy. The logarithm of $Z$ divided by $1 - \alpha$ is called the Rényi entropy:
This can be seen as roughly a $q$-derivative of the free energy; see
for details.
For $\alpha \gt 0$ with $\alpha \ne 1$, $H_\alpha$ defines a continuous functor
The case $\alpha = 1$ can probably be taken care of via a limit; for now let’s assume so.
Thanks to
the Rényi entropy obeys
For probability measures the Rényi entropy also obeys
Thus, if we treat $[0,\infty)$ as a topological subcategory of $\mathbb{R}$, the Rényi entropy restricts to a continuous functor
where we treat probability measure spaces where the only set of measure zero is the empty set as forming a topological subcategory $\Fin\Prob$ of $\Fin\Meas$.
Furthermore
where we write a probability measure on a set of the form $\{1,\dots, \ell \}$ as a list of nonnegative numbers that sum to 1. Thus we have:
For $\alpha \ge 1$, the Rényi entropy $H_\alpha : \Fin\Prob_0 \to [0,\infty)$ extends to a continuous functor
This is Tobias. I just want to write things down in little detail to at least have a record. So this section needs much more care. In particular, if we decide to only consider cases where are probalities are strictly positive, then one can apply continuity arguments.
Let $p=(p_1,\ldots,p_n)$ and $r=(r_1,\ldots,r_n)$ be two probability distributions on an $n$-element set. Then the relative Rényi entropy of order $q$ is given by
Here, we set $D_q(p||r)$ to be $\infty$ whenever there is an index $i$ with $r_i=0$, but $p_i\gt 0$. In terms of power means,
where $\frac{p}{r}(i)=\frac{p(i)}{r(i)}$ is the Radon-Nikodym derivative (we set $0/0=0$ here). $D_q(q||r)$ is non-negative for $q\in[0,\infty]$.
To Do?: see whether the notation is good and add some more introductory bla-bla
What are good properties of Rényi entropy? It is clearly invariant under permutations of the of the indices when these are applied to $p$ and $q$ at the same time. Furthermore, it is also invariant under adding additional randomness: we can add additional randomness to $p$ and $q$ by choosing some $\lambda\in[0,1]$ and considering the $(n+1)$-outcome distributions $p'=(\lambda p_1,(1-\lambda)p_1,p_2,\ldots,p_n)$ and $r'=(\lambda r_1,(1-\lambda)r_1,r_2,\ldots,r_n)$. Then,
So under which kind of operation does $D_q$ change at all? One thing we have not considered so far is coarse-graining, i.e. the identification of some outcomes to a single one. Again applying this to both distributions at the same time, we obtain e.g. the distributions $\hat{p}=(p_1+p_2,p_3,\ldots,p_n)$ and $\hat{r}=(r_1+r_2,r_3,\ldots,r_n)$.
In this notation,
for all $q\in[0,\infty]$.
Proof - For $q\geq 1$, this inequality states that
which is equivalent to
with $w=\left(\frac{p_1}{p_1+p_2},\frac{p_2}{p_1+p_2}\right)$ and $x_i=p_i/r_i$. This in turn always holds since the power means $M_t$ are increasing functions of the parameter $t$. Allowing the second argument of the power means to also take on the value $\infty$ (which means that, due to the particular form of the arguments, that the corresponding weight will not be $0$), this also still holds in the degenerate case where some $r_i$ vanishes.
(Under construction…)