# nLab conditional expectation


## Idea

If $(\Omega, \mathfrak{A}, P)$ is a probability space, the conditional expectation $E[X \mid \Sigma]$ of a (measurable) random variable $X$ with respect to some sub-$\sigma$-algebra $\Sigma \subseteq \mathfrak{A}$ is a measurable random variable which is a "coarsened" version of $X$. We can think of $E[X \mid \Sigma]$ as a random variable with the same domain, but measured with a $\sigma$-algebra containing only restricted information about the original events, since some events in $\mathfrak{A}$ have been assigned probability $1$ or $0$ in a consistent way.

## Conditional expectation relative to a random variable

Let $(\Omega, \mathfrak{A}, P)$ be a probability space, let $Y$ be a measurable function into a measurable space $(U, \Sigma, P^Y)$ equipped with the pushforward measure $P^Y$ induced by $Y$, and let $X \colon (\Omega, \mathfrak{A}, P) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}), \lambda)$ be a real-valued random variable.

Then for $X$ and $Y$ there exists an essentially unique (two functions are identified if they differ only on a set of measure $0$) integrable function $g =: E[X \mid Y]$ such that the following diagram commutes:

$$\array{ (\Omega,\mathfrak{A},P) & \stackrel{Y}{\to} & (U, \Sigma, P^Y) \\ \downarrow^{\mathrlap{X}} && \swarrow_{\mathrlap{g =: E[X|Y]}} \\ (\mathbb{R},\mathcal{B}(\mathbb{R}),\lambda) }$$

where $g \colon y \mapsto E[X \mid Y = y]$. Here "commutes" shall mean that

(1) $g$ is $\Sigma$-measurable, and

(2) the integrals of $X$ and of $g \circ Y$ are equal.

In this case $g = E[X \mid Y]$ is called a version of the conditional expectation of $X$ given $Y$.

In more detail, (2) is equivalent to the requirement that for all $B \in \Sigma$ we have

$$\int_{Y^{-1}(B)} X(\omega)\, d P(\omega) = \int_B g(u)\, d P^Y(u)$$

and to

$$\int_{Y^{-1}(B)} X(\omega)\, d P(\omega) = \int_{Y^{-1}(B)} (g \circ Y)(\omega)\, d P(\omega)$$

(The equivalence of the last two formulas holds since we always have $\int_B g(u)\, d P^Y(u) = \int_{Y^{-1}(B)} (g \circ Y)(\omega)\, d P(\omega)$ by the substitution rule.)

Note that it does not follow from the preceding definition that the conditional expectation exists. Existence is a consequence of the Radon-Nikodym theorem, as will be shown in the following section. (The argument of the theorem applies to the definition of the conditional expectation by random variables if we consider the pushforward measure as given by a sub-$\sigma$-algebra of the original one. In this sense $E[X \mid Y]$ is a "coarsened version" of $X$, factored by the information (i.e. the $\sigma$-algebra) given by $Y$.)
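On a finite probability space the defining property can be computed directly: a version $g(y) = E[X \mid Y = y]$ is the $P$-weighted average of $X$ over the fibre $Y^{-1}(\{y\})$. The following is a minimal sketch with hypothetical toy data (the values of `P`, `X`, `Y` are illustrative, not from the article):

```python
from collections import defaultdict

P = [0.1, 0.2, 0.3, 0.4]         # probability of each sample point
X = [1.0, 2.0, 3.0, 4.0]         # random variable X
Y = ['a', 'a', 'b', 'b']         # random variable Y (coarser information)

def cond_expectation(X, Y, P):
    """Return g with g[y] = E[X | Y = y] on a finite sample space."""
    mass = defaultdict(float)    # P^Y({y})
    sums = defaultdict(float)    # integral of X over Y^{-1}({y})
    for x, y, p in zip(X, Y, P):
        mass[y] += p
        sums[y] += x * p
    return {y: sums[y] / mass[y] for y in mass}

g = cond_expectation(X, Y, P)

# Defining property: the integral of X over Y^{-1}(B) equals the
# integral of g over B; checked here for B = the whole space U.
lhs = sum(x * p for x, p in zip(X, P))
rhs = sum(g[y] * sum(p for yy, p in zip(Y, P) if yy == y) for y in g)
assert abs(lhs - rhs) < 1e-12
```

Restricting the final check to smaller sets $B$ verifies the commutativity condition event by event.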

## Conditional expectation relative to a sub-$\sigma$-algebra

Note that by the construction of the pushforward measure it suffices to define the conditional expectation only for the case where $\Sigma := \mathfrak{S} \subseteq \mathfrak{A}$ is a sub-$\sigma$-algebra.

(Note that we lose information with the notation $P^Y$; e.g. $P^{\mathrm{id}}_{\mathfrak{A}}$ is different from $P^{\mathrm{id}}_{\mathfrak{S}}$.)

The diagram

$$\array{ (\Omega,\mathfrak{A},P) & \stackrel{\mathrm{id}}{\to} & (\Omega, \mathfrak{S}, P^{\mathrm{id}}) \\ \downarrow^{\mathrlap{X}} && \swarrow^{\mathrlap{Z =: E[X|\mathfrak{S}]}} \\ (\mathbb{R},\mathcal{B}(\mathbb{R}),\lambda) }$$

is commutative (in our sense) iff

(a) $Z$ is $\mathfrak{S}$-measurable

(b) $\int_A Z\, d P = \int_A X\, d P$ for all $A \in \mathfrak{S}$

We can hence write the conditional expectation as the equivalence class

$$E[X \mid \mathfrak{S}] = \left\{ Z \in L^1(\Omega, \mathfrak{A}, P) \;\middle|\; \int_A Z\, d P = \int_A X\, d P \;\; \forall A \in \mathfrak{S} \right\}$$

An element of this class is also called a version.
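Conditions (a) and (b) can be checked mechanically in the finite case, where a sub-$\sigma$-algebra is generated by a partition of the sample space. A minimal sketch with a hypothetical 4-point space and the partition $\{\{0,1\},\{2,3\}\}$ (all numbers illustrative):

```python
P = [0.1, 0.2, 0.3, 0.4]              # probability of each sample point
X = [1.0, 2.0, 3.0, 4.0]              # the random variable X
partition = [[0, 1], [2, 3]]          # atoms generating the sub-sigma-algebra

# Build the version Z: on each atom, Z is the conditional average of X.
Z = [0.0] * len(P)
for block in partition:
    mass = sum(P[i] for i in block)
    avg = sum(X[i] * P[i] for i in block) / mass
    for i in block:
        Z[i] = avg                    # constant on each atom: condition (a)

# Condition (b) on the generating atoms (hence on all of the sub-algebra).
for A in partition:
    int_Z = sum(Z[i] * P[i] for i in A)
    int_X = sum(X[i] * P[i] for i in A)
    assert abs(int_Z - int_X) < 1e-12
```

Any other element of the equivalence class differs from this `Z` only on a set of measure zero; on a finite space with strictly positive weights the version is unique.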

###### Theorem

$E[X \mid \mathfrak{S}]$ exists and is unique almost surely.

###### Proof

Existence: For $A \in \mathfrak{A}$, the assignment

$$Q(A) := \int_A X(\omega)\, P(d\omega)$$

defines a measure $Q$ on $(\Omega, \mathfrak{A})$ (if $X \ge 0$; if not, consider the positive part $X^+$ and the negative part $X^-$ of $X = X^+ - X^-$ separately and use linearity of the integral). Let $P|_{\mathfrak{S}}$ be the restriction of $P$ to $\mathfrak{S}$. Then

$$Q \ll P|_{\mathfrak{S}}$$

meaning: $P|_{\mathfrak{S}}(M) = 0 \Rightarrow Q(M) = 0$ for all $M \in \mathfrak{S}$. This is the absolute-continuity hypothesis of the Radon-Nikodym theorem (its other hypothesis, that $P|_{\mathfrak{S}}$ is $\sigma$-finite, is satisfied since $P$ is a probability measure). The theorem implies that $Q$ has a density with respect to $P|_{\mathfrak{S}}$, and this density is $E[X \mid \mathfrak{S}]$.

Uniqueness: If $g$ and $g'$ are two candidates, then by linearity $\int_A (g - g')\, d P = 0$ for all $A \in \mathfrak{S}$, hence $g = g'$ almost surely.

## Conditional probability

From elementary probability theory we know that $P(A) = E[1_A]$.

For $A \in \mathfrak{A}$ we call $P(A \mid \mathfrak{S}) := E[1_A \mid \mathfrak{S}]$ the conditional probability of $A$ given $\mathfrak{S}$.

## Integral kernel, Stochastic kernel

In probability theory and statistics, a stochastic kernel is the transition function of a stochastic process. In a discrete time process with continuous probability distributions, it is the same thing as the kernel of the integral operator that advances the probability density function.

### Integral kernel

An integral transform $T$ is an assignment of the form

$$(T f)(u) = \int K(t,u)\, f(t)\, d t$$

where the function $K$ of two variables is called the integral kernel of the transform $T$.

### Stochastic kernel

Let $(\Omega_1, \mathfrak{A}_1)$ and $(\Omega_2, \mathfrak{A}_2)$ be measurable spaces.

A map $Q \colon \Omega_1 \times \mathfrak{A}_2 \to [0,1]$ satisfying

(1) $Q(-, A) \colon \Omega_1 \to [0,1]$ is $\mathfrak{A}_1$-measurable for all $A \in \mathfrak{A}_2$

(2) $Q(\omega, -) \colon \mathfrak{A}_2 \to [0,1]$ is a probability measure on $(\Omega_2, \mathfrak{A}_2)$ for all $\omega \in \Omega_1$

is called a stochastic kernel or transition kernel (or Markov kernel, a term we avoid since it is confusing) from $(\Omega_1, \mathfrak{A}_1)$ to $(\Omega_2, \mathfrak{A}_2)$.

Then $Q$ induces a function between the classes of measures on $(\Omega_1, \mathfrak{A}_1)$ and on $(\Omega_2, \mathfrak{A}_2)$:

$$\overline{Q} \colon \begin{cases} M(\Omega_1, \mathfrak{A}_1) & \to & M(\Omega_2, \mathfrak{A}_2) \\ \mu & \mapsto & \left( A \mapsto \int_{\Omega_1} Q(-, A)\, d\mu \right) \end{cases}$$

If $\mu$ is a probability measure, then so is $\overline{Q}(\mu)$. The symbol $Q(\omega, A)$ is sometimes written as $Q(A \mid \omega)$ to suggest the analogy with a conditional probability.

The stochastic kernel is hence in particular an integral kernel.

In a discrete stochastic process (see below) the transition function is a stochastic kernel (more precisely it is the function $\overline{Q}$ induced by a kernel $Q$).

#### Coupling (Koppelung)

Let $(\Omega_1, \mathfrak{A}_1, P_1)$ be a probability space, let $(\Omega_2, \mathfrak{A}_2)$ be a measurable space, and let $Q \colon \Omega_1 \times \mathfrak{A}_2 \to [0,1]$ be a stochastic kernel from $(\Omega_1, \mathfrak{A}_1)$ to $(\Omega_2, \mathfrak{A}_2)$.

Then

$$P(A) := \int_{\Omega_1} \left( \int_{\Omega_2} 1_A(\omega_1, \omega_2)\, Q(\omega_1, d\omega_2) \right) P_1(d\omega_1)$$

defines a probability measure on $\mathfrak{A}_1 \otimes \mathfrak{A}_2$, which is called the coupling. $P =: P_1 \otimes Q$ is unique with the property

$$P(A_1 \times A_2) = \int_{A_1} Q(\omega_1, A_2)\, P_1(d\omega_1)$$
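For finite spaces the coupling is just the entrywise product $P(\{(i,j)\}) = P_1(\{i\})\, Q(i, \{j\})$, and the defining property on product sets can be verified directly. A sketch with hypothetical numbers (a 2-point $\Omega_1$ and 3-point $\Omega_2$):

```python
P1 = [0.4, 0.6]                  # probability measure on Omega_1 = {0, 1}
Q = [[0.5, 0.3, 0.2],            # Q(0, {j}) for j in Omega_2 = {0, 1, 2}
     [0.1, 0.1, 0.8]]            # Q(1, {j}); each row sums to 1

# The coupling P = P1 (x) Q as a matrix on Omega_1 x Omega_2.
coupling = [[P1[i] * Q[i][j] for j in range(3)] for i in range(2)]

# Defining property on a product set A1 x A2.
A1, A2 = [0], [1, 2]
lhs = sum(coupling[i][j] for i in A1 for j in A2)
rhs = sum(Q[i][j] * P1[i] for i in A1 for j in A2)
assert abs(lhs - rhs) < 1e-12

# The coupling is a probability measure: total mass 1.
assert abs(sum(sum(row) for row in coupling) - 1.0) < 1e-12
```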
###### Theorem

Let (with the above settings) $Y \colon \Omega_1 \to \Omega_2$ be $(\mathfrak{A}_1, \mathfrak{A}_2)$-measurable, and let $X$ be a $d$-dimensional random vector.

Then there exists a stochastic kernel $Q$ from $(\Omega_2, \mathfrak{A}_2)$ to $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$ such that

$$P^{X,Y} = P^Y \otimes Q$$

and $Q$ is (a version of) the conditional distribution of $X$ given $Y$, i.e.

$$Q(y, -) = P^X(- \mid Y = y)$$

This theorem says that $Q$ (more precisely $y \mapsto Q(y,-)$) fits in the diagram

$$\array{ (\Omega_1,\mathfrak{A}_1,P) & \stackrel{Y}{\to} & (\Omega_2,\mathfrak{A}_2, P^Y) \\ \downarrow^{\mathrlap{X}} && \swarrow^{\mathrlap{Q}} \\ (\mathbb{R},\mathcal{B}(\mathbb{R}),\lambda) }$$

and $Q$ determines the conditional expectation via $E[X \mid Y = y] = \int x\, Q(y, d x)$.

#### Discrete case

In the discrete case, i.e. if $\Omega_1$ and $\Omega_2$ are finite or countable sets, it is possible to reconstruct $Q$ by considering only one-element sets in $\mathfrak{A}_2$ and the related probabilities

$$p_{i j} := Q(i, \{j\})$$

called transition probabilities. They encode $Q$ and assemble into a (perhaps countably infinite) matrix $M$, called the transition matrix of $Q$ resp. of $\overline{Q}$. Note that $p_{i j}$ is the probability of the transition from the state (aka elementary event or one-element event) $i$ to the event $\{j\}$ (which in this case happens to have only one element, too). We have $\sum_j p_{i j} = 1$ for all $i \in \Omega_1$.

If $\rho := (p_i)_{i \in \Omega_1}$ is a counting density on $\Omega_1$, then

$$\rho M = \left( \sum_{i \in \Omega_1} p_i p_{i j} \right)_{j \in \Omega_2}$$

is a counting density on $\Omega_2$.
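In matrix form this is an ordinary vector-matrix product. A minimal sketch (the 2-state transition matrix is a hypothetical example):

```python
rho = [0.25, 0.75]               # counting density on Omega_1
M = [[0.9, 0.1],                 # transition matrix: rows sum to 1,
     [0.2, 0.8]]                 # i.e. sum_j p_ij = 1 for each state i

# rho M = (sum_i p_i p_ij)_j, the pushed-forward density on Omega_2
rhoM = [sum(rho[i] * M[i][j] for i in range(2)) for j in range(2)]
# rhoM == [0.375, 0.625]

assert abs(sum(rhoM) - 1.0) < 1e-12   # again a counting density
```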

The conditional expectation plays a defining role in the theory of martingales, which are stochastic processes such that the conditional expectation of the next value (given the previous values) equals the present realized value.

### Stochastic processes

The terminology of stochastic processes is a special interpretation of some aspects of infinitary combinatorics in terms of probability theory.

Let $I$ be a total order (i.e. transitive, antisymmetric, and total).

A stochastic process is a diagram $X_I \colon I \to \mathcal{R}$, where $\mathcal{R}$ is the class of random variables, such that $X_I(i) =: X_i \colon (\Omega_i, \mathfrak{F}_i, P_i) \to (S_i, \mathfrak{S}_i)$ is a random variable. Often one considers the case where all $(S_i, \mathfrak{S}_i) = (S, \mathfrak{S})$ are equal; in this case $S$ is called the state space of the process $X_I$.

If all $\Omega_i = \Omega$ are equal and the family of $\sigma$-algebras $(\mathfrak{F}_i)_{i \in I}$ is filtered, i.e.

$$\mathfrak{F}_i \subseteq \mathfrak{F}_j \quad \text{for} \quad i \le j$$

and each $X_l$ is $\mathfrak{F}_l$-measurable, the process is called an adapted process.

For example, the natural filtration, where $\mathfrak{F}_i = \sigma(\{X_l^{-1}(A) \mid l \le i,\, A \in \mathfrak{S}\})$, gives an adapted process.

In terms of a diagram we have, for $i \le j$,

$$\array{ (\Omega_j,\mathfrak{A}_j,P_j) & \stackrel{f}{\to} & (\Omega_i,\mathfrak{A}_i,P_i) \\ \downarrow^{\mathrlap{X_j}} && \swarrow^{\mathrlap{\omega_i \mapsto Q(\omega_i,-)}} \\ (\mathbb{R},\mathcal{B}(\mathbb{R}),\lambda) }$$

and $\overline{Q} \colon (\Omega_i, \mathfrak{A}_i, P_i) \to (\Omega_j, \mathfrak{A}_j, P_j)$, where $Q \colon \Omega_i \times \mathfrak{A}_j \to [0,1]$ is the transition probability for the passage from state $i$ to state $j$.

### Martingale

An adapted stochastic process in discrete time, with the natural filtration, is called a martingale if $E[|X_i|] < \infty$ for all $i$ and $E[X_j \mid \mathfrak{A}_i] = X_i$ for all $i \le j$.

$$\array{ (\Omega_j,\mathfrak{A}_j,P_j) & \stackrel{f}{\to} & (\Omega_i,\mathfrak{A}_i,P_i) \\ \downarrow^{\mathrlap{X_j}} && \swarrow^{\mathrlap{E[X_j|\mathfrak{A}_i] = X_i}} \\ (\mathbb{R},\mathcal{B}(\mathbb{R}),\lambda) }$$
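The simple symmetric random walk $S_n = \sum_{k \le n} \xi_k$ with independent $\pm 1$ steps is the standard example: $E[S_j \mid \mathfrak{A}_i] = S_i$ for $i \le j$. Since every step sequence of length $n$ has probability $2^{-n}$, the martingale property can be checked exactly by enumeration; a minimal sketch:

```python
from itertools import product

# All 8 equally likely paths of 3 steps of +-1.
paths = list(product([-1, 1], repeat=3))

# Condition on the first two steps (the natural filtration at time 2)
# and average S_3 over the paths extending each prefix.
for prefix in product([-1, 1], repeat=2):
    S2 = sum(prefix)                              # realized value at time 2
    ext = [p for p in paths if p[:2] == prefix]   # atoms of the condition
    ES3 = sum(sum(p) for p in ext) / len(ext)     # E[S_3 | first two steps]
    assert abs(ES3 - S2) < 1e-12                  # martingale property
```

The assertion holds because the third step has mean zero, so conditioning on the past adds nothing to the expected future value.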


(…)

### Markov Process

A stochastic process satisfying

$$P(X_t \mid \mathfrak{A}_s) = P(X_t \mid X_s) \quad \forall s \le t$$

is called a Markov process.

## Chapman-Kolmogorov Equation

For a Markov process the Chapman-Kolmogorov equation encodes the statement that the transition probabilities of the process form a semigroup.

If, in the notation from above, $(P_t \colon \Omega \times \mathfrak{A} \to [0,1])_t$ is a family of stochastic kernels $(\Omega, \mathfrak{A}) \to (\Omega, \mathfrak{A})$ such that each $P_t(\omega, -) \colon \mathfrak{A} \to [0,1]$ is a probability measure, then $(P_t)_t$ is called a transition semigroup if

$$\overline{P}_t(P_s(\omega, -))(A) = P_{s+t}(\omega, A)$$

where

$$\overline{P}_t \colon P_s(\omega, -) \mapsto \left( A \mapsto \int_\Omega P_t(y, A)\, P_s(\omega, d y) \right)$$
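For a finite-state chain in discrete time the transition semigroup is generated by a single transition matrix, $P_t = P^t$ (matrix power), and the Chapman-Kolmogorov equation reduces to $P^s P^t = P^{s+t}$. A sketch in plain Python with a hypothetical 2-state matrix:

```python
def matmul(A, B):
    """Multiply two matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def matpow(A, n):
    """n-th power of a 2x2 matrix (n >= 0)."""
    R = [[1.0, 0.0], [0.0, 1.0]]      # identity
    for _ in range(n):
        R = matmul(R, A)
    return R

P = [[0.9, 0.1],
     [0.2, 0.8]]                      # stochastic: rows sum to 1

s, t = 2, 3
lhs = matmul(matpow(P, s), matpow(P, t))   # P^s composed with P^t
rhs = matpow(P, s + t)                     # P^{s+t}
assert all(abs(lhs[i][j] - rhs[i][j]) < 1e-12
           for i in range(2) for j in range(2))
```

The double integral in the kernel formulation becomes, entrywise, the sum $\sum_y P_s(\omega, \{y\})\, P_t(y, A)$, which is exactly the matrix product.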

Revised on October 31, 2012 18:28:49 by Stephan Alexander Spahn (79.227.163.114)