nLab
conditional expectation

This article is under construction.

Idea

If $(\Omega,\mathfrak{A},P)$ is a probability space, the conditional expectation $E[X|\Sigma]$ of a (measurable) random variable $X$ with respect to a sub-$\sigma$-algebra $\Sigma\subseteq \mathfrak{A}$ is a $\Sigma$-measurable random variable which is a "coarsened" version of $X$. We can think of $E[X|\Sigma]$ as a random variable with the same domain, but measured with a $\sigma$-algebra that contains only restricted information on the original events, since some events in $\mathfrak{A}$ have been assigned probability $1$ or $0$ in a consistent way.

Conditional expectation relative to a random variable

Let $(\Omega,\mathfrak{A},P)$ be a probability space, let $Y$ be a measurable function into a measure space $(U,\Sigma,P^Y)$ equipped with the pushforward measure induced by $Y$, and let $X:(\Omega,\mathfrak{A},P)\to(\mathbb{R},\mathcal{B}(\mathbb{R}), \lambda)$ be a real-valued random variable.

Then for $X$ and $Y$ there exists an essentially unique (two functions are regarded as equivalent if they differ only on a set of measure $0$) integrable function $g=:E[X|Y]$ such that the following diagram commutes:

$$\array{ (\Omega,\mathfrak{A},P)& \stackrel{Y}{\to}& (U, \Sigma, P^Y) \\ \downarrow^{\mathrlap{X}} && \swarrow_{\mathrlap{g=:E[X|Y]}} \\ (\mathbb{R},\mathcal{B}(\mathbb{R}),\lambda) }$$

where $g:y\mapsto E[X|Y=y]$. Here "commutes" shall mean that

(1) $g$ is $\Sigma$-measurable.

(2) the integrals over $X$ and $g\circ Y$ are equal.

In this case $g=E[X|Y]$ is called a version of the conditional expectation of $X$ given $Y$.

In more detail, (2) is equivalent to the condition that for all $B\in \Sigma$ we have

$$\int_{Y^{-1}(B)}X(\omega)\,d P(\omega)=\int_B g(u)\,d P^Y (u)$$

and to

$$\int_{Y^{-1}(B)}X(\omega)\,d P(\omega)=\int_{Y^{-1}(B)}(g\circ Y)(\omega)\,d P (\omega)$$

(The last two formulas are equivalent since we always have $\int_B g(u)\,d P^Y (u)=\int_{Y^{-1}(B)} (g\circ Y)(\omega)\,d P (\omega)$ by the substitution rule.)
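On a finite probability space the defining property (2) can be checked directly. The following is a minimal sketch, assuming a hypothetical six-point space (one fair die roll) where Y records only the parity of the outcome; the names `omega`, `cond_exp_given_Y`, `lhs` and `rhs` are illustrative and do not come from the text above.

```python
from fractions import Fraction

# Hypothetical finite probability space: one fair die roll.
omega = [1, 2, 3, 4, 5, 6]
P = {w: Fraction(1, 6) for w in omega}

def X(w):
    # real-valued random variable: the number shown
    return w

def Y(w):
    # coarser observation: parity (0 for even, 1 for odd)
    return w % 2

def cond_exp_given_Y(X, Y, P):
    """Return g with g[y] = E[X | Y = y] for every y with P(Y = y) > 0."""
    mass, weighted = {}, {}
    for w, p in P.items():
        y = Y(w)
        mass[y] = mass.get(y, 0) + p
        weighted[y] = weighted.get(y, 0) + X(w) * p
    return {y: weighted[y] / mass[y] for y in mass if mass[y] > 0}

g = cond_exp_given_Y(X, Y, P)

def lhs(B):
    # integral of X over Y^{-1}(B) with respect to P
    return sum(X(w) * P[w] for w in omega if Y(w) in B)

def rhs(B):
    # integral of g over B with respect to the pushforward measure P^Y
    PY = {}
    for w, p in P.items():
        PY[Y(w)] = PY.get(Y(w), 0) + p
    return sum(g[y] * PY[y] for y in B)

# Property (2) for every B in the (finite) sigma-algebra on {0, 1}
for B in [set(), {0}, {1}, {0, 1}]:
    assert lhs(B) == rhs(B)
```

Exact rational arithmetic is used so that the two integrals can be compared with `==` rather than up to rounding.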

Note that it does not follow from the preceding definition that the conditional expectation exists. Existence is a consequence of the Radon-Nikodym theorem, as will be shown in the following section. (Note that the argument of the theorem applies to the definition of the conditional expectation by random variables if we consider the pushforward measure as given by a sub-$\sigma$-algebra of the original one. In this sense $E[X|Y]$ is a "coarsened version" of $X$ factored by the information (i.e. the $\sigma$-algebra) given by $Y$.)

Conditional expectation relative to a sub-$\sigma$-algebra

Note that by construction of the pushforward measure it suffices to define the conditional expectation only for the case where $\Sigma:=\mathfrak{S}\subseteq \mathfrak{A}$ is a sub-$\sigma$-algebra.

(Note that we lose information with the notation $P^Y$; e.g. $P^{id}_\mathfrak{A}$ is different from $P^{id}_\mathfrak{S}$.)

The diagram

$$\array{ (\Omega,\mathfrak{A},P)& \stackrel{id}{\to}& (\Omega, \mathfrak{S}, P^{id}) \\ \downarrow^X&& \swarrow^{Z=:E[X|\mathfrak{S}]} \\ (\mathbb{R},\mathcal{B}(\mathbb{R}),\lambda) }$$

is commutative (in our sense) iff

(a) $Z$ is $\mathfrak{S}$-measurable

(b) $\int_A Z\, d P=\int_A X\, d P$ for all $A\in \mathfrak{S}$

We hence can write the conditional expectation as the equivalence class

$$E[X|\mathfrak{S}]=\{Z\in L^1 (\Omega, \mathfrak{S},P|_{\mathfrak{S}})\;|\;\int_A Z\,dP=\int_A X\,dP\;\forall A\in \mathfrak{S}\}$$

An element of this class is also called a version.

Theorem

$E[X|\mathfrak{S}]$ exists and is unique almost surely.

Proof

Existence: By

$$Q(A):=\int_A X(\omega)\,P(d\omega)$$

for $A\in \mathfrak{A}$, a measure $Q$ on $(\Omega,\mathfrak{A})$ is defined (if $X\ge 0$; if not, treat the positive part $X^+$ and the negative part $X^-$ of $X=X^+ -X^-$ separately and use linearity of the integral). Let $P|_{\mathfrak{S}}$ and $Q|_{\mathfrak{S}}$ be the restrictions of $P$ and $Q$ to $\mathfrak{S}$. Then

$$Q|_{\mathfrak{S}}\ll P|_{\mathfrak{S}}$$

meaning: $P|_{\mathfrak{S}}(M)=0\Rightarrow Q(M)=0$ for all $M\in\mathfrak{S}$. This is the condition of the Radon-Nikodym theorem (the other condition of the theorem, that $P|_{\mathfrak{S}}$ is $\sigma$-finite, is satisfied since $P$ is a probability measure). The theorem implies that $Q|_{\mathfrak{S}}$ has an $\mathfrak{S}$-measurable density w.r.t. $P|_{\mathfrak{S}}$, which is $E[X|\mathfrak{S}]$.

Uniqueness: If $g$ and $g^\prime$ are candidates, then by linearity the integral of their difference over every $A\in\mathfrak{S}$ is zero. Since $g-g^\prime$ is $\mathfrak{S}$-measurable, taking $A=\{g-g^\prime\gt 0\}$ and $A=\{g-g^\prime\lt 0\}$ shows that $g=g^\prime$ almost surely.
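When the sub-σ-algebra is generated by a finite partition, the Radon-Nikodym density from the existence argument can be written down explicitly: it is constant on each block, with value Q(block)/P(block). The following is a small sketch under that assumption; the space, the values of `X` and the partition `atoms` are hypothetical and chosen only for illustration.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical setup: Omega = {0,...,5} with uniform P, and the
# sub-sigma-algebra generated by the partition {0,1}, {2,3,4}, {5}.
omega = range(6)
P = {w: Fraction(1, 6) for w in omega}
X = {0: 3, 1: -1, 2: 0, 3: 6, 4: 3, 5: 10}
atoms = [frozenset({0, 1}), frozenset({2, 3, 4}), frozenset({5})]

def Q(A):
    # (signed) measure Q(A) = integral of X over A with respect to P
    return sum(X[w] * P[w] for w in A)

def P_of(A):
    return sum(P[w] for w in A)

# Density of Q with respect to P restricted to the sub-sigma-algebra:
# constant on each atom, equal to Q(atom) / P(atom).
Z = {}
for atom in atoms:
    value = Q(atom) / P_of(atom)
    for w in atom:
        Z[w] = value

def unions(atoms):
    # every set of the generated sigma-algebra is a union of atoms
    for r in range(len(atoms) + 1):
        for combo in combinations(atoms, r):
            yield frozenset().union(*combo)

# Condition (b): the integrals of Z and X agree on every such set.
condition_b = all(sum(Z[w] * P[w] for w in A) == Q(A) for A in unions(atoms))
```

Here `Z` is measurable with respect to the partition by construction, so it satisfies (a), and `condition_b` records whether (b) holds on all eight sets of the generated σ-algebra.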

Conditional probability

From elementary probability theory we know that $P(A)=E[1_A]$.

For $A\in \mathfrak{A}$ we call $P(A|\mathfrak{S}):=E[1_A|\mathfrak{S}]$ the conditional probability of $A$ given $\mathfrak{S}$.
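As a worked special case (a sketch, assuming that the sub-$\sigma$-algebra is $\{\emptyset,B,B^c,\Omega\}$ for a single event $B\in\mathfrak{A}$ with $0\lt P(B)\lt 1$; the event $B$ is introduced here only for illustration), a version of the conditional probability of $A$ is

$$P(A|\{\emptyset,B,B^c,\Omega\})(\omega)=\frac{P(A\cap B)}{P(B)}\,1_B(\omega)+\frac{P(A\cap B^c)}{P(B^c)}\,1_{B^c}(\omega)$$

This function is constant on $B$ and on $B^c$, hence measurable, and its integral over $B$ is $P(A\cap B)=\int_B 1_A\,d P$ (similarly for $B^c$, $\emptyset$ and $\Omega$), so it satisfies the defining conditions. On $B$ it takes the elementary value $P(A\cap B)/P(B)$.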

Conditional distribution, Conditional density

Integral kernel, Stochastic kernel

In probability theory and statistics, a stochastic kernel is the transition function of a stochastic process. In a discrete time process with continuous probability distributions, it is the same thing as the kernel of the integral operator that advances the probability density function.

Integral kernel

An integral transform $T$ is an assignment of the form

$$(Tf)(u)=\int K(t,u)f(t)\,dt$$

where the function of two variables $K(\cdot,\cdot)$ is called the integral kernel of the transform $T$.

Stochastic kernel

Let $(\Omega_1,\mathfrak{A}_1)$ and $(\Omega_2,\mathfrak{A}_2)$ be measurable spaces.

A map $Q: \Omega_1\times \mathfrak{A}_2\to [0,1]$ satisfying

(1) $Q(-, A_2):\Omega_1\to [0,1]$ is $\mathfrak{A}_1$-measurable for all $A_2\in \mathfrak{A}_2$

(2) $Q(\omega_1,-):\mathfrak{A}_2\to [0,1]$ is a probability measure on $(\Omega_2,\mathfrak{A}_2)$ for all $\omega_1\in \Omega_1$

is called a stochastic kernel or transition kernel (or Markov kernel, a term which we avoid since it is confusing) from $(\Omega_1,\mathfrak{A}_1)$ to $(\Omega_2,\mathfrak{A}_2)$.

Then $Q$ induces a function between the classes of measures on $(\Omega_1, \mathfrak{A}_1)$ and on $(\Omega_2, \mathfrak{A}_2)$

$$\overline{Q}: \begin{cases} M(\Omega_1, \mathfrak{A}_1)& \to& M(\Omega_2, \mathfrak{A}_2) \\ \mu& \mapsto& (A\mapsto \int_{\Omega_1} Q(-, A)\, d\mu) \end{cases}$$

If $\mu$ is a probability measure, then so is $\overline{Q}(\mu)$. The symbol $Q(\omega, A)$ is sometimes written as $Q(A|\omega)$, in visual analogy to a conditional probability.

The stochastic kernel is hence in particular an integral kernel.

In a discrete stochastic process (see below) the transition function is a stochastic kernel (more precisely it is the function $\overline{Q}$ induced by a kernel $Q$).

Coupling (Koppelung)

Let $(\Omega_1,\mathfrak{A}_1, P_1)$ be a probability space, let $(\Omega_2,\mathfrak{A}_2)$ be a measurable space, and let $Q:\Omega_1\times \mathfrak{A}_2\to [0,1]$ be a stochastic kernel from $(\Omega_1,\mathfrak{A}_1)$ to $(\Omega_2,\mathfrak{A}_2)$.

Then by

$$P(A):=\int_{\Omega_1}\left(\int_{\Omega_2} 1_A (\omega_1,\omega_2)\, Q(\omega_1,d\omega_2)\right)P_1(d \omega_1)$$

a probability measure on $\mathfrak{A}_1\otimes\mathfrak{A}_2$ is defined, which is called the coupling. $P=:P_1\otimes Q$ is unique with the property

$$P(A_1\times A_2)=\int_{A_1} Q(\omega_1, A_2)\, P_1(d\omega_1)$$
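For finite sets the coupling reduces to multiplying each point mass of the first measure by the corresponding row of the kernel. The following sketch assumes a hypothetical two-point first space and three-point second space; the names `P1`, `Q`, `coupling`, `P_rect` and `kernel_integral` are illustrative.

```python
from fractions import Fraction
from itertools import product

# Hypothetical finite example: Omega_1 = {"a", "b"}, Omega_2 = {0, 1, 2}.
P1 = {"a": Fraction(1, 4), "b": Fraction(3, 4)}
# Q[w1][w2] = Q(w1, {w2}); each row is a probability measure on Omega_2.
Q = {
    "a": {0: Fraction(1, 2), 1: Fraction(1, 2), 2: Fraction(0)},
    "b": {0: Fraction(1, 3), 1: Fraction(1, 3), 2: Fraction(1, 3)},
}

def coupling(P1, Q):
    """Point masses of the coupling of P1 and Q on Omega_1 x Omega_2."""
    return {(w1, w2): P1[w1] * Q[w1][w2] for w1 in P1 for w2 in Q[w1]}

P = coupling(P1, Q)

def P_rect(A1, A2):
    # P(A1 x A2) computed from the joint point masses
    return sum(P[(w1, w2)] for w1, w2 in product(A1, A2))

def kernel_integral(A1, A2):
    # integral over A1 of Q(w1, A2) with respect to P1
    return sum(sum(Q[w1][w2] for w2 in A2) * P1[w1] for w1 in A1)
```

Comparing `P_rect(A1, A2)` with `kernel_integral(A1, A2)` on a few rectangles illustrates the characterizing property of the coupling in this finite setting.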
Theorem

Let (with the above settings) $Y:\Omega_1\to \Omega_2$ be $(\mathfrak{A}_1,\mathfrak{A}_2)$-measurable, let $X$ be a $d$-dimensional random vector.

Then there exists a stochastic kernel $Q$ from $(\Omega_2, \mathfrak{A}_2)$ to $(\mathbb{R}^d,\mathcal{B}(\mathbb{R})^d)$ such that

$$P^{X,Y}=P^Y\otimes Q$$

and $Q$ is (a version of) the conditional distribution of $X$ given $Y$, i.e.

$$Q(y,-)=P^X(-|Y=y)$$

This theorem says that $Q$ (more precisely $y\mapsto Q(y,-)$) fits in the diagram

$$\array{ (\Omega_1,\mathfrak{A}_1,P)& \stackrel{Y}{\to}& (\Omega_2,\mathfrak{A}_2, P^Y) \\ \downarrow^X&& \swarrow^{Q} \\ (\mathbb{R},\mathcal{B}(\mathbb{R}),\lambda) }$$

and $E[X|Y=y]=\int x\, Q(y,d x)$ for integrable $X$; in this sense $Q$ takes the place of $E[X|Y]$ in the diagram.

Discrete case

In the discrete case, i.e. if $\Omega_1$ and $\Omega_2$ are finite or countable sets, it is possible to reconstruct $Q$ by just considering one-element sets in $\mathfrak{A}_2$. The related probabilities

$$p_{ij}:= Q(i,\{j\})$$

called transition probabilities, encode $Q$ and assemble to a (perhaps countably infinite) matrix $M$ called the transition matrix of $Q$ resp. of $\overline{Q}$. Note that $p_{ij}$ is the probability of the transition from the state (aka elementary event or one-element event) $i$ to the event $\{j\}$ (which in this case happens to have only one element, too). We have $\sum_j p_{ij}=1$ for all $i\in \Omega_1$.

If $\rho:=(p_i)_{i\in \Omega_1}$ is a counting density on $\Omega_1$, then

$$\rho M=(\sum_{i\in \Omega_1} p_i p_{ij})_{j\in \Omega_2}$$

is a counting density on $\Omega_2$.
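In the finite case the induced map on measures is just a row vector multiplied by the transition matrix. The following sketch assumes a hypothetical kernel from a two-element set to a three-element set; the numbers and the helper names `is_stochastic` and `push` are illustrative only.

```python
from fractions import Fraction

# Hypothetical transition matrix: row i holds the probabilities Q(i, {j}).
M = [
    [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)],  # row i = 0
    [Fraction(0), Fraction(1, 3), Fraction(2, 3)],     # row i = 1
]
# Hypothetical counting density on the two-element source set.
rho = [Fraction(2, 5), Fraction(3, 5)]

def is_stochastic(M):
    # every row is non-negative and sums to one
    return all(sum(row) == 1 and all(p >= 0 for p in row) for row in M)

def push(rho, M):
    """Row vector times matrix: (sum_i rho_i * p_ij)_j."""
    cols = range(len(M[0]))
    return [sum(rho[i] * M[i][j] for i in range(len(M))) for j in cols]

image = push(rho, M)
```

Because each row of `M` sums to one, the entries of `image` again sum to one, which is the finite version of the statement that a probability measure is mapped to a probability measure.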

The conditional expectation plays a defining role in the theory of martingales which are stochastic processes such that the conditional expectation of the next value (provided the previous values) equals the present realized value.

Stochastic processes

The terminology of stochastic processes is a special interpretation of some aspects of infinitary combinatorics in terms of probability theory.

Let $I$ be a total order (i.e. transitive, antisymmetric, and total).

A stochastic process is a diagram $X_I: I\to \mathcal{R}$ where $\mathcal{R}$ is the class of random variables such that $X_I(i)=:X_i:(\Omega_i, \mathfrak{F}_i, P_i)\to (S_i, \mathfrak{S}_i)$ is a random variable. Often one considers the case where all $(S_i, \mathfrak{S}_i)=(S, \mathfrak{S})$ are equal; in this case $S$ is called the state space of the process $X_I$.

If all $\Omega_i=\Omega$ are equal and the class of $\sigma$-algebras $(\mathfrak{F}_i)_{i\in I}$ is filtered, i.e.

$$\mathfrak{F}_i\subseteq \mathfrak{F}_j\quad\text{whenever}\quad i\le j$$

and all $X_l$ are $\mathfrak{F}_l$-measurable, the process is called an adapted process.

For example, the natural filtration, where $\mathfrak{F}_i=\sigma(\{X^{-1}_l(A), l\le i, A\in \mathfrak{S}\})$, gives an adapted process.

In terms of a diagram we have for $i\le j$

$$\array{ (\Omega_j,\mathfrak{A}_j,P_j)& \stackrel{f}{\to}& (\Omega_i,\mathfrak{A}_i,P_i) \\ \downarrow^{X_j}&& \swarrow^{\omega_i\mapsto Q(\omega_i,-)} \\ (\mathbb{R},\mathcal{B}(\mathbb{R}),\lambda) }$$

and $\overline{Q}:(\Omega_i,\mathfrak{A}_i,P_i)\to(\Omega_j,\mathfrak{A}_j,P_j)$ where $Q:\Omega_i\times\mathfrak{A}_j\to [0,1]$ is the transition probability for the passage from state $i$ to state $j$.

Martingale

An adapted stochastic process with the natural filtration in discrete time is called a martingale if all $E[|X_i|]\lt \infty$ and $E[X_j|\mathfrak{A}_i]=X_i$ for all $i\le j$.

$$\array{ (\Omega_j,\mathfrak{A}_j,P_j)& \stackrel{f}{\to}& (\Omega_i,\mathfrak{A}_i,P_i) \\ \downarrow^{X_j}&& \swarrow^{E[X_j|\mathfrak{A}_i]=X_i} \\ (\mathbb{R},\mathcal{B}(\mathbb{R}),\lambda) }$$
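The martingale condition can be verified by exact enumeration for a small example. The sketch below assumes a simple symmetric random walk with a hypothetical horizon of four steps, with the natural filtration represented by the initial segments of the step sequence; none of the names come from the text above.

```python
from fractions import Fraction
from itertools import product

N = 4  # hypothetical horizon
# All sequences of N steps; each has probability 2^-N.
paths = list(product([-1, 1], repeat=N))
prob = Fraction(1, 2 ** N)

def X(path, n):
    # position of the walk after n steps
    return sum(path[:n])

def cond_exp(j, i):
    """E[X_j | first i steps], as a dict keyed by the first i steps."""
    total, mass = {}, {}
    for path in paths:
        key = path[:i]  # the atom of the natural filtration at time i
        total[key] = total.get(key, 0) + X(path, j) * prob
        mass[key] = mass.get(key, 0) + prob
    return {key: total[key] / mass[key] for key in total}

def is_martingale():
    # check E[X_j | first i steps] == X_i on every atom, for all i <= j
    for i in range(N + 1):
        for j in range(i, N + 1):
            ce = cond_exp(j, i)
            if any(ce[key] != sum(key) for key in ce):
                return False
    return True
```

Each key of `cond_exp(j, i)` is an atom of the σ-algebra generated by the first `i` steps, so comparing the averaged value with `sum(key)` is exactly the comparison of the conditional expectation with the present value of the walk.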

(…)

Markov process

An adapted stochastic process satisfying

$$P(X_t|\mathfrak{A}_s)=P(X_t|X_s)\quad\forall s\le t$$

is called a Markov process.

Chapman-Kolmogorov equation

For a Markov process the Chapman-Kolmogorov equation encodes the statement that the transition probabilities of the process form a semigroup.

If in the notation from above $(P_t:\Omega\times\mathfrak{A}\to [0,1])_t$ is a family of stochastic kernels $(\Omega,\mathfrak{A})\to(\Omega,\mathfrak{A})$ such that all $P_t(\omega,-):\mathfrak{A}\to [0,1]$ are probabilities, then $(P_t)_t$ is called a transition semigroup if

$$\overline P_t (P_s(\omega,-))(A)=P_{s+t} (\omega, A)$$

where

$$\overline P_t: P_s(\omega,-)\mapsto (A\mapsto\int_\Omega P_t (y,A)\, P_s(\omega,d y))$$
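For a finite state space in discrete time the semigroup condition becomes a statement about matrix powers. The following sketch assumes a hypothetical two-state transition matrix `M` and takes the t-step kernel to be the t-th power of `M`; it is meant as an illustration of the equation, not as a general implementation.

```python
from fractions import Fraction

# Hypothetical one-step transition matrix on two states.
M = [
    [Fraction(1, 2), Fraction(1, 2)],
    [Fraction(1, 4), Fraction(3, 4)],
]

def matmul(A, B):
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def P(t):
    """t-step kernel as a matrix, with P(0) the identity."""
    n = len(M)
    result = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    for _ in range(t):
        result = matmul(result, M)
    return result

def chapman_kolmogorov_holds(s, t):
    # entry (i, k) of the product is sum_j P_s(i, {j}) * P_t(j, {k})
    return matmul(P(s), P(t)) == P(s + t)
```

The sum over the intermediate state `j` inside `matmul` is the finite counterpart of the integral over $y$ in the definition of $\overline P_t$.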

Revised on October 31, 2012 18:28:49 by Stephan Alexander Spahn (79.227.163.114)