Multivariate Normal Distribution#

from IPython.display import display, HTML

display(HTML(“””

“””))

display(HTML(“””

“””))

Probability defined over vectors and the MVN#

Up until this point we have discussed assigning probabilities to a single random variable, or to a sequence of random variables such that every random variable in the sequence is independent of one another. We have not yet discussed (the very real) case that two or more random variables are related to one another and how to express these relationships.

Given a set of random variables \(X_{1}, X_{2}, \cdots, X_{n}\), a succint approach for defining probabilities to these random variables is by collecting them into a vector

(215)#\[\begin{align} X = \begin{bmatrix} X_{1} \\ X_{1} \\ \vdots X_{n} \\ \end{bmatrix} \end{align}\]

and assigning probabilties on the space \(\mathbb{R}^{n}\). In other words, we will begin to explore mathematical statements like

(216)#\[\begin{align} P(X = [1,2,3]^{T}) \end{align}\]

, or “What is the probability that the first random variable \(X_{1}\) equals the value 1 and the first random variable \(X_{2}\) equals the value 2 and the third random variable \(X_{3}\) equals the value 3 “

Before we begin a serious discussion of probability assignments like this, we need mathematical tools from linear algebra.

Review of vector and matrix algebra#

Vectors#

Vectors are the fundemental mathemtical object that we will work with in this unit. Vectors are defined as a list of items where each item is an element from the real line. For example

(217)#\[\begin{align} v_{1} = \begin{bmatrix} 1 \\ -1\\ 1/2\\ \end{bmatrix} \end{align}\]

is the vector \(v_{1}\) that contains the value one in the first position, value negative one in the second position and the value one-half in the third position.

Vectors can interact with elements from \(\mathbb{R}\) called scalars as well as interact with one another.

\(\alpha x\) and \(\alpha + x\)#

Given a scalar \(\alpha\), vector \(x \in \mathbb{R}^{n}\), then the new vector \(\alpha x\) is defined as

(218)#\[\begin{align} \alpha x = \begin{bmatrix} \alpha x_{1}\\ \alpha x_{2}\\ \vdots \\ \alpha x_{n} \end{bmatrix} \end{align}\]

and the new vector \(\alpha + x\) is defined as

(219)#\[\begin{align} \alpha+x = \begin{bmatrix} \alpha + x_{1}\\ \alpha + x_{2}\\ \vdots \\ \alpha + x_{n} \end{bmatrix} \end{align}\]

\(x+y\) and \(xy\)#

Given vectors \(x\) and \(y\) both in \(\mathbb{R}^{n}\) then the new vector \(x+y\) is defined as

(220)#\[\begin{align} x+y = \begin{bmatrix} x_{1} + y_{1}\\ x_{2} + y_{2}\\ \vdots \\ x_{n} + y_{n} \end{bmatrix} \end{align}\]

and the new vector \(xy\) is defined as

(221)#\[\begin{align} xy = \begin{bmatrix} x_{1} y_{1}\\ x_{2} y_{2}\\ \vdots \\ x_{n} y_{n} \end{bmatrix} \end{align}\]

An approach to defining a set of vectors#

On occasion it is important to describe a set of vectors, called a vector space. The canonical approach to describe a set of vectors is to generate all vectors in a vector space \(V\) as multiplicative and additive combinations of some core set of vectors. In otherwords, we can generate a set of vectors by picking a small number of vectors \(y_{1}, y_{2}, y_{3}\) and building a vector space \(V\) as

(222)#\[\begin{align} V = \left \{ \alpha_{1} y_{1} + \alpha_{2} y_{2} + \alpha_{3} y_{3} | \alpha_{1}, \alpha_{2}, \alpha_{3} \in \mathbb{R} ; y_{1}, y_{2} ,y_{3} \in \mathbb{R}^{n} \right \} \end{align}\]

For example, we may decide to generate the vector space \(V\) using these two vectors:

(223)#\[\begin{align} y_{1} = \begin{bmatrix} 1 \\ 2 \\ \end{bmatrix} \; \; % y_{2} = \begin{bmatrix} -1 \\ 3 \\ \end{bmatrix} \end{align}\]

Then in the space we created contains vectors like

(224)#\[\begin{align} \begin{bmatrix} -1 \\ 3 \end{bmatrix} &= 0\begin{bmatrix} 1 \\ 2 \end{bmatrix} + 1\begin{bmatrix} -1 \\ 3 \end{bmatrix} \\ \end{align}\]

or the vector

(225)#\[\begin{align} \begin{bmatrix} 1 \\ 2 \end{bmatrix} &= 1\begin{bmatrix} 1 \\ 2 \end{bmatrix} + 0\begin{bmatrix} -1 \\ 3 \end{bmatrix} \end{align}\]

or

(226)#\[\begin{align} \begin{bmatrix} 2 \\ 4 \end{bmatrix} & = 2\begin{bmatrix} 1 \\ 2 \end{bmatrix} + 0\begin{bmatrix} -1 \\ 3 \end{bmatrix} \end{align}\]

One can imagine defining then different vector spaces, and so then it is natural to ask about functions from one vector space to a second.

In otherwords, we can define functions over vectors like

(227)#\[\begin{align} f(v) = \sin(v_{1}/2) + \left(v_{2}\right)^{2} - \sqrt{v_{3}} \end{align}\]

However, an important class of functions are linear functions. A function is linear if

(228)#\[\begin{align} f(\alpha x) &= \alpha f(x) \\ f(x + y) &= f(x) + f(y) \end{align}\]

As it turns out, linear functions have a unique representation. Consider a vector space \(V\) generated by the two vectors \(x_{1}\) and \(x_{2}\). In addition, consider a function \(g\) that is linear and maps the vector space \(V\) to the new space \(Q\).

Then for a vector \(x\) in the space \(V\), we can write

(229)#\[\begin{align} x = \alpha_{1} x_{1} + \alpha_{2} x_{2} \end{align}\]

If we wanted to apply the function \(g\) to the vector \(x\) then

(230)#\[\begin{align} g(x) = g\left(\alpha_{1} x_{1} + \alpha_{2} x_{2}\right) \end{align}\]

However, remember that \(g\) is a linear function. This means that we can simplify the above as

(231)#\[\begin{align} g(x) &= g\left(\alpha_{1} x_{1} + \alpha_{2} x_{2}\right) \\ &= g\left(\alpha_{1} x_{1}\right) + g\left(\alpha_{2} x_{2}\right) \\ &= \alpha_{1} g\left(x_{1}\right) + \alpha_{2} g\left(x_{2}\right) \\ \end{align}\]

But, because \(g\) maps vectors from \(V\) to \(Q\) these new vectors \(g\left(x_{1}\right)\) and \(g\left(x_{2}\right)\) are members of \(Q\). If we suppose that every vector in \(Q\) can be generated from (say) the three vectors \(q_{1}, q_{2}, q_{3}\) then we can rewrite \(g\left(x_{1}\right)\) in terms of these three \(q\)s as well as \(g\left(x_{2}\right)\).

In other words,

(232)#\[\begin{align} g(x) &= \alpha_{1} g\left(x_{1}\right) + \alpha_{2} g\left(x_{2}\right) \\ &= \alpha_{1} \left( \beta^{1}_{1}q_{1} +\beta^{2}_{1}q_{2} +\beta^{3}_{1}q_{3} \right) + \alpha_{2} \left( \beta^{1}_{2}q_{1} +\beta^{2}_{2}q_{2} +\beta^{3}_{2}q_{3} \right) \\ &= \left( \alpha_{1} \beta^{1}_{1} + \alpha_{2} \beta^{1}_{2} \right) q_{1} + \left( \alpha_{1} \beta^{2}_{1} + \alpha_{2} \beta^{2}_{2} \right) q_{2} + \left( \alpha_{1} \beta^{3}_{1} + \alpha_{2} \beta^{3}_{2} \right) q_{3} \end{align}\]

or \(g(x)\) can be represented as the vector

(233)#\[\begin{align} g(x) = \begin{bmatrix} \alpha_{1} \beta^{1}_{1} + \alpha_{2} \beta^{1}_{2} \\ \alpha_{1} \beta^{2}_{1} + \alpha_{2} \beta^{2}_{2} \\ \alpha_{1} \beta^{3}_{1} + \alpha_{2} \beta^{3}_{2} \end{bmatrix} \end{align}\]

Matrices#

A matrix is defined as a stack or ordered collection of vectors. Matrices are written as a rectangular grid of values such as

(234)#\[\begin{align} A = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23}\\ a_{31} & a_{32} & a_{33}\\ a_{41} & a_{42} & a_{43}\\ \end{bmatrix} \end{align}\]

We can refer to individual elements inside the matrix using this notation \(A_{ij}\). This refers to the element located in row \(i\) and column \(j\). For example, \(A_{24}\) is the value \(a_{24}\) above. We can refer to an entire row of \(A\) usign this notation \(A_{(i,:)}\) and an entire column of \(A\) using this notation \(A_{(:,j)}\).

Ax#

The matrix \(A\) with \(m\) rows and \(n\) columns, referred to as its dimension, maps a vector of length \(n\) to a new vector of length \(m\). From our discussion of linear functions above, we see that, for example,

(235)#\[\begin{align} Ax = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23}\\ a_{31} & a_{32} & a_{33}\\ a_{41} & a_{42} & a_{43} \end{bmatrix} \begin{bmatrix} x_{1} \\ x_{2} \\ x_{3} \end{bmatrix} = \begin{bmatrix} a_{11} x_{1} + a_{12} x_{2} + a_{13}x_{3} \\ a_{21} x_{1} + a_{22} x_{2} + a_{23}x_{3} \\ a_{31} x_{1} + a_{32} x_{2} + a_{33}x_{3} \\ a_{41} x_{1} + a_{42} x_{2} + a_{43}x_{3} \\ \end{bmatrix} = A_{:,1}x_{1} +A_{:,2}x_{2} +A_{:,3}x_{3} +A_{:,4}x_{4} \end{align}\]

, or we multiply the first column in \(A\) by the first element in \(x\) plus the second column in \(A\) with the second \(element\) in \(x\) and so on.

Another approach towards matrix multiplication uses the inner product. Given a vector \(x\) and \(y\), the inner product is defined as

(236)#\[\begin{align} x'y = x^{T}y = \sum_{i} x_{i}y_{i} \end{align}\]

It is important to note some properties of the inner product. The inner product is a linear operator or, for constant \(\alpha\),

(237)#\[\begin{align} (\alpha x)' y &= \alpha x'y \\ (x+y)' z &= x'z + y'z \end{align}\]

These properties come from the fact that the inner product is a summation over the individual elements in \(x\) and \(y\). With the inner product, we can write matrix multiplication using the rows of \(A\) as

(238)#\[\begin{align} Ax = \begin{bmatrix} A_{1,:}'x \\ A_{2,:}'x \\ A_{3,:}'x \\ A_{4,:}'x \\ \end{bmatrix} \end{align}\]

AB#

Recall that the matrix \(A\) was a represenation of a linear function that input a vector \(x\) and output a vector \(y\). Another common operation is function composition, and function composition can be represented as the product of two matrices.

Consider a space \(V\) that is generated from vectors \(x_{1} and x_{2}\) that are each length 3 vectors. In addition, consider two functions, \(g\) and \(f\), that map a vector \(x\) from \(V\) onto a new space. We’ll assume the new space created by \(g(x)\) for all \(x\) is called \(G\) and the corresponsing space for \(f\) is called \(F\).

Then function composition results in

(239)#\[\begin{align} f( g(x) ) &= f( g( \alpha_{1}x_{1} + \alpha_{2}x_{2} + \alpha_{3}x_{3} ) ) \\ &= f( \alpha_{1} g(x_{1}) + \alpha_{2} g(x_{2}) + \alpha_{3} g(x_{3}) ) \\ &= \alpha_{1} f[ g(x_{1}) ] + \alpha_{2} f[ g(x_{2}) ] + \alpha_{3} f[ g(x_{3}) ) ] \\ \end{align}\]

When we look at \( f[ g(x_{1}) ]\) we notice that \(g(x_{1})\) can be represented by its own set of vectors, say, \(y_{1}\) and \(y_{2}\). Then

(240)#\[\begin{align} \alpha_{1} f[ g(x_{1}) ] = \alpha_{1} f[ \beta^{1}_{1} y_{1} + \beta^{1}_{2} y_{2} ] \end{align}\]

and we can further assume \(f(x)\) also has its own set of vectors that can be used to represent it. Lets say these three vectors \(z_{1}, z_{2}, z_{3}\). Then

(241)#\[\begin{align} \alpha_{1} f[ g(x_{1}) ] &= \alpha_{1} f[ \beta^{1}_{1} y_{1} + \beta^{1}_{2} y_{2} ] \\ &= \alpha_{1} \left(\beta^{1}_{1} f[y_{1}] + \beta^{1}_{2} f[y_{2}]\right) \\ &= \alpha_{1} \left(\beta^{1}_{1} ( \gamma^{1}_{1}z_{1} + \gamma^{1}_{2}z_{2} + \gamma^{1}_{3}z_{3} ) + \beta^{1}_{2} ( \gamma^{2}_{1}z_{1} + \gamma^{2}_{2}z_{2} + \gamma^{2}_{3}z_{3} )\right) \\ \end{align}\]

Again, our goal is to represent this vector using the three \(z\)s and so we want to group by \(z_{1}\),\(z_{2}\), and \(z_{3}\).

(242)#\[\begin{align} \alpha_{1} \left\{ ( \beta^{1}_{1}\gamma^{1}_{1} + \beta^{1}_{2} \gamma^{2}_{1}) z_{1} + ( \beta^{1}_{1}\gamma^{1}_{2} + \beta^{1}_{2} \gamma^{2}_{2}) z_{2} + ( \beta^{1}_{1}\gamma^{1}_{3} + \beta^{1}_{2} \gamma^{2}_{3}) z_{3} \right\} \end{align}\]

Lets use an inner product to rewrite the above,

(243)#\[\begin{align} \alpha_{1} \left\{ ( \beta^{1}_{1}\gamma^{1}_{1} + \beta^{1}_{2} \gamma^{2}_{1}) z_{1} + ( \beta^{1}_{1}\gamma^{1}_{2} + \beta^{1}_{2} \gamma^{2}_{2}) z_{2} + ( \beta^{1}_{1}\gamma^{1}_{3} + \beta^{1}_{2} \gamma^{2}_{3}) z_{3} \right\} &= (G \beta_{:,1}) \alpha_{1} \end{align}\]

where \begin{align} \beta_{:,1} = \begin{bmatrix} \beta^{1}{1}\ \beta^{1}{2} \end{bmatrix} ; ; ; G = \begin{bmatrix} \gamma^{1}{1} & \gamma^{2}{1} \ \gamma^{1}{2} & \gamma^{2}{2}\ \gamma^{1}{3} & \gamma^{2}{3}\ \end{bmatrix} \end{align}

In the same way, we would simplify \(\alpha_{2} f[ g(x_{2}) ]\) and find that

(244)#\[\begin{align} f[ g(x_{2}) ] = (G \beta_{:,2}) \alpha_{2} \\ \beta_{:,2} = \begin{bmatrix} \beta^{2}_{1}\\ \beta^{2}_{2} \end{bmatrix} \end{align}\]

and so then

(245)#\[\begin{align} \alpha_{1} f(g(x_{1})) + \alpha_{2} f(g(x{2})) &= \alpha_{1} G \beta_{:,1} + \alpha_{2} G \beta_{:,2} \\ &=\begin{bmatrix} G \beta_{:,1} & G \beta_{:,2} \end{bmatrix} \begin{bmatrix} \alpha_{1} \\ \alpha_{2} \end{bmatrix} \end{align}\]

Finally, we define the product between the matrix \(G\) and the matrix \(B\) as

(246)#\[\begin{align} G B = \begin{bmatrix} G \beta_{:,1} & G \beta_{:,2} \end{bmatrix} \end{align}\]

where \(B\) is a matrix that stacks, column-wise, the vectors \(\beta_{:,1}\) and \(\beta_{:,2}\)

Using the inner product, we can also write the matrix product between \(G\) and \(B\) as

(247)#\[\begin{align} GB_{ij} = \begin{bmatrix} G_{i,:} ' B_{:,j} \end{bmatrix} \end{align}\]

\(AB = A A^{-1} = I\)#

Just like in univariate algebra there exists (or not) a function inverse, \(f^{-1}\), there can also exist a matrix inverse. A matrix \(B\) is an inverse to a matrix \(A\) is

(248)#\[\begin{align} AB = BA = I \end{align}\]

We denote the matrix inverse to \(A\) as the matrix \(A^{-1}\).

Expectation algebra (Part 1)#

The expected value, and so the covariance and the variance, have many convienant properties due to the fact that the expected value is linear. i call these identities Expectation algebra.

First, we should make clear that the expected value is linear

Given a random variable \(X\) and constant \(\alpha \in \mathbb{R}\),

(249)#\[\begin{align} \mathbb{E}(\alpha X) &= \alpha \mathbb{E}(X) \\ \mathbb{E}(X + Y) &= \mathbb{E}(X) + \mathbb{E}(Y) \\ \end{align}\]

and so for X,Y r.v.s and \(\alpha,\beta \in \mathbb{R}\)

(250)#\[\begin{align} \mathbb{E}(\alpha X + \beta Y) &= \alpha \mathbb{E}(X) + \beta \mathbb{E}(Y) \\ \end{align}\]

If we consider the vectors

(251)#\[\begin{align} \lambda = \begin{bmatrix} \alpha & \beta \end{bmatrix} ; \; \; Z = \begin{bmatrix} X \\ Y \end{bmatrix} \end{align}\]

then we can write

(252)#\[\begin{align} \mathbb{E}( \lambda Z) = \lambda \mathbb{E}(Z) \end{align}\]

where \(\mathbb{E}(Z)\) applies the expectation to each random variable in the vector, or

(253)#\[\begin{align} \mathbb{E}(Z) = \mathbb{E}\left( \begin{bmatrix} X \\ Y \end{bmatrix} \right) = \begin{bmatrix} \mathbb{E}(X) \\ \mathbb{E}(Y) \end{bmatrix} \end{align}\]

Because the Covariance is also an expected value, it also has important properties.

Given a random variable \(X\), \(\text{Cov}(X,X) = \mathbb{E}[ (X-\mathbb{E}(X)) (X-\mathbb{E}(X)) ] = \mathbb{E}[(X-\mathbb{E}(X))^{2}] = \mathbb{V}(X)\).

The covariance is linear in its first and second position separately. In other words,

(254)#\[\begin{align} \text{Cov}(\alpha X, Y) &= \alpha \text{Cov}(X,Y) \\ \text{Cov}(X, \beta Y) &= \beta \text{Cov}(X,Y) \\ \text{Cov}(X + Y , Z) &= \text{Cov}(X,Z) + \text{Cov}(Y,Z) \\ \text{Cov}(X , Y+Z) &= \text{Cov}(X,Y) + \text{Cov}(X,Z) \\ \end{align}\]

We can put these properties above all together to show that

(255)#\[\begin{align} \text{Cov}(a X + bY , c Q + d R) &= \text{Cov}(a X, c Q) + \text{Cov}(a X , d R) + \text{Cov}(bY, cQ) + \text{Cov}(bY, dR) \\ &= ac\text{Cov}(X, Q) + ad \text{Cov}(X , R) + bc\text{Cov}(Y, Q) + bd \text{Cov}(Y, R) \end{align}\]

We will further discuss below Expectation algebra that concerns the covariance when it is needed.

Normal + Normal = Normal#

Given two random variables, \(X \sim \mathcal{N}(\mu_{x}, \sigma_{x})\) and \(Y \sim \mathcal{N}(\mu_{y}, \sigma_{y})\), there is a highly unique property of the Normal: the sum of two (or more) Normal densities is itself another Normal distribution.

To be clear, the sum of two variables with the same density rarely, if ever, is distributed the same as the original two random variables. This property is so unique that sometimes the Normal density is defined as that density such that when summed is itself.

To be clear, the new normal density that is produced from a sum has a different expected value and variance than the original two random variables,

(256)#\[\begin{align} (X+Y) \sim \mathcal{N}( \mu_{x} + \mu_{y}, \sigma_{x}^{2} + \sigma_{y}^{2} + 2 \sigma_{xy} ) \end{align}\]

where \(\sigma_{xy} = \text{Cov}(X,Y)\).

This result is difficult to prove by definition because

(257)#\[\begin{align} f_{X+Y}(z) = \int_{x = -\infty}^{\infty} \int_{y = -\infty}^{\infty} f_{X}(x) f_{Y}(z-x) \;dy \;dx \end{align}\]

However, we can note that a Normal density is completely defined from its expected value variance. That is, recall that

(258)#\[\begin{align} X &\sim \mathcal{N}( \mu_{x}, \sigma^{2}_{x} ) \\ \mathbb{E}(X) &= \mu_{x} ; \; \mathbb{V}(X) = \sigma_{x}^{2} \end{align}\]

Lets use expedctation algebra for the random variable \(X+Y\) to find the expected value and variance.

First, the expected value:

(259)#\[\begin{align} \mathbb{E}(X+Y) = \mathbb{E}(X) + \mathbb{E}(Y) = \mu_{x} + \mu_{y} \\ \end{align}\]

and then the variance

(260)#\[\begin{align} \mathbb{V}(X+Y) &= \text{Cov}(X+Y,X+Y) \\ &= \text{Cov}(X,X) + \text{Cov}(X,Y) + \text{Cov}(Y,X) + \text{Cov}(Y,Y) \\ &= \text{Cov}(X,X) + 2 \text{Cov}(X,Y) + \text{Cov}(Y,Y) \\ &= \mathbb{V}(X) + 2 \text{Cov}(X,Y) + \mathbb{V}(Y) \\ &= \sigma^{2}_{x} + 2 \sigma_{xy} + \sigma^{2}_{y} \\ \end{align}\]

This expected value and variance completely define \(X+Y\), and we can further generalize this ideas as

(261)#\[\begin{align} (\alpha X + \beta Y) \sim \mathcal{N}( \alpha \mu_{x} + \beta \mu_{y}, \alpha^{2}\sigma^{2}_{x} + \beta^{2}\sigma^{2}_{y} + 2\alpha\beta\sigma_{xy} ) \end{align}\]

If we form the vector of constants \(\lambda = [\alpha, \beta]\) and the vector of random variables \(X = \begin{bmatrix}X_{1}\\X_{2}\end{bmatrix}\) then we could rewrite the above as

(262)#\[\begin{align} \mathbb{E}(\alpha X_{1} + \beta X_{2}) = \mathbb{E}(\lambda X) = \lambda \mathbb{E}(X) = \lambda \mu = \alpha \mu_{x} + \beta \mu_{y} \\ \end{align}\]

where \(\mu = [\mathbb{E}(X_{1}),\mathbb{E}(X_{2})]'\).

We can also take the same steps for the variance to find

(263)#\[\begin{align} \mathbb{V}(\alpha X_{1} + \beta X_{2}) = \text{Cov}(\alpha X_{1} + \beta X_{2},\alpha X_{1} + \beta X_{2}) = \alpha^{2} \text{Cov}(X,X) + \alpha \beta \text{Cov}(X,Y) + \beta \alpha \text{Cov}(Y,X) + \beta^{2} \text{Cov}(Y,Y) \\ \end{align}\]

The goal here is to write the above using the vector \(\lambda = [\alpha, \beta]\). The trick is to store the four covariances into a \(2 \times 2\) matrix

(264)#\[\begin{align} \Sigma = \begin{bmatrix} \text{Cov}(X,X) & \text{Cov}(X,Y) \\ \text{Cov}(Y,X) & \text{Cov}(Y,Y) \end{bmatrix} \end{align}\]

\(\Sigma\) contains all four covariances (remember the variance is a special type of covariance). This matrix \(\Sigma\) is called the covariance matrix.

Then we can write

(265)#\[\begin{align} \mathbb{V}( \lambda X ) = &= \lambda \Sigma \lambda^{'} \end{align}\]

These are the tools that we need to describe the Multivariate Normal as well as any linear transformation of a vector of Normally distributed random variables.

An inner product view point for the MVN#

Define \(\lambda \in \mathbb{R}^{D}\) be a \(1 \times D\) length vector and define \(X\) as a \(D \times 1\) length vector of random variables. Then \(X\) has a multivariate normal distribution if

(266)#\[\begin{align} \lambda X \sim \mathcal{N}( \lambda \mu , \lambda \Sigma \lambda^{'} ) \end{align}\]

This is a powerful viewpoint, and one that we can take even further by considering a “stack” of inner products.

Consider the matrix \(A_{m \times d}\) that is a stack of \(m\) vectors, \(\lambda_{1}, \lambda_{2}, \cdots, \lambda_{m}\). To be clear, we can write

(267)#\[\begin{align} A = \begin{bmatrix} --- & \lambda_{1} & ---\\ --- & \lambda_{2} & ---\\ --- &\vdots & ---\\ --- &\lambda_{m} & ---\\ \end{bmatrix} \end{align}\]

Also consider the vector of random variables \(X = [X_{1},X_{2},\cdots,X_{d}]^{'}\). Then

(268)#\[\begin{align} \mathbb{E}(Ax) &= \begin{bmatrix} A_{(1,:)} \mathbb{E}(X)\\ A_{(2,:)} \mathbb{E}(X) \\ \vdots \\ \end{bmatrix}_{m \times 1} \\ &= A \mathbb{E}(X) = A \mu \end{align}\]

Though \(\lamba x\) creates a single values, \(Ax\) creates a vector and so we need to discuss then a covariance matrix. Recall from above that the Coviariance matrix is a matrix where the entry in row \(i\) and column \(j\) is \(\text{Cov}(X_{i},X_{j})\).

In our example, the \((i,j)\) entry is then \(\text{Cov}(A_{i} X,A_{j}X)\) and we know that this equals \(A_{i} \text{Cov}(X,X)A_{j}^{'}\). Then the covariance matrix is

(269)#\[\begin{align} \Sigma_{Ax} &= \begin{bmatrix} A_{1} \text{Cov}(X_{1},X_{1})A_{1}^{'} & A_{1} \text{Cov}(X_{1},X_{2})A_{2}^{'} & \cdots \\ A_{2} \text{Cov}(X_{2},X_{1})A_{1}^{'} & A_{2} \text{Cov}(X_{2},X_{2})A_{2}^{'} & \cdots \\ A_{3} \text{Cov}(X_{3},X_{1})A_{1}^{'} & A_{3} \text{Cov}(X_{3},X_{2})A_{2}^{'} & \cdots \\ \vdots & \vdots & \ddots \\ \end{bmatrix} \\ &= \begin{bmatrix} A_{1} & A_{2} & \cdots & A_{d}] \end{bmatrix} \begin{bmatrix} \text{Cov}(X_{1},X_{1})A_{1}^{'} & \text{Cov}(X_{1},X_{2})A_{2}^{'} & \cdots \\ \text{Cov}(X_{2},X_{1})A_{1}^{'} & \text{Cov}(X_{2},X_{2})A_{2}^{'} & \cdots \\ \text{Cov}(X_{3},X_{1})A_{1}^{'} & \text{Cov}(X_{3},X_{2})A_{2}^{'} & \cdots \\ \vdots & \vdots & \ddots \\ \end{bmatrix} \\ &= A \begin{bmatrix} \text{Cov}(X_{1},X_{1}) & \text{Cov}(X_{1},X_{2}) & \cdots \\ \text{Cov}(X_{2},X_{1}) & \text{Cov}(X_{2},X_{2}) & \cdots \\ \text{Cov}(X_{3},X_{1}) & \text{Cov}(X_{3},X_{2}) & \cdots \\ \vdots & \vdots & \ddots \\ \end{bmatrix} \begin{bmatrix} A_{1}^{'} \\ A_{2}^{'} \\ \vdots \\ A_{d}^{'} \end{bmatrix}\\ &= A \text{Cov}(X,X)A^{'} \end{align}\]

where we define \(\text{Cov}(X,X)\) as

(270)#\[\begin{align} \text{Cov}(X,X) = \begin{bmatrix}\text{Cov}(X_{1},X_{1}) & \text{Cov}(X_{1},X_{2}) & \cdots \\ \text{Cov}(X_{2},X_{1}) & \text{Cov}(X_{2},X_{2}) & \cdots \\ \text{Cov}(X_{3},X_{1}) & \text{Cov}(X_{3},X_{2}) & \cdots \\ \vdots & \vdots & \ddots \\ \end{bmatrix} \end{align}\]

Expectation algebra (Part 2)#

The part two of expedctation linear algebra concerns the Covariance matrix.

Given a vector of random variables, the covariance matrix defined as above is equal to

(271)#\[\begin{align} \text{Cov}(X,X) &= \begin{bmatrix}\text{Cov}(X_{1},X_{1}) & \text{Cov}(X_{1},X_{2}) & \cdots \\ \text{Cov}(X_{2},X_{1}) & \text{Cov}(X_{2},X_{2}) & \cdots \\ \text{Cov}(X_{3},X_{1}) & \text{Cov}(X_{3},X_{2}) & \cdots \\ \vdots & \vdots & \ddots \\ \end{bmatrix} \\ &=\begin{bmatrix}\mathbb{E}\left[(X_{1}-\mu_{1})(X_{1}-\mu_{1})\right] & \mathbb{E}\left[(X_{1}-\mu_{1})(X_{2}-\mu_{2})\right] & \cdots \\ \mathbb{E}\left[(X_{2}-\mu_{2})(X_{1}-\mu_{1})\right] & \mathbb{E}\left[(X_{2}-\mu_{2})(X_{2}-\mu_{2})\right] & \cdots \\ \mathbb{E}\left[(X_{3}-\mu_{3})(X_{1}-\mu_{1})\right] & \mathbb{E}\left[(X_{3}-\mu_{3})(X_{2}-\mu_{2})\right] & \cdots \\ \vdots & \vdots & \ddots \\ \end{bmatrix}\\ &= \mathbb{E}\left( \begin{bmatrix}\left[(X_{1}-\mu_{1})(X_{1}-\mu_{1})\right] & \left[(X_{1}-\mu_{1})(X_{2}-\mu_{2})\right] & \cdots \\ \left[(X_{2}-\mu_{2})(X_{1}-\mu_{1})\right] & \left[(X_{2}-\mu_{2})(X_{2}-\mu_{2})\right] & \cdots \\ \left[(X_{3}-\mu_{3})(X_{1}-\mu_{1})\right] & \left[(X_{3}-\mu_{3})(X_{2}-\mu_{2})\right] & \cdots \\ \vdots & \vdots & \ddots \\ \end{bmatrix}\right)\\ &=\mathbb{E}\left( [X-\mu][X-\mu]^{'} \right)\\ \end{align}\]

Then the familar covariance matrix for \(Ax\) falls out directly from this definition.

(272)#\[\begin{align} \text{Cov}(Ax,Ax) &= \mathbb{E}( [Ax-A\mu][Ax-A\mu]^{'} ) \\ &= \mathbb{E}( A(x-\mu)(A(x-\mu))^{'} ) \\ &= \mathbb{E}( A(x-\mu)(x-\mu)^{'}A^{'} ) \\ &= A \mathbb{E}( (x-\mu)(x-\mu)^{'} )A^{'} \\ &= A \text{Cov}(x,x) A^{'} \\ \end{align} \]

Marginal#

With the inner-product view of the MVN, determining the marginal density is trivial. Given a vector \(X \sim \text{MVN}(\mu, \Sigma)\) where \(\mathbb{E}(X) = \mu\) and the covariance matrix for \(X\) is \(\text{Cov}(X,X) = \Sigma\), the marginal density for \(X_{j}\) is \(\lambda X\) where the vector \(\lambda\) is a vector with the value one in the \(j^{\text{th}}\) position and the value zero otherwise. Then,

(273)#\[\begin{align} \lambda X &\sim \mathcal{N}( \lambda \mu, \lambda \Sigma \lambda^{'} ) \\ &\sim \mathcal{N}( \mu_{j}, \Sigma_{jj} ) \\ &\sim \mathcal{N}( \mu_{j}, \sigma^{2}_{j} ) \\ \end{align}\]

That is, for a multivariate normal. distribution the marginal density of \(X_{j}\), by integrating out all other random variables, is just a normal density with the jth mean and j,jth entry of the covariance matrix.

Let

(274)#\[\begin{align} \begin{bmatrix} X_{1} \\ X_{2} \\ X_{3} \\ \end{bmatrix} \sim \text{MVN}\left( \begin{bmatrix} -1 \\ 2 \\ 3 \\ \end{bmatrix} , \begin{bmatrix} 1 & 0.5 & 0.3 \\ 0.5 & 2 & 0.4 \\ 0.3 & 0.4 & 3 \\ \end{bmatrix} \right) \end{align}\]

Then \(X_{2} \sim \mathcal{N}( 2, 2 )\)

Conditional#

Though the marginal distribution for the MVN is easy to extarct, the conditional distribution of one or more random variables given one or more others is much more difficult to derive. First, we will write down the solution and then below derive this solution.

Given a vector of random variables \([X_{1}, X_{2}, \cdots X_{n}]^{T}\) that we partition into \(X = [X_{1}, X_{2}, \cdots X_{m}]\) and \(Y = [X_{m+1}, \cdots, X_{n}]\), the conditional distribution of \(X|Y\) is again a multivariate distribution

(275)#\[\begin{align} X | Y &\sim \text{MVN}( \mu_{X|Y} , \Sigma_{X|Y} ) \\ \mu_{X|Y} &= \mu_{X} + \Sigma_{XY} \Sigma_{YY}^{-1}(y - \mu_{y}) \\ \Sigma_{X|Y} &= \Sigma_{XX} - \Sigma_{XY} \Sigma_{YY}^{-1} \Sigma_{YX} \end{align}\]

Derivation#

To derive the above result, we start with the probability density function for \([X,Y]^{T}\). Importantly, a distribution that is conditional on \(Y\) is one such that

(276)#\[\begin{align} f(X|Y=y) = \frac{f(X,Y=y)}{\int_{-\infty}^{\infty} f(X,Y=y) \; dx } \end{align}\]

In other words, the density of this conditional distribution only depends on \(X\). Our goal is to remove any expression that include \(y\) and recognize this density function as a familar distribution.

(277)#\[\begin{align} f(X|Y=y) &= \frac{f(X,Y=y)}{\int_{-\infty}^{\infty} f(X,Y=y) \; dx } \\ &\propto f(X,Y=y) \end{align}\]

For the MVN, this means that

(278)#\[\begin{align} f(\begin{bmatrix} X \\ Y \end{bmatrix}) &= \frac{1}{ (2\pi)^{2/2} |\Sigma|^{1/2} } e^{ -\frac{1}{2} \left(\begin{bmatrix} X \\ Y \end{bmatrix} - \begin{bmatrix} \mu_{X} \\ \mu_{Y}\end{bmatrix}\right)^{T} \Sigma^{-1} \left(\begin{bmatrix} X \\ Y \end{bmatrix} - \begin{bmatrix} \mu_{X} \\ \mu_{Y} \end{bmatrix} \right) } \\ &\propto e^{ -\frac{1}{2} \left(\begin{bmatrix} X \\ Y \end{bmatrix} - \begin{bmatrix} \mu_{X} \\ \mu_{Y}\end{bmatrix}\right)^{T} \Sigma^{-1} \left(\begin{bmatrix} X \\ Y \end{bmatrix} - \begin{bmatrix} \mu_{X} \\ \mu_{Y} \end{bmatrix} \right) }\\ &= e^{ -\frac{1}{2} \left(\begin{bmatrix} X \\ Y \end{bmatrix} - \begin{bmatrix} \mu_{X} \\ \mu_{Y}\end{bmatrix}\right)^{T} \Lambda \left(\begin{bmatrix} X \\ Y \end{bmatrix} - \begin{bmatrix} \mu_{X} \\ \mu_{Y} \end{bmatrix} \right) } \end{align}\]

First, we removed the normalizing constant and defined \(\Lambda = \Sigma^{-1}\). The matrix \(\Lambda\) is often named the Precision matrix.

Our next steps will be to expand the expression in the exponential.

(279)#\[\begin{align} &= e^{ -\frac{1}{2} \left[ (X^{T} - \mu_{X}^{T}) \Lambda_{XX} (X-\mu_{X}) + (X^{T} - \mu_{X}^{T})\Lambda_{XY}(Y-\mu_{Y}) + (Y^{T} - \mu_{Y}^{T})\Lambda_{YX}(X-\mu_{X}) + (Y^{T} - \mu_{Y}^{T})\Lambda_{YY}(Y^{T} - \mu_{Y}^{T}) \right]} \end{align}\]

Lets continue to expand and then remove any items that are a constant with respect to X (ie do not include the variable X).

(280)#\[\begin{align} f(X|Y) & \propto e^{ -\frac{1}{2} \left[ X^{T} \Lambda_{XX} X - X^{T} \Lambda_{XX} \mu_{X} - \mu_{X}^{T} \Lambda_{XX} X + X^{T} \Lambda_{XY} (Y-\mu_{Y}) + (Y-\mu_{Y})^{T} \Lambda_{YX} X \right] } \\ &= e^{ -\frac{1}{2} \left[ X^{T} \Lambda_{XX} X - X^{T} \Lambda_{XX} \mu_{X} + X^{T} - X^{T} \Lambda_{XX} \mu_{X} + \Lambda_{XY} (Y-\mu_{Y}) + X^{T} \Lambda_{XY} (Y-\mu_{Y}) \right] } \\ &= e^{ -\frac{1}{2} \left[ X^{T} \Lambda_{XX} X - 2X^{T} \Lambda_{XX} \mu_{X} + X^{T} + \Lambda_{XY} (Y-\mu_{Y}) + X^{T} \Lambda_{XY} (Y-\mu_{Y}) \right] }\\ &= e^{ -\frac{1}{2} \left[ X^{T} \Lambda_{XX} X - 2X^{T} \Lambda_{XX} \mu_{X} + 2X^{T} \Lambda_{XY} (Y-\mu_{Y}) \right] } \\ &= e^{ -\frac{1}{2} \left[ X^{T} \Lambda_{XX} X - 2X^{T} \left( \Lambda_{XX} \mu_{X} - \Lambda_{XY} (Y-\mu_{Y}) \right) \right] } \\ \end{align}\]

Now we are left to decide if the above function, up to a proportionality constant, looks like a density that we know. Intuitively, this almost looks like a MVN. Lets explore the MVN a bit to see if we can match the above function and traditional MVN.

MVN up to a constant#

The probability density over a vector \(X\) for an \(\text{MVN}(\mu, \Lambda)\) up to a constant looks like this

(281)#\[\begin{align} f(X) &\propto e^{ \left[ (X-\mu)^{T} \Lambda (X-\mu) \right] } \\ &= e^{ \left[ \left(X^{T} - \mu^{T}\right) \Lambda (X-\mu) \right] } \\ &= e^{ \left[ \left(X^{T} - \mu^{T}\right) \left(\Lambda X - \Lambda \mu \right) \right] } \\ &= e^{ \left[ X^{T} \Lambda X - X^{T} \Lambda \mu - \mu^{T} \Lambda X + \mu^{T} \Lambda \mu \right] } \\ &= e^{ \left[ X^{T} \Lambda X - X^{T} \Lambda \mu - X^{T} \Lambda \mu + \mu^{T} \Lambda \mu \right] } \\ &= e^{ \left[ X^{T} \Lambda X - 2X^{T} \Lambda \mu + \mu^{T} \Lambda \mu \right] } \\ & \propto e^{ \left[ X^{T} \Lambda X - 2X^{T} \Lambda \mu \right] } \\ \end{align}\]

We see that, for a traditional multivariate normal density, the precision matrix is uniquely identified as the matrix that sits between \(X^{T}\) and \(X^{T}\). In other words, this term uniquely identifies the precision matrix

(282)#\[\begin{align} \Lambda \implies X^{T} \Lambda X \\ \end{align}\]

We also see that the mean vector, \(\mu\), is identified here

(283)#\[\begin{align} \mu \implies 2X^{T} \Lambda \mu \\ \end{align}\]

Lets apply this idea to find the precision matrix and mean vector for our conditional density. First, we see that for our conditional density we have a term \(X^{T} \Lambda_{XX}X\). That is, the precision matrix for our conditional density, which we will call \(\Lambda_{X|Y}\) is \(\Lambda_{X|Y} = \Lambda_{XX}\).

Now that we have the precision matrix, we can match up the remaining that accompanies \(2 X^{T}\) to find that

(284)#\[\begin{align} \Lambda_{XX} \mu_{X} - \Lambda_{XY} (Y-\mu_{Y}) &= \Lambda_{X|Y} \mu \\ \Lambda_{XX} \mu_{X} - \Lambda_{XY} (Y-\mu_{Y}) &= \Lambda_{XX} \mu \\ \Lambda_{XX}^{-1} \left(\Lambda_{XX} \mu_{X} - \Lambda_{XY} (Y-\mu_{Y})\right) &= \Lambda_{XX}^{-1}\Lambda_{XX} \mu \\ \Lambda_{XX}^{-1} \left(\Lambda_{XX} \mu_{X} - \Lambda_{XY} (Y-\mu_{Y})\right) &= \mu \\ \mu_{X} - \Lambda_{XX}^{-1} \Lambda_{XY} (Y-\mu_{Y}) &= \mu \\ \end{align}\]

That is, the conditional distribution of \(X|Y\) does indeed have a multivariate normal density like

(285)#\[\begin{align} X|Y &\sim \text{MVN}( \mu_{X|Y}, \Lambda_{X|Y} ) \\ \mu_{X|Y} &= \mu_{X} - \Lambda_{XX}^{-1} \Lambda_{XY} (Y-\mu_{Y}) \\ \Lambda_{X|Y} &= \Lambda_{XX} \end{align}\]

Though we found the conditional proability distribution of \(X\) given \(Y\), this is in terms of the precision matrix \(\Lambda\). Ideally, we would also want to be able to express this distribution in terms of \(\Sigma\) as well.

Our approach for defining the above in terms of \(\Sigma\) is to recognize that \(\Sigma^{-1} = \Lambda\). In other words, we need to find the inverse of \(\Lambda\) to define \(\Sigma\)

(286)#\[\begin{align} \begin{bmatrix} \Lambda_{XX} & \Lambda_{XY} \\ \Lambda_{YX} & \Lambda_{YY} \end{bmatrix} \begin{bmatrix} A & B \\ C & D \end{bmatrix} = \begin{bmatrix} I_{XX} & 0_{XY} \\ 0_{YX} & I_{YY} \end{bmatrix} \end{align}\]

This identitiy implies that, for example,

(287)#\[\begin{align} \Lambda_{XX} A + \Lambda_{XY} C = I_{XX} \\ \Lambda_{YX} C + \Lambda_{YY} D = 0 \\ C = - \Lambda^{-1}_{YX} \Lambda_{YY} D \end{align}\]

and so then

(288)#\[\begin{align} \Sigma_{XX}\Lambda_{XY} + \Sigma_{XY}\Lambda_{YY} = 0 \\ \Sigma_{XX}\Lambda_{YX} + \Sigma_{XY}\Lambda_{YY} = 0 \\ \Lambda_{YX} = - \Sigma^{-1}_{XX}\Sigma_{XY}\Lambda_{YY} \\ \end{align}\]

and

(289)#\[\begin{align} \Sigma_{XX}\Lambda_{XX} - \Sigma_{XY}(\Sigma^{-1}_{XX}\Sigma_{XY}\Lambda_{YY}) = I_{XX}\\ \end{align}\]
import numpy as np
import matplotlib.pyplot as plt
import ipywidgets as widgets
from scipy.stats import multivariate_normal, norm

def plot_mvn_marginal_conditional(mu1=0.0, mu2=0.0, rho=0.0, x2_star=0.0):
    sigma1 = 1.0
    sigma2 = 1.0

    mu = np.array([mu1, mu2])
    Sigma = np.array([
        [sigma1**2, rho * sigma1 * sigma2],
        [rho * sigma1 * sigma2, sigma2**2]
    ])

    x = np.linspace(-5, 5, 300)
    y = np.linspace(-5, 5, 300)
    X, Y = np.meshgrid(x, y)

    pos = np.dstack((X, Y))
    Z = multivariate_normal(mean=mu, cov=Sigma).pdf(pos)

    # Marginals
    fx = norm(loc=mu1, scale=sigma1).pdf(x)
    fy = norm(loc=mu2, scale=sigma2).pdf(y)

    # Conditional: X1 | X2 = x2_star
    cond_mean = mu1 + rho * sigma1 / sigma2 * (x2_star - mu2)
    cond_var = sigma1**2 * (1 - rho**2)
    cond_sd = np.sqrt(cond_var)
    f_cond = norm(loc=cond_mean, scale=cond_sd).pdf(x)

    fig = plt.figure(figsize=(8, 7))

    ax_joint = fig.add_axes([0.15, 0.15, 0.55, 0.55])
    ax_x = fig.add_axes([0.15, 0.72, 0.55, 0.18], sharex=ax_joint)
    ax_y = fig.add_axes([0.72, 0.15, 0.18, 0.55], sharey=ax_joint)

    # Joint density contours
    ax_joint.contour(X, Y, Z, levels=3, colors="black")
    ax_joint.scatter(mu1, mu2, s=60, color="black")
    ax_joint.axhline(0, linewidth=0.5)
    ax_joint.axvline(0, linewidth=0.5)

    # Show the conditioning value X2 = x2_star
    ax_joint.axhline(x2_star, linestyle="--", linewidth=1.5)

    ax_joint.set_xlim(-5, 5)
    ax_joint.set_ylim(-5, 5)
    ax_joint.set_aspect("equal")
    ax_joint.set_xlabel(r"$X_1$")
    ax_joint.set_ylabel(r"$X_2$")

    # Marginal density of X1
    ax_x.plot(x, fx, label=r"$f_{X_1}(x_1)$")
    ax_x.plot(x, f_cond, linestyle="--", label=rf"$f_{{X_1 \mid X_2}}(x_1 \mid {x2_star:.2f})$")
    ax_x.axvline(cond_mean, linestyle="--", linewidth=1)
    ax_x.legend(fontsize=8)
    ax_x.tick_params(labelbottom=False)

    # Marginal density of X2
    ax_y.plot(fy, y, label=r"$f_{X_2}(x_2)$")
    ax_y.axhline(x2_star, linestyle="--", linewidth=1.5)
    ax_y.tick_params(labelleft=False)

    fig.suptitle(
        rf"$X_1 \mid X_2={x2_star:.2f} \sim N({cond_mean:.2f}, {cond_var:.2f})$",
        y=0.98
    )

    plt.show()

widgets.interact(
    plot_mvn_marginal_conditional,
    mu1=widgets.FloatSlider(value=0, min=-3, max=3, step=0.1, description=r"$\mu_1$"),
    mu2=widgets.FloatSlider(value=0, min=-3, max=3, step=0.1, description=r"$\mu_2$"),
    rho=widgets.FloatSlider(value=0, min=-0.95, max=0.95, step=0.05, description=r"$\rho$"),
    x2_star=widgets.FloatSlider(value=0, min=-3, max=3, step=0.1, description=r"$x_2^*$")
);