Random variables#

Introduction#

The foundations of probability are built on sets, yet data is more naturally stored and more easily computed on if it is represented numerically.

Random variables match each outcome in our sample space to a value on the number line.

In addition to computational advantages, random variables help us extract from our data the most important characteristics, and they serve as building blocks which we can use to create powerful models. Random variables are also a language we can use to communicate our modeling efforts to other mathematicians, statisticians, and data scientists.

Suppose we hypothesize that the frequency of social media posts on some popular outlet is related to influenza-like illness (ILI)—a syndromic diagnosis suggesting a patient may have influenza. A patient is diagnosed with ILI if their temperature is measured at or above 38°C and they present with flu-like symptoms. Because influenza is most active in winter and spring, we collect a random sample of social media posts each day from September to May, and we also collect, at the US national level, the proportion of patients admitted to the hospital who are diagnosed with influenza-like illness.

The above hypothesis, data collection, and future inference have numerous details. However, we will see shortly that we can simplify our hypothesis by using random variables.

Maps from the sample space to the number line#

Given a sample space \(\mathcal{G}\), a **random variable** (e.g. \(X\)) is a function from each element in \(\mathcal{G}\)—from each outcome—to a value on the real number line. The real number line contains all numbers, integer and decimal, from negative to positive infinity.

**Example:** Suppose our sample space contains two elements, \(\mathcal{G} = \{ a,b \}\). We may decide to define a random variable \(X\) that maps the outcome \(a\) to the value \(-1\) and the outcome \(b\) to the value \(1\). In other words, \(X(a)=-1\) and \(X(b)=1\). We could just as well define a random variable \(Y\) on the same sample space such that \(Y(a)=0\) and \(Y(b)=1\).

**Example:** Suppose our sample space contains all integers from 0 to 1000, \(\mathcal{G} = \{0,1,2,3,\cdots,1000 \}\). We may be most interested in whether an integer is even or odd, and so we can define a random variable \(Y\) such that \(Y(y)=0\) when \(y\), our outcome, is an odd integer and \(Y(y)=1\) when \(y\) is even. This is an example of how a random variable can distill a sample space with many outcomes down to just two values.

**Example:** Suppose we decide to study the relationship between the cumulative number of cigarettes a person has smoked since the date they started smoking and the presence of lung cancer. We define our sample space to be \(\mathcal{G} = \{ (x,y) \,|\, x \in \mathbb{Z}, y \in \{0,1\} \}\). We define two random variables: a random variable \(X\) that maps the outcome \((x,y)\) to the value in the first position, \(x\), and a random variable \(Y\) that maps the outcome \((x,y)\) to the value in the second position, \(y\). Though our outcomes are linked, we can use random variables to think about two separate quantities—cigarettes smoked and lung cancer—and how they interact.

A new sample space#

When we build a random variable \(X\) that maps outcomes to values on the number line, we create a new sample space which we will call the support of \(X\), or \(supp(X)\). Define a sample space \(\mathcal{G}\) with outcomes \(o_{i}\). Then the **support of \(X\)** is

(58)#\[\begin{align} supp(X) = \{x | X(o) = x \text{ for some outcome } o \text{ in } \mathcal{G} \} \end{align}\]

Our new sample space is the set of all the potential values that our random variable \(X\) can produce. This is a sample space linked to \(\mathcal{G}\), but in practice after we develop a random variable we often no longer reference \(\mathcal{G}\).

**Example:** In our example above where \(\mathcal{G} = \{a,b\}\), the random variable \(X\) has support \(supp(X) = \{-1,1\}\) and \(Y\) has support \(supp(Y)=\{0,1\}\). Let's look at another example, where \(\mathcal{G}\) is the set of all integers from 0 to 1000. Even though the sample space is quite large, the random variable that maps the integers to 0 when they are odd and to 1 when they are even has a small support, \(supp(Y) = \{0,1\}\).
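As a quick illustration, here is a minimal Python sketch of this mapping (the even/odd rule and variable names follow the example above; the code itself is not part of the formal definition):

```python
# Sample space: the integers 0 through 1000.
sample_space = range(0, 1001)

# Random variable Y: odd outcomes map to 0, even outcomes map to 1.
def Y(outcome):
    return 1 if outcome % 2 == 0 else 0

# The support of Y is the set of all values Y can produce.
support_Y = {Y(o) for o in sample_space}
print(support_Y)  # {0, 1}
```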

How to assign probabilities to a random variable#

Random variables themselves do not require that we include the probability of each of their values. Random variables are a function from outcomes to the real numbers—nothing more. That said, in practice we build random variables expecting that the probabilities we assign to outcomes in our sample space will correspond to probabilities assigned to values of our random variable.

We assign to the value \(x\), which belongs to the support of the random variable \(X\), a probability equal to the sum of the probabilities of all the outcomes that \(X\) maps to \(x\):

(59)#\[\begin{equation} P(X=x) = P(o_{1}) + P(o_{2}) + \cdots + P(o_{n}) \end{equation}\]

where each outcome \(o_{1},o_{2},\cdots,o_{n}\) is mapped by \(X\) to the value \(x\). In other words, \(X(o)=x\) for each of \(o_{1},o_{2},\cdots,o_{n}\).

**Example:** Define a sample space \(\mathcal{S} = \{a,b,c,d,e\}\) and a random variable \(X\) that maps the outcomes to the following values.

| Outcome | P(outcome) | X(outcome) |
|---------|------------|------------|
| a       | 0.1        | 0          |
| b       | 0.25       | 1          |
| c       | 0.15       | 1          |
| d       | 0.3        | 2          |
| e       | 0.2        | 0          |

We assign the probability that \(X=0\) to be the sum of the probabilities assigned to outcome \(a\) and outcome \(e\), or

(60)#\[\begin{align} P(X=0) &= P(\{a\}) + P(\{e\})\\ &= 0.1+0.2 = 0.3 \end{align}\]

We can run the same procedure for all the elements in the support of \(X\),

(61)#\[\begin{align} P(X=1) &= P(\{b\}) + P(\{c\})\\ &= 0.25+0.15 = 0.40\\ P(X=2) &= P(\{d\}) = 0.30, \end{align}\]

and organize our work in a table

| x | P(X = x) |
|---|----------|
| 0 | 0.30     |
| 1 | 0.40     |
| 2 | 0.30     |
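A minimal Python sketch of the same bookkeeping, assuming we store the outcome probabilities and the map \(X\) as dictionaries:

```python
# Probabilities assigned to each outcome in S = {a, b, c, d, e}.
outcome_probs = {"a": 0.1, "b": 0.25, "c": 0.15, "d": 0.3, "e": 0.2}

# The random variable X as a map from outcomes to values on the number line.
X = {"a": 0, "b": 1, "c": 1, "d": 2, "e": 0}

# P(X = x): sum the probabilities of all outcomes that X maps to x.
dist = {}
for outcome, prob in outcome_probs.items():
    x = X[outcome]
    dist[x] = dist.get(x, 0.0) + prob

print(dist)  # {0: 0.30, 1: 0.40, 2: 0.30}, up to floating-point rounding
```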

A **probability distribution** for a random variable \(X\) is a set of tuples where the first position in each tuple is a value in the support of \(X\) and the second position is the corresponding probability assigned to that value.

**Example:** A probability distribution for the random variable \(X\) above is \(\{(0,0.30),(1,0.40),(2,0.30)\}\).

**Example:** Imagine we run an experiment that collects data on marathon runners. We decide to record the number of elapsed minutes until each runner finishes the race. Our sample space is defined as all positive integers, \(\mathcal{S} = \{1,2,3,\cdots\}\). We may decide to build a random variable \(X\) that maps outcomes of 60 or fewer to the value 1, outcomes from 61 to 120 to the value 2, and outcomes greater than 120 to the value 3. One potential probability distribution for \(X\) is \(\{(1,0.10),(2,0.50),(3,0.40)\}\). For this probability distribution, \(P(X=1) = 0.10\), \(P(X=2) = 0.50\), and \(P(X=3) = 0.40\).


Probability mass function#

There are several supportive tools that we can use to help us better understand the random variables we create. The first is the probability mass function, or p.m.f. The **probability mass function** is a *function* that maps values in the support of a random variable \(X\) to their corresponding probabilities. Inputs are values of \(X\); outputs are probabilities.

The probability mass function is a convenient way to organize a probability distribution and it allows us to transfer all the information we know about functions to random variables.

**Example:** Define a random variable \(Y\) with support \(\{-1,0,1\}\), probability distribution \(\{(-1,0.2),(0,0.5),(1,0.3)\}\), and probability mass function

(62)#\[\begin{equation} f(y) = \begin{cases} 0.2 & \text{ when } y=-1\\ 0.5 & \text{ when } y = 0\\ 0.3 & \text{ when } y=1 \end{cases} \end{equation}\]

The function—our probability mass function—is a type of function called a **piecewise** function.

We can ask our p.m.f. to return the probability for a given value

(63)#\[\begin{equation} f(1) = 0.3 \end{equation}\]

and we can visualize our probability mass function using, for example, a barplot.


Figure 1: A barplot for visualizing the probability mass function of our random variable \(Y\). The support of \(Y\) is plotted on the horizontal axis, and the height of each bar corresponds to the probability assigned to that value in the support.
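Below is a minimal sketch that encodes this p.m.f. as a Python function and draws a barplot of the kind shown above (it assumes the matplotlib library is available):

```python
import matplotlib.pyplot as plt

# The probability mass function of Y, stored as a value -> probability mapping.
pmf = {-1: 0.2, 0: 0.5, 1: 0.3}

def f(y):
    """Return the probability that Y equals y."""
    return pmf[y]

print(f(1))  # 0.3

# Barplot: support on the horizontal axis, probability as bar height.
plt.bar(list(pmf.keys()), list(pmf.values()))
plt.xlabel("y")
plt.ylabel("P(Y = y)")
plt.xticks(list(pmf.keys()))
plt.show()
```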

**Distributed as \(f\):** Because we can use the probability mass function to describe the probability distribution of a random variable, we will often write

(64)#\[\begin{equation} Y \sim f \end{equation}\]

The above formula is read “the random variable \(Y\) is distributed as \(f\)”, and what we mean when a random variable is distributed as \(f\) is that the support of \(Y\) is the same as the domain of the function \(f\) and that the probability of a value \(y\) is equal to \(f(y)\), or

(65)#\[\begin{align} supp(Y) &= dom(f)\\ P(Y = y) &= f(y) \end{align}\]

The probability mass function is a convenient method for assigning probabilities to random variables and visualizing the distribution of a random variable.

Cumulative mass function#

The **cumulative mass function** is a *function* that maps values in the support of a random variable \(X\) to the probability that the random variable is less than or equal to this value, or \(P(X \leq x)\).

We use a capital \(F\) to denote a cumulative mass function (c.m.f.). The c.m.f. corresponding to a random variable \(X\) has a domain equal to the support of \(X\) and produces values between 0 and 1 (the set of values a function can produce is called the function's **image**).

(66)#\[\begin{align} supp(X) &= dom(F)\\ image(F) &= [0,1]\\ P(X \leq x) &= F(x) \end{align}\]

The c.m.f., too, can be visualized, and we can also use the c.m.f. to describe the probability distribution of a random variable. This is because we can use the c.m.f. to derive the p.m.f.


Figure 2: A barplot for visualizing the cumulative mass function of our random variable \(Y\). The support of \(Y\) is plotted on the horizontal axis, and the height of each bar at a value \(v\) corresponds to the probability assigned to all values in the support less than or equal to \(v\).

**Example:** For a random variable

(67)#\[\begin{align} X &\sim f\\ supp(X) &= \{0,1,2,3\}\\ f(x) &= \begin{cases} 0.1 & \text{ when } x=0\\ 0.3 & \text{ when } x=1\\ 0.2 & \text{ when } x=2\\ 0.4 & \text{ when } x=3\\ \end{cases} \end{align}\]

The c.m.f. is then

(68)#\[\begin{align} F(x) = \begin{cases} 0.1 & \text{ when } x=0\\ 0.3 + 0.1 = 0.4 & \text{ when } x=1\\ 0.2+0.3+0.1 = 0.6 & \text{ when } x=2\\ 0.4+0.2+0.3+0.1 = 1 & \text{ when } x=3\\ \end{cases} \end{align}\]

We can use the c.m.f. \(F(x)\) to compute the p.m.f. \(f(x)\) too, by noticing that for a support \(\{x_{0},x_{1},x_{2},x_{3},\cdots,x_{n-1},x_{n}\}\) whose values are ordered from smallest to largest,

(69)#\[\begin{align} f(x_{i}) &= \left[f(x_{i}) + f(x_{i-1}) + \cdots f(x_{0}) \right] - \left[f(x_{i-1}) + \cdots f(x_{0})\right] \\ &=F(x_{i}) - F(x_{i-1}). \end{align}\]

Because the p.m.f. and c.m.f. equivalently describe the probability distribution of a random variable, we can write \(X\sim F\) or \(X \sim f\).
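A short Python sketch of both directions, using the p.m.f. from the example above: a running sum over the ordered support gives \(F\), and successive differences of \(F\) recover \(f\).

```python
# p.m.f. of X from the example, keyed by the values in the support.
f = {0: 0.1, 1: 0.3, 2: 0.2, 3: 0.4}

# Build the c.m.f.: F(x) is the running sum of f over values <= x.
F = {}
running_total = 0.0
for x in sorted(f):
    running_total += f[x]
    F[x] = running_total
print(F)  # {0: 0.1, 1: 0.4, 2: 0.6, 3: 1.0}, up to floating-point rounding

# Recover the p.m.f. from the c.m.f.: f(x_i) = F(x_i) - F(x_{i-1}).
xs = sorted(F)
f_recovered = {xs[0]: F[xs[0]]}
for prev, curr in zip(xs, xs[1:]):
    f_recovered[curr] = F[curr] - F[prev]
print(f_recovered)  # matches f, up to floating-point rounding
```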

Functionals of a random variable#

There are times that we may wish to summarize the behavior of a random variable. One common way to describe how probability is distributed among values in the support of a random variable is by computing some function of that random variable.

Expectation#

Suppose we build a random variable \(X\) with a corresponding probability mass function \(f_{X}\).

The **expected value** of a random variable \(X\) is computed as

(70)#\[\begin{align} \mathbb{E}\left(X \right) &= P(X=x_{1})x_{1} +P(X=x_{2})x_{2} + \cdots + P(X=x_{n})x_{n}\\ &= f(x_{1})x_{1} + f(x_{2})x_{2} + \cdots + f(x_{n})x_{n} \end{align}\]

where \(x_{1},x_{2},\cdots,x_{n}\) are all values in the \(supp(X)\).

An intuitive definition of the expected value is that \(\mathbb{E}(X)\) is a weighted average of all values in the support of \(X\), where the weight for \(x_{i}\) is the probability of \(x_{i}\). The expected value of \(X\) will be pulled toward values in \(supp(X)\) that carry high probability.

**Example:** Build a random variable \(Y\) with support \(supp(Y) = \{-1,0,1\}\) and \(f_{Y} = \{(-1,0.2),(0,0.5),(1,0.3)\}\). The expected value of \(Y\) is \(\mathbb{E}(Y) = 0.2 (-1) + 0.5 (0) + 0.3 (1) = 0.1\).
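The same computation as a short Python sketch:

```python
# p.m.f. of Y from the example: value -> probability.
f_Y = {-1: 0.2, 0: 0.5, 1: 0.3}

# Expected value: probability-weighted sum over the support.
E_Y = sum(prob * y for y, prob in f_Y.items())
print(E_Y)  # approximately 0.1
```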

Properties of the expectation#

The expectation is a linear function; that is, \(\mathbb{E}(aY + b) = a\mathbb{E}(Y) + b\). We can show this by defining a random variable \(Z = aY+b\) and computing

(71)#\[\begin{align} \mathbb{E}(Z) &= z_{1} f_{Z}(z_{1}) + z_{2} f_{Z}(z_{2}) + \cdots z_{n} f_{Z}(z_{n})\\ &= (ay_{1}+b) f_{Z}(z_{1}) + (ay_{2}+b) f_{Z}(z_{2})+ \cdots (ay_{n}+b) f_{Z}(z_{n})\\ & \begin{aligned} &= a \left[ y_{1}f_{Z}(z_{1}) + y_{2}f_{Z}(z_{2}) + \cdots y_{n}f_{Z}(z_{n}) \right] \\ &+ b\left[f_{Z}(z_{1}) + f_{Z}(z_{2}) + \cdots + f_{Z}(z_{n}) \right] \end{aligned} \\ &= a \left[ y_{1}f_{Z}(z_{1}) + y_{2}f_{Z}(z_{2}) + \cdots y_{n}f_{Z}(z_{n}) \right] + b \hspace{2mm} \text{(why?)} \\ &= a \left( y_{1}f_{Y}(y_{1}) + y_{2}f_{Y}(y_{2}) + \cdots y_{n}f_{Y}(y_{n}) \right) + b \\ &= a \mathbb{E}(Y) + b \end{align}\]

The second-to-last step deserves some attention. Values \(z_{i}\) are equal to \(ay_{i} + b\); they are mapped from the values \(y_{i}\), and so the probability that \(Z\) equals \(z_{i}\) is equivalent to the probability that \(Y\) equals \(y_{i}\), which means \(f_{Z}(z_{i}) = f_{Y}(y_{i})\).
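We can also check linearity numerically. The sketch below uses the random variable \(Y\) from the expectation example and arbitrary constants \(a\) and \(b\) of our own choosing:

```python
# p.m.f. of Y and arbitrary constants for Z = a*Y + b.
f_Y = {-1: 0.2, 0: 0.5, 1: 0.3}
a, b = 3.0, 2.0

E_Y = sum(prob * y for y, prob in f_Y.items())

# E(Z) computed directly: Z takes the value a*y + b with the same probability as Y = y.
E_Z = sum(prob * (a * y + b) for y, prob in f_Y.items())

print(E_Z, a * E_Y + b)  # both approximately 2.3
```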

Second moment and variance#

Define a random variable \(Y\) with \(supp(Y) = \{y_{1},y_{2},\cdots,y_{n}\}\); then the **variance** of \(Y\) is the following function of \(Y\)

(72)#\[\begin{align} \begin{aligned} V(Y) = \left[y_{1} - \mathbb{E}(Y) \right]^{2} P(Y=y_{1}) + \left[y_{2} - \mathbb{E}(Y) \right]^{2} P(Y=y_{2}) +\\ \cdots + \left[y_{n} - \mathbb{E}(Y) \right]^{2} P(Y=y_{n}) \end{aligned} \end{align}\]

The variance can be thought of as the squared distance of each value in the support of the random variable \(Y\) from the expected value, weighted by the probability of each value. In other words, the variance measures the expected squared distance of \(Y\) from its expected value.

**Example:** Define a random variable \(Z\) with probability mass function

(73)#\[\begin{align} f_{Z}(z) = \begin{cases} 0.14 & \text{ if } z=0\\ 0.39 & \text{ if } z=1\\ 0.21 & \text{ if } z=2\\ 0.26 & \text{ if } z=4\\ \end{cases} \end{align}\]

To compute \(V(Z)\) we need to first compute the expected value of \(Z\) or \(\mathbb{E}(Z)\):

(74)#\[\begin{align} \mathbb{E}(Z) &= f_{Z}(0) \cdot 0 + f_{Z}(1) \cdot 1 + f_{Z}(2) \cdot 2 + f_{Z}(4) \cdot 4\\ &= 0.14 \cdot 0 + 0.39 \cdot 1 + 0.21 \cdot 2 + 0.26 \cdot 4\\ &= 1.85 \end{align}\]

Now we can compute the variance

(75)#\[\begin{align} &\begin{aligned} V(Z) = (0-1.85)^{2} \cdot f_{Z}(0) + (1-1.85)^{2} \cdot f_{Z}(1) +\\ (2-1.85)^{2} \cdot f_{Z}(2) + (4-1.85)^{2} \cdot f_{Z}(4) \end{aligned} \\ &\begin{aligned} &=(0-1.85)^{2} \cdot 0.14 + (1-1.85)^{2} \cdot 0.39 +\\ &(2-1.85)^{2} \cdot 0.21 + (4-1.85)^{2} \cdot 0.26 \end{aligned} \\ &= 1.97 \end{align}\]
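The same calculation in Python:

```python
# p.m.f. of Z from the example: value -> probability.
f_Z = {0: 0.14, 1: 0.39, 2: 0.21, 4: 0.26}

# Expected value of Z.
E_Z = sum(prob * z for z, prob in f_Z.items())

# Variance: probability-weighted squared distance from the expected value.
V_Z = sum(prob * (z - E_Z) ** 2 for z, prob in f_Z.items())

print(round(E_Z, 2))  # 1.85
print(round(V_Z, 2))  # 1.97
```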

Just in time summation#

Summing a sequence of values is performed so frequently in mathematics and statistics that we have developed a special notation that simplifies sums.

Given a sequence of values \(x_{1}, x_{2}, x_{3}, \cdots, x_{n}\) that we wish to sum, we define the following operator to represent that sum:

(76)#\[\begin{align} \sum_{i=1}^{n} x_{i} = x_{1} + x_{2} + \cdots + x_{n} \end{align}\]
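In code, this operator corresponds to adding terms inside a loop; in Python the built-in `sum` does exactly this. A small sketch with an arbitrary sequence:

```python
# A sequence x_1, ..., x_n stored as a list.
x = [2, 5, 1, 7, 3]

# Summation notation says: add the terms one at a time...
total = 0
for x_i in x:
    total += x_i

# ...which is exactly what the built-in sum computes.
print(total, sum(x))  # 18 18
```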

How the expected value earned its name#

Markov inequality#

Suppose we define a random variable \(X\) whose support contains only non-negative numbers, and let \(a > 0\). The Markov inequality states that

(77)#\[\begin{align} P(X > a) < \frac{ \mathbb{E}(X) }{a} \end{align}\]

To see why, expand the expected value and split the sum at \(a\):

(78)#\[\begin{align} \mathbb{E}(X) &= \sum_{ x_{i} \in supp(X)} x_{i} f_{X}(x_{i}) \\ \mathbb{E}(X) &= \sum_{ x_{i} \leq a } x_{i} f_{X}(x_{i}) + \sum_{ x_{i} > a } x_{i} f_{X}(x_{i}) \\ \mathbb{E}(X) &> \sum_{ x_{i} > a } x_{i} f_{X}(x_{i}) \\ \mathbb{E}(X) &> \sum_{ x_{i} > a } a f_{X}(x_{i}) \\ \mathbb{E}(X) &> a \sum_{ x_{i} > a } f_{X}(x_{i}) \\ \mathbb{E}(X) &> a P(X > a) \\ \frac{\mathbb{E}(X)}{a} &> P(X > a) \\ \end{align}\]
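A quick numerical illustration of the bound, using an arbitrary p.m.f. on non-negative values (any such choice would do):

```python
# An arbitrary p.m.f. on non-negative values.
f_X = {0: 0.3, 1: 0.4, 2: 0.2, 5: 0.1}

# Expected value of X.
E_X = sum(prob * x for x, prob in f_X.items())

# Compare P(X > a) against E(X) / a for some threshold a.
a = 2.0
P_greater = sum(prob for x, prob in f_X.items() if x > a)

print(P_greater, E_X / a)  # 0.1 and 0.65: the bound holds
```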

Chebychev’s inequality#

Covariance#

Correlation#

Exercises#

  1. Suppose \(\mathcal{S} = \{ a,b,c \}\) and \(P(\{a\})=0.2\), \(P(\{b\})=0.3\), \(P(\{c\})=0.5\). Define a random variable \(X\) such that \(X(a)=1\), \(X(b)=1\), and \(X(c)=0\). Define a second random variable \(Y\) such that \(Y(a)=0\), \(Y(b) = 1\), \(Y(c)=2\).

    1. Compute \(P(X=1)\)

    2. Compute \(P(X=0)\)

    3. What is \(supp(X)\) ?

    4. What new sample space does \(X\) generate?

    5. Compute \(P(Y=1)\)

    6. Compute \(P(Y=0)\)

    7. What is \(supp(Y)\) ?

    8. What new sample space does \(Y\) generate?

  2. Let \(\mathcal{S} = \{1,2,3,4,5,6,7,8,9,10,11\}\), the set of positive integers from 1 to 11. Further define a random variable \(K\) with the following probability mass function

    \[\begin{align*} f(k) = \begin{cases} \left( \frac{1}{2} \right)^{k} & \text{when } k \leq 10\\ 0.009 & \text{when } k = 11 \end{cases} \end{align*} \]
    1. Is the pmf \(f\) a valid probability distribution? Why or why not?

    2. What value of \(K\) is assigned the highest probability?

    3. Please define the cumulative mass function for the random variable \(K\)

  3. Define a random variable \(Y\) with \(supp(Y) = \{-3,-2,-1,0,1,2,3\}\) and cumulative mass function

    \[\begin{align*} F(y) = \begin{cases} 0.10 & \text{ when } y = -3\\ 0.24 & \text{ when } y = -2\\ 0.36 & \text{ when } y = -1\\ 0.50 & \text{ when } y = 0\\ 0.67 & \text{ when } y = 1\\ 0.78 & \text{ when } y = 2\\ 1.00 & \text{ when } y = 3\\ \end{cases} \end{align*} \]
    1. What is \(P(Y \leq -1)\)

    2. What is \(P(Y=-1)\)

    3. What is \(P(Y= 1)\)

    4. Please define the p.m.f for the random variable \(Y\).

    5. Graph the c.m.f

    6. Graph the p.m.f

  4. Compute \(\sum_{i=-5}^{i=5} i^{2}/2\)

  5. Show that \(V(X)\) simplifies to \(E(X^{2}) - \left[E(X)\right]^{2}\). Hint: Write down the definition of variance using the expectation, then expand the squared terms and simplify.

  6. Define the random variable \(B\) with pmf

    (79)#\[\begin{align} f_{B}(b) = \begin{cases} 0.52 &\text{ if } b=0\\ 0.12 &\text{ if } b=1\\ 0.34 &\text{ if } b=2 \end{cases} \end{align} \]
    1. Compute \(\mathbb{E}(B)\)

    2. Compute \(V(B)\)

    3. Use Chebychev’s inequality to make a statement about \(P(B \geq b)\)