Chapter 01: Exploratory Data Analysis Part I#

Exploratory vs Confirmatory data analysis#

Exploratory data analysis (EDA) is the study of two things: the potential of data to answer hypotheses, and how to generate novel hypotheses by studying relationships among the collected information. This is in contrast to confirmatory data analysis. Confirmatory data analysis techniques (hypothesis tests, confidence intervals, statistical estimators) are methods by which we seek to understand all possible experiments by observing a smaller number of experiments.

Both analyses are important. We will start with EDA.

Definition of an Experiment#

An experiment is any process that generates an observation. An observation can be measured and this measurement, or data, can be classified into one of several types.

Data Types#

Observations can be classified into: names, grades, ranks, counted fractions and percentages, counts, amounts, and balances.
This classification system is called the Mosteller-Tukey classification system. Other systems exist; Stevens' typology, which classifies data as nominal, ordinal, interval, and ratio, is another popular choice.

Amounts and Counts:#

Both amounts and counts are non-negative values without a necessary upper bound. Amounts are real-valued and counts are integers.

Balances:#

Balances are observations that can be positive or negative real values.

Counted Fractions and Percentages:#

Counted fractions are values \(x/y\) where \(x\) and \(y\) are non-negative integers with \(x \leq y\) and \(y > 0\). Percentages are any real value between 0 and 100. Importantly, both counted fractions and percentages are bounded: fractions between 0 and 1, and percentages between 0 and 100.
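For example (a small sketch with made-up numbers, not taken from any dataset in this chapter), a counted fraction and its corresponding percentage can be computed directly:

x = 3                         #--a hypothetical count of successes
y = 10                        #--a hypothetical total count
fraction   = x / y            #--a counted fraction, bounded between 0 and 1
percentage = 100 * fraction   #--the corresponding percentage, bounded between 0 and 100
print(fraction, percentage)   #--prints 0.3 30.0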

Grades, Ranks, and Names#

The Mosteller-Tukey classification system also includes Names, Grades, and Ranks. We are less likely to use these data classifications in this course, but they should still be briefly discussed.

Data that can be ‘graded’ can be assigned one of a set of levels. These levels are ordered such that when you compare two graded observations you can say that one observation is ‘higher’ or ‘larger’ than the other. A set of \(N\) observations can be ranked if each observation can be assigned an integer from 1 to \(N\) such that an observation with rank \(x\) is larger than an observation with rank \(y\) if \(x>y\). Named data cannot be ordered or compared in this way.
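A minimal sketch of these three classifications, using made-up observations (the labels and values below are purely illustrative):

#--Graded data: levels with an agreed ordering (e.g., letter grades, where "A" is higher than "B")
grades = ["A", "C", "B"]

#--Ranked data: each of N observations is assigned an integer from 1 to N
values = [4.2, 1.5, 3.3]
order  = sorted(range(len(values)), key=lambda i: values[i])  #--indices ordered from smallest to largest value
ranks  = [0] * len(values)
for r, i in enumerate(order, start=1):
    ranks[i] = r              #--ranks becomes [3, 1, 2]: a larger value receives a larger rank

#--Named data: labels with no ordering
names = ["Philadelphia", "Chicago", "Denver"]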

Descriptive Statistics#

The goal of descriptive statistics is to characterize a set of observations, or pairs of observations, with a smaller set of numbers. Though there is a loss of information in describing a set of \(N\) observations by a smaller set of \(n<N\) numbers, descriptive statistics can help us quickly understand the data that we have collected and what we can do with it. Common descriptive statistics are the mean, median, and standard deviation.

Measures of Central Tendency#

A measure of central tendency aims to use a single value to describe where observations are most probable. For counted data, a natural measure of central tendency is the mode. Given a dataset of \(n\) observations, \(\mathcal{D} = (d_{1}, d_{2}, \cdots, d_{n})\), the mode is the most frequently occurring value.
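Before turning to library functions, a minimal by-hand computation of the mode (a sketch using Python's built-in collections module and a small made-up dataset) might look like this:

from collections import Counter

D = [0, 1, 1, 2, 1, 0, 3]                  #--a small, made-up dataset of counts
counts = Counter(D)                        #--maps each value to how often it appears
mode_value, mode_count = counts.most_common(1)[0]
print(mode_value, mode_count)              #--prints 1 3: the value 1 appears 3 times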


Example: Crowd accidents A dataset of crowd accidents, defined as situations where mass gatherings of people lead to deaths or injuries, was collected by the authors of a paper and published online (the dataset is linked in the code below). The dataset is stored such that each row is an observation of a crowd accident. Each observation has several pieces of information.

We will focus on the counted data “Number of Fatalities”. This is the recorded number of deaths that occurred during the crowd accident.

#--Import pandas. This will allow us to download the dataset from the internet. 
#--More on pandas in the following lectures.
import pandas as pd 
d = pd.read_csv("https://zenodo.org/records/7523480/files/accident_data_numeric.csv?download=1")

#--the dataset is called d. 
#--The number of fatalities for each crowd incident is stored in a column called d["Fatalities"]

#--Scipy has a function called mode that will compute the mode for us. 
from scipy.stats import mode
fatal_mode = mode(d["Fatalities"])

#--Print the output
print("The most frequently occuring number of fatalities for which a crowd accident has occured is below ")
print(fatal_mode)
The most frequently occuring number of fatalities for which a crowd accident has occured is below 
ModeResult(mode=np.int64(0), count=np.int64(24))

The most frequent number of fatalities during a crowd accident is zero, and this number of fatalities was observed 24 times.


For Amount and Balance data, there are several additional measures of central tendency that may be useful. Given a dataset of \(n\) observations, \(\mathcal{D} = (d_{1}, d_{2}, \cdots, d_{n})\), the mean is a value such that the majority of observations are “close” to this value.

The mean is computed as

(1)#\[\begin{align} \bar{x} = n^{-1}\sum_{i=1}^{n} d_{i} \end{align}\]

There are two intuitive explanations of the mean. The first explanation suggests that the mean takes into account the frequency of each observation in the dataset.

Consider the following dataset:

(2)#\[\begin{align} \mathcal{D} = (1,2,3,5,3,2,1,7,4,1,7,7,1,2) \end{align}\]

using the above formula for the mean, we would compute

(3)#\[\begin{align} \bar{x} = \frac{1+2+3+5+3+2+1+7+4+1+7+7+1+2}{14} \end{align}\]

However, addition is commutative; we would arrive at the same mean if we grouped all equal values next to one another.

(4)#\[\begin{align} \bar{x} &= \frac{1+1+1+1 + 2+2+2 + 3+3 + 4 + 5 + 7 + 7 + 7}{14} \end{align}\]

We can rewrite the above sum as the frequency with which each number appears times its value, or

(5)#\[\begin{align} \bar{x} &= \frac{ 4 \cdot 1 + 3 \cdot 2 + 2 \cdot 3 + 1 \cdot 4 + 1 \cdot 5 + 3 \cdot 7 }{14} \end{align}\]

and finally we can distribute the denominator

(6)#\[\begin{align} \bar{x} &= \frac{4}{14} 1 + \frac{3}{14} 2 + \frac{2}{14} 3 + \frac{1}{14} 4 + \frac{1}{14} 5 + \frac{3}{14} 7 \\ \end{align}\]

We see that the mean weights each observation by the proportion of times that it appears in the dataset.
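We can verify this with a short computation on the dataset in equation (2): the direct formula (1) and the frequency-weighted form (6) give the same value (a sketch using only built-in Python and collections.Counter).

from collections import Counter

D = [1,2,3,5,3,2,1,7,4,1,7,7,1,2]
n = len(D)

direct_mean   = sum(D) / n                                                       #--formula (1)
weighted_mean = sum((count / n) * value for value, count in Counter(D).items())  #--formula (6)

print(direct_mean)    #--3.2857...
print(weighted_mean)  #--the same value, up to floating point rounding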

The second intuitive explanation is that the mean is the “middle” value. Consider the dataset containing two observations,

(7)#\[\begin{align} \mathcal{D} = (1,7), \end{align}\]

then a natural “middle” value would be halfway between 1 and 7, or \( 1 + (7-1)/2 = 4\).


Example: Crowd accidents continued Rather than treating the number of fatalities as counted data, we could consider it an amount. In that case, we could compute the average number of fatalities over all crowd accidents in five-year increments from 1975 through 2019. The authors of this publication ran this exploratory data analysis.

import pandas as pd 
d = pd.read_csv("https://zenodo.org/records/7523480/files/accident_data_numeric.csv?download=1")

import numpy as np #--The numpy package (often abbreviated np) has a function to compute the mean. 
mean_fatalities = np.mean(d["Fatalities"])                     #--this is how to compute the mean

print("The average number of fatalties among all crowd accidents is below")
print(mean_fatalities)
The average number of fatalties among all crowd accidents is below
49.07829181494662

Note an important point about the mean: the mean does not need to be one of the values in the dataset. In this case, we see that the mean is the decimal value 49.08 (not a counted value).
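Returning to the five-year increments mentioned at the start of this example, the per-period averages could be sketched as follows. This assumes the dataset contains a column with the year of each accident; "Year" is a hypothetical column name and may differ in the actual file.

#--A sketch of the per-period analysis; "Year" is an assumed column name
import pandas as pd
d = pd.read_csv("https://zenodo.org/records/7523480/files/accident_data_numeric.csv?download=1")

five_year_bins = pd.cut(d["Year"], bins=range(1975, 2025, 5), right=False)  #--bins [1975, 1980), [1980, 1985), ...
mean_fatalities_by_period = d.groupby(five_year_bins)["Fatalities"].mean()
print(mean_fatalities_by_period)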


Another useful measure of central tendency for amount and balance data is the median. Given a dataset \(\mathcal{D}\), the median is the value \(v\) such that half of the observations are smaller than \(v\) and half of the values are larger than \(v\).


Example: Computing and comparing the median to the mean Consider the crowd accident dataset \(\mathcal{D}\) above. We can compute the median number of fatalities by sorting the values from lowest to highest, finding the middle-most value, and extracting that value. It is often easier to use a function in numpy, like this.

import pandas as pd 
d = pd.read_csv("https://zenodo.org/records/7523480/files/accident_data_numeric.csv?download=1")

import numpy as np #--The numpy package (often abbreviated np) has a function to compute the median. 
median_fatalities = np.median(d["Fatalities"])                     #--this is how to compute the median

print("The median number of fatalties among all crowd accidents is below")
print(median_fatalities)
The median number of fatalties among all crowd accidents is below
10.0
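The library call agrees with the by-hand procedure described above: sort the values and take the middle-most one, averaging the two middle values when the number of observations is even. A minimal sketch:

import pandas as pd
d = pd.read_csv("https://zenodo.org/records/7523480/files/accident_data_numeric.csv?download=1")

fatalities = sorted(d["Fatalities"])    #--sort from lowest to highest
n = len(fatalities)
if n % 2 == 1:
    by_hand_median = fatalities[n // 2]                                 #--odd n: the single middle value
else:
    by_hand_median = (fatalities[n // 2 - 1] + fatalities[n // 2]) / 2  #--even n: average the two middle values
print(by_hand_median)                   #--agrees with np.median above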

The median and the mean are different, and in general distinct measures of central tendency will not agree with one another. Note that the mean will often be pulled higher or lower than the median when there are extremely large or extremely small (i.e., large negative) values. This characteristic of the mean is called “sensitivity to outliers”.
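A small, made-up example makes this sensitivity concrete: adding a single extreme value moves the mean substantially while leaving the median nearly unchanged.

import numpy as np

small_dataset = [1, 2, 3, 4, 5]
with_outlier  = [1, 2, 3, 4, 5, 1000]    #--one extremely large value added

print(np.mean(small_dataset), np.median(small_dataset))  #--3.0 and 3.0
print(np.mean(with_outlier),  np.median(with_outlier))   #--roughly 169.2 and 3.5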


Before we continue, we should describe the weighted average and how the variance (as well as the mean from above) is a type of weighted average. Given a dataset \(\mathcal{D}\), the weighted average is computed as

(8)#\[\begin{align} \omega(\mathcal{D}) &= \frac{\sum_{i=1}^{n} w_{i}d_{i}}{\sum_{j=1}^{n}w_{j}} = \sum_{i=1}^{n} \left(\frac{w_{i}}{\sum_{j=1}^{n}w_{j}}\right) d_{i} \end{align}\]

For the mean above, the weights \(w_{1},w_{2},\cdots\) are all equal to one. The weighted average allows us to emphasize some data points more than others. This may be the case if we are more certain about some measured observations than about others.
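As a short illustration (a sketch with made-up data and weights), numpy's average function accepts a weights argument and implements formula (8):

import numpy as np

data    = [2.0, 4.0, 6.0]
weights = [1.0, 1.0, 2.0]        #--the last observation is emphasized twice as much as the others

unweighted = np.mean(data)                      #--equal weights: 4.0
weighted   = np.average(data, weights=weights)  #--(1*2 + 1*4 + 2*6)/(1 + 1 + 2) = 4.5
print(unweighted, weighted)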

Measures of Dispersion#

Measures of dispersion are meant to capture how observations in a dataset are either close to, or far from, one another. A set of observations that are spaced far apart from one another is said to be highly dispersed.

A common measure of dispersion for counted data is the range, accompanied by the min and max values. Given a dataset, \(\mathcal{D} = (d_{1},d_{2},\cdots,d_{n})\), the max is the largest observation, the min is the smallest observation, and the range equals the max minus the min. Personally, the author thinks that the range should always be accompanied by the min and max, but this is under debate.

Example: Min, max, and range of crowd accident fatalities
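A sketch using the crowd accident dataset from earlier in the chapter (the printed values are not shown here; they depend on the downloaded file):

import pandas as pd
import numpy as np
d = pd.read_csv("https://zenodo.org/records/7523480/files/accident_data_numeric.csv?download=1")

min_fatalities   = np.min(d["Fatalities"])
max_fatalities   = np.max(d["Fatalities"])
range_fatalities = max_fatalities - min_fatalities   #--the range equals the max minus the min

print("The min, max, and range of the number of fatalities are")
print(min_fatalities)
print(max_fatalities)
print(range_fatalities)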

A common measure of dispersion for amount and balance data is the standard deviation. The variance is the standard deviation squared and is computed as

(9)#\[\begin{align} v(\mathcal{D}) &= (N-1)^{-1} \sum_{i=1}^{N} \left( d_{i} - \bar{d} \right)^{2} \approx N^{-1} \sum_{i=1}^{N} \left( d_{i} - \bar{d} \right)^{2} \\ \end{align}\]

The variance could be understood by first transforming each data point and then computing the mean. In other words we can apply the function \(f(d) = \left( d - \bar{d} \right)^{2}\) to each data point \(d_{i}\) and compute

(10)#\[\begin{align} v(\mathcal{D}) &= (N-1)^{-1} \sum_{i=1}^{N} f(d_{i}) \approx \frac{\sum_{i=1}^{N} f(d_{i})}{N} \end{align}\]

For each data point we measure the distance from the average \(d_{i} - \bar{d}\), but because we are interested in dispersion either smaller or larger than the average we square this quantity. Squaring each \(d_{i} - \bar{d}\) means that the variance will be non-negative. Large values of the variance indicate large dispersion and vice-versa.

The variance has a problem. Suppose our experiment collects the length of dogs from nose to tail in the units centimeters (cm). Then the average is in units cm but the variance is in units \(\text{cm}^2\). To keep our summary of dispersion on the same scale, we can take the square root of the variance—this is the standard deviation.

(11)#\[\begin{align} s(\mathcal{D}) = \left[v(\mathcal{D})\right]^{1/2} &= \sqrt{(N-1)^{-1} \sum_{i=1}^{N} \left( d_{i} - \bar{d} \right)^{2} } \end{align}\]
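The transform-then-average description can be checked directly. The sketch below computes the variance and standard deviation by hand on a small made-up dataset and compares them to numpy, passing ddof=1 so that numpy uses the \(N-1\) denominator in formula (9).

import numpy as np

D    = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  #--a small, made-up dataset
dbar = np.mean(D)

squared_deviations = (D - dbar)**2                          #--apply f(d) = (d - dbar)^2 to each point
by_hand_variance   = np.sum(squared_deviations) / (len(D) - 1)
by_hand_std        = np.sqrt(by_hand_variance)

print(by_hand_variance, np.var(D, ddof=1))   #--both use the N-1 denominator
print(by_hand_std,      np.std(D, ddof=1))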

Another common measure of dispersion is the interquartile range. The interquartile range equals the 75th percentile minus the 25th percentile. Given a dataset \(\mathcal{D} = (d_{1},d_{2},\cdots,d_{n})\), the Xth percentile is the data point \(d_{i}\) such that X percent of the values are smaller than \(d_{i}\) and \(100-X\) percent of the data points are larger than \(d_{i}\). Like the range, the author feels that the interquartile range should be accompanied by the 25th and 75th percentiles that were used to generate this summary. Not everyone agrees.

Example: Spills of produced water by the Texas oil and gas industry from 2013 to 2022 Records of oil and gas spills reported to the Texas Railroad Commission were collected by Peter Aldhous (the dataset is linked in the code below). Each observation in the dataset is an incident and includes the number of gallons of oil spilled during the incident.

We can download this dataset into Python and compute the variance, standard deviation, and interquartile range using the numpy module.

#--Load and clean the data
import pandas as pd
import numpy as np #--numpy is imported here because np.isnan is used during cleaning below
d = pd.read_csv("https://raw.githubusercontent.com/InsideClimateNews/2023-10-tx-produced-water-spills/refs/heads/main/data/central_cleaned.csv")
amount_of_crude_oil = d["release_crude_oil_edit"].astype(float)
amount_of_crude_oil = amount_of_crude_oil[~np.isnan(amount_of_crude_oil)] #--Restrict to incidents with a recorded crude oil amount
amount_of_crude_oil = amount_of_crude_oil[amount_of_crude_oil>0]          #--Restrict to incidents where crude oil was spilled

#--Exploratory Data Analysis piece (This is the important part)
#--central tendency
mean_crude_oil   = np.mean(amount_of_crude_oil)
median_crude_oil = np.median(amount_of_crude_oil)

#--dispersion (note: np.var and np.std use the N denominator by default, matching the approximate form of formula (9))
variance_crude_oil = np.var(amount_of_crude_oil)
std_dev_crude_oil  = np.std(amount_of_crude_oil)

#--The interquartile range plus percentiles
_25th_percentile_crude_oil = np.percentile(amount_of_crude_oil, 25)
_75th_percentile_crude_oil = np.percentile(amount_of_crude_oil, 75)
IQR = _75th_percentile_crude_oil - _25th_percentile_crude_oil

print("The mean and standard deviation of the Gallons of Crude oil spilled are")
print(mean_crude_oil)
print(std_dev_crude_oil)

print("The median and interquartile range (with 25,75th percnetiles) of the Gallons of Crude oil spilled are")
print(median_crude_oil)
print(IQR)
print(_25th_percentile_crude_oil)
print(_75th_percentile_crude_oil)
The mean and standard deviation of the Gallons of Crude oil spilled are
1779.2016927184468
7282.019132458342
The median and interquartile range (with 25th and 75th percentiles) of the Gallons of Crude oil spilled are
630.0
1188.6
281.40000000000003
1470.0