Homework 01#

Problem one#

For the “crowd accident” dataset used here, the code below stratifies observed fatalities from crowd accidents into two groups: before the year 2000 (called fatalities_before2000) and in or after the year 2000 (called fatalities_after2000).

The code below downloads the dataset and builds these two series; it does not need to be modified.

```python
import pandas as pd

# download the crowd accident dataset from Zenodo
d = pd.read_csv("https://zenodo.org/records/7523480/files/accident_data_numeric.csv?download=1")

# stratify by year
d_before2000 = d.loc[d.Year < 2000]
d_after2000  = d.loc[d.Year >= 2000]

fatalities_before2000 = d_before2000["Fatalities"]
fatalities_after2000  = d_after2000["Fatalities"]
```
  1. Please compute the mean, median, and standard deviation for the number of fatalities before and after the year 2000. What do these summary metrics tell you about the lethality of crowd accidents over time?
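As a reminder of how these pandas summary methods behave, here is a sketch on a toy series (the values are illustrative only, not from the crowd accident dataset):

```python
import pandas as pd

# toy data, not from the crowd accident dataset
toy = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

print(toy.mean())    # arithmetic mean -> 5.0
print(toy.median())  # average of the two middle values here -> 4.5
print(toy.std())     # sample standard deviation (pandas uses ddof=1 by default)
```

The same `.mean()`, `.median()`, and `.std()` methods apply directly to the fatalities series above.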

Problem two#

Given a dataset \(\mathcal{D}\), the median absolute deviation (MAD) is defined as the median of each individual datapoint's absolute difference from the median

(12)#\[\begin{align} \text{MAD}(\mathcal{D}) = \text{median}( |d_{i}-\text{median}(\mathcal{D})| ) \end{align}\]

where \(\text{median}(A)\) is the median of the set of datapoints contained in \(A\), and \(|\cdot|\) is the absolute value function. The absolute value can be computed in python with numpy.abs.
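For instance, applying definition (12) to a small toy list (values chosen only for illustration), numpy reduces the computation to two nested calls:

```python
import numpy as np

toy = np.array([1, 2, 3, 4, 100])          # toy data with one large outlier
deviations = np.abs(toy - np.median(toy))  # |d_i - median(D)| for each datapoint
mad = np.median(deviations)                # median of those absolute deviations
print(mad)                                 # -> 1.0
```

Note that the outlier 100 barely affects the MAD, which is the point of a robust spread estimate.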

  1. Please write a python function that takes as input a list of values and outputs the MAD. This function can, and should, import numpy to make this computation easier. Use this function to compute the MAD for the number of fatalities before 2000 and the number of fatalities after 2000. What does the MAD tell you when compared to the standard deviation? Why might these values be different?

Problem three#

Recall that the variance is computed as

(13)#\[\begin{align} v(\mathcal{D}) \approx N^{-1} \sum_{i=1}^{N} \left( d_{i} - \bar{d} \right)^{2} \end{align}\]

Use algebra to simplify the variance computation.
Hint 1: expand the squared expression.
Hint 2: note that \(N\bar{d} = \sum_{i=1}^{N} d_{i}\).
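As a nudge in the direction of the first hint, the squared term inside the sum can be written as

\[\begin{align} \left( d_{i} - \bar{d} \right)^{2} = d_{i}^{2} - 2 d_{i}\bar{d} + \bar{d}^{2} \end{align}\]

and the sum in (13) then distributes across these three terms; the second hint lets you collapse the middle one.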

Problem four#

Robust estimators of the central tendency aim to return similar values for a dataset \(\mathcal{D}\) and for a second dataset that is \(\mathcal{D}\) plus some uncharacteristically large or small value. We saw one example already: the median. Let's look at two more.

The X% trimmed mean of a dataset \(\mathcal{D}\) first removes the largest and the smallest X% of values, and then computes the mean of this new dataset.

  1. Using the above crowd accident fatalities before and after 2000, compute the 5% trimmed mean. You can use the trim_mean function in scipy.stats to accomplish this. Compare the trimmed mean, mean, and median. Why are these values different?
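As a sketch of how scipy.stats.trim_mean behaves on toy data (note that its second argument is a proportion, so a 5% trimmed mean uses 0.05):

```python
import numpy as np
from scipy.stats import trim_mean

toy = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])  # toy data with one outlier

# proportiontocut=0.1 drops the single smallest and largest value from this list of 10
print(trim_mean(toy, 0.1))  # mean of 2..9 -> 5.5
print(toy.mean())           # -> 14.5, pulled up by the outlier
```

The gap between the two printed values illustrates why a trimmed mean is considered robust.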

Problem five#

The X% winsorized mean is computed with the following algorithm.

  1. Compute the Xth percentile and call this \(u\) for “upper”.
  2. Compute the (100-X)th percentile and call this \(l\) for “lower”.
  3. Replace any values in the dataset below \(l\) with the value \(l\).
  4. Replace any values in the dataset above \(u\) with the value \(u\).
  5. Compute the mean.
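Assuming numpy's percentile (with its default interpolation) is an acceptable way to carry out steps one and two, the algorithm can be sketched on toy data as follows; this is an illustration of the five steps, not a finished solution to the exercise below:

```python
import numpy as np

toy = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])  # toy data with one outlier
X = 90                                            # illustrative choice of X

u = np.percentile(toy, X)        # step one: upper cutoff
l = np.percentile(toy, 100 - X)  # step two: lower cutoff
clipped = np.clip(toy, l, u)     # steps three and four: replace values outside [l, u]
print(clipped.mean())            # step five -> approximately 6.4 for this toy list
```

np.clip performs steps three and four in one call by capping values at both cutoffs.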

Do not use an existing winsorize function from scipy to complete the following exercise.

Write a function that takes as input a list of data and the percentile X. The function should return the X% winsorized mean. Compute the 95% winsorized mean for fatalities before and after 2000 due to crowd accidents.