Homework 01

from IPython.display import display, HTML

display(HTML("""
<style>
.callout { padding:12px 14px; border-radius:10px; margin:10px 0; }
.callout.note { background:#eff6ff; border-left:6px solid #3b82f6; }
.callout.warn { background:#fff7ed; border-left:6px solid #f97316; }
.callout.good { background:#ecfdf5; border-left:6px solid #10b981; }
.callout.bad  { background:#fef2f2; border-left:6px solid #ef4444; }
</style>
"""))


display(HTML("""
<style>
/* Base callout */
.callout {
  padding: 12px 14px;
  border-radius: 10px;
  margin: 12px 0;
  line-height: 1.35;
  border-left: 6px solid;
  box-shadow: 0 1px 2px rgba(0,0,0,0.06);
}

/* NOTE variant */
.callout.note {
  background: #eff6ff;         /* light blue */
  border-left-color: #3b82f6;   /* blue */
}

/* Optional: title line inside */
.callout .title {
  font-weight: 700;
  margin-bottom: 6px;
}
</style>
"""))
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt

Problem one

For the “crowd accident” dataset linked below, the code stratifies observed fatalities from accidents into two groups: accidents before the year 2000 (called fatalities_before2000) and accidents in or after the year 2000 (called fatalities_after2000).

The code below downloads the data and builds these two lists; it does not need to be modified.

Please compute the mean, median, and standard deviation for the number of fatalities before and after the year 2000. What do these summary metrics tell you about the lethality of crowd accidents over time?

import pandas as pd 

#--download the crowd accident dataset
d = pd.read_csv("https://zenodo.org/records/7523480/files/accident_data_numeric.csv?download=1")

#--split the accidents into before 2000 and in/after 2000
d_before2000 = d.loc[d.Year<2000]
d_after2000  = d.loc[d.Year>=2000]

fatalities_before2000 =  d_before2000["Fatalities"]
fatalities_after2000  =  d_after2000["Fatalities"]
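A sketch of the kind of summary requested above, using numpy and a small made-up sample in place of the downloaded fatality counts (the sample values are illustrative, not from the dataset):

```python
import numpy as np

def summarize(values):
    """Return the mean, median, and standard deviation of a sequence."""
    arr = np.asarray(values, dtype=float)
    return {"mean":   float(np.mean(arr)),
            "median": float(np.median(arr)),
            "std":    float(np.std(arr))}

# Hypothetical sample standing in for fatalities_before2000 / fatalities_after2000
sample = [7, 13, 21, 66, 120]
stats  = summarize(sample)
```

The same function can be applied directly to fatalities_before2000 and fatalities_after2000, since pandas Series behave like numpy arrays.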

Problem two

Given a dataset \(\mathcal{D}\), the median absolute deviation (MAD) is defined as the median of each individual datapoint's absolute difference from the median

\[\begin{align} \text{MAD}(\mathcal{D}) = \text{median}( |d_{i}-\text{median}(\mathcal{D})| ) \end{align}\]

where \(\text{median}(A)\) is the median of the set of datapoints contained in \(A\), and \(|\cdot|\) is the absolute value function, which can be computed in Python with numpy.abs.

Please write a python function that takes as input a list of values and outputs the MAD. This function can, and should, import numpy to make this computation easier. Use this function to compute the MAD for the number of fatalities before 2000 and the number of fatalities after 2000. What does the MAD tell you when compared to the standard deviation? Why might these values be different?
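One way such a function might look, following the definition above (a sketch using numpy.median and numpy.abs; other implementations are possible):

```python
import numpy as np

def mad(values):
    """Median absolute deviation: median(|d_i - median(D)|)."""
    arr = np.asarray(values, dtype=float)
    return float(np.median(np.abs(arr - np.median(arr))))
```

For example, mad([1, 2, 3, 4, 5]) returns 1.0: the median is 3 and the absolute deviations are [2, 1, 0, 1, 2].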

Problem three

Recall that the variance is computed as

\[\begin{align} v(\mathcal{D}) \approx N^{-1} \sum_{i=1}^{N} \left( d_{i} - \bar{d} \right)^{2} \end{align}\]

(a) Use algebra to simplify the variance computation.

  1. Hint: expand the squared expression.

  2. Hint: Note that \(N\bar{d} = \sum_{i=1}^{N} d_{i}\)
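Following the first hint, expanding the square gives a partial worked step (the second hint then collapses the middle term; the final simplification is left to you):

```latex
\begin{align}
v(\mathcal{D}) &\approx N^{-1} \sum_{i=1}^{N} \left( d_{i}^{2} - 2 d_{i} \bar{d} + \bar{d}^{2} \right) \\
               &= N^{-1} \sum_{i=1}^{N} d_{i}^{2} - 2 \bar{d} \left( N^{-1} \sum_{i=1}^{N} d_{i} \right) + \bar{d}^{2}
\end{align}
```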

(b) Consider a vector of data points \(d = [d_{1}, d_{2}, \cdots, d_{n}]^{T}\). Write down an expression using the vector, inner product, and subtraction/division for the variance.

(c) Please show that

\[\begin{align} v([ a d_{1}, a d_{2}, \cdots, a d_{n} ]) = a^{2} v([ d_{1}, d_{2}, \cdots, d_{n} ]) \end{align}\]

for \(a\) a constant value.

(d) Suppose you collect a single dataset of paired observations: \(D = [ (x_{1}, y_{1}), (x_{2}, y_{2}), \cdots, (x_{n}, y_{n}) ]\). Similar to measures of central tendency and measures of dispersion, we want to compute a measure of association. The measure that we will choose is the covariance. The covariance (estimator) is defined as

\[\begin{align} Cov( X, Y ) = \frac{1}{N} \sum_{i=1}^{N} (x_{i} - \bar{x})(y_{i}-\bar{y}) \end{align}\]

where \(X\) corresponds to the dataset of all the x values (i.e., \([x_{1}, x_{2}, \cdots, x_{n}]\)) and \(Y\) corresponds to all the y values.

  • Show that \(Cov( X,X )\) = \(v(X)\)

  • Show that \(Cov( aX,Y )\) = \(a Cov(X,Y)\)

  • Show that \(Cov( X,aY )\) = \(a Cov(X,Y)\)

(e) Suppose we add a third dataset of \(z\) values. Then please show that

\[\begin{align} Cov( X+Y, Z ) = Cov(X,Z) + Cov(Y,Z) \end{align}\]

where \(X+Y\) is defined as adding together each datapoint in X and in Y, or \(X+Y = [ x_{1}+y_{1}, x_{2}+y_{2}, \cdots ]\).

Problem four

The traditional mean can easily be pushed or pulled when our dataset includes observations that are extremely far away from the majority of our data. These points are often called outliers.

Robust estimators of central tendency aim to return similar values between a dataset \(\mathcal{D}\) that does and does not include outliers. We saw one example already: the median. Let's look at two more robust estimators of central tendency.

(a) The trimmed mean. The X% trimmed mean of the dataset \(\mathcal{D}\) first removes the largest and smallest X% of values and then computes the mean of this new dataset.

(b) The winsorized mean. The X% winsorized mean is computed with the following algorithm.

  1. Compute the Xth percentile and call this \(u\) for “upper”.

  2. Compute the (100-X)th percentile and call this \(l\) for “lower”.

  3. Replace any values in the dataset below \(l\) with the value \(l\).

  4. Replace any values in the dataset above \(u\) with the value \(u\).

  5. Compute the mean of this new dataset.

Do not use an existing winsorize function from scipy to complete the following exercise.

Write a function that inputs a list of data and the percentile X. The function should return the X% winsorized mean. Compute the 95% winsorized mean for the dataset below, called d.

#--build a dataset d that has 500 "typical values" and 50 outliers.
d_part_one = np.random.normal(0,1, size=(500))
d_part_two = np.random.normal(10,0.1, size=(50))
d          = np.append(d_part_one, d_part_two)

#--this is the mean of the dataset d
np.mean(d)
np.float64(0.9160420062124783)
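The five steps above might be sketched as follows (one possible implementation; numpy.percentile and numpy.clip handle the percentile and replacement steps):

```python
import numpy as np

def winsorized_mean(values, x):
    """X% winsorized mean, following the five-step algorithm above."""
    arr = np.asarray(values, dtype=float)
    u   = np.percentile(arr, x)        # step one: Xth percentile ("upper")
    l   = np.percentile(arr, 100 - x)  # step two: (100-X)th percentile ("lower")
    clipped = np.clip(arr, l, u)       # steps three and four: pull values into [l, u]
    return float(np.mean(clipped))     # step five: the mean of the new dataset
```

Applied to the dataset d above, winsorized_mean(d, 95) returns the 95% winsorized mean, which can then be compared against np.mean(d).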

Problem five

(a) The goal for this problem is to add columns to the “disasters” dataset below to make selecting specific types of disasters easier. Use the template code in the notes to add columns for: Floods, Cyclones, Hurricanes, Tornadoes, and Severe Storms.

(b) Compute the mean, median, standard deviation, and interquartile range (including the 25th and 75th percentiles) for Floods, Cyclones, Hurricanes, Tornadoes, and Severe Storms. From this exploratory data analysis, which type of disaster appears to be the most costly?

import pandas as pd 
import numpy as np 

disasters = pd.read_csv("https://raw.githubusercontent.com/computationalUncertaintyLab/dexp_book/refs/heads/main/events-US-1980-2024.csv")
disasters = disasters.assign(Year = lambda x: x["End Date"].astype(str).str[:4].astype(int) )
disasters = disasters.loc[ disasters["Unadjusted Cost"] !="TBD" ]

#--change these to floating values
disasters["CPI-Adjusted Cost"] = disasters["CPI-Adjusted Cost"].astype(float)
disasters["Unadjusted Cost"]   = disasters["Unadjusted Cost"].astype(float)
disasters
Name Disaster Begin Date End Date CPI-Adjusted Cost Unadjusted Cost Deaths Year
0 Southern Severe Storms and Flooding (April 1980) Flooding 19800410 19800417 2742.3 706.8 7 1980
1 Hurricane Allen (August 1980) Tropical Cyclone 19800807 19800811 2230.2 590.0 13 1980
2 Central/Eastern Drought/Heat Wave (Summer-Fall... Drought 19800601 19801130 40480.8 10020.0 1260 1980
3 Florida Freeze (January 1981) Freeze 19810112 19810114 2070.6 572.0 0 1981
4 Severe Storms, Flash Floods, Hail, Tornadoes (... Severe Storm 19810505 19810510 1405.2 401.4 20 1981
... ... ... ... ... ... ... ... ...
393 Central and Northeast Severe Weather (June 2024) Severe Storm 20240624 20240626 1704.0 1704.0 3 2024
394 New Mexico Wildfires (June 2024) Wildfire 20240617 20240707 1700.0 1700.0 2 2024
395 Hurricane Beryl (July 2024) Tropical Cyclone 20240708 20240708 7219.0 7219.0 45 2024
396 Central and Eastern Tornado Outbreak and Sever... Severe Storm 20240713 20240716 2435.0 2435.0 2 2024
397 Hurricane Debby (August 2024) Tropical Cyclone 20240805 20240809 2476.0 2476.0 10 2024

398 rows × 8 columns
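As a sketch of the kind of indicator columns Problem five asks for (hedged: the tiny stand-in frame below reuses the Name column from the table above, and the substring matches are illustrative, not a definitive classification):

```python
import pandas as pd

# Tiny stand-in frame with the same "Name" column as the disasters table
df = pd.DataFrame({"Name": ["Hurricane Allen (August 1980)",
                            "Southern Severe Storms and Flooding (April 1980)"]})

# One indicator column per disaster type, flagged by a substring match on the name
df = df.assign(Hurricane = df["Name"].str.contains("Hurricane").astype(int),
               Flood     = df["Name"].str.contains("Flood").astype(int))
```

The same assign pattern extends to Cyclones, Tornadoes, and Severe Storms.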

Dates

Dates are difficult to work with on a computer, but there are many functions in Python to help. Our goal will be to convert the begin date for each disaster to a datetime object and then create a new column that determines whether the disaster started between 1980 and 1990, 1991 and 2000, and so on.

Datetime objects

Datetime objects, like any other object in Python, have a special set of functions for computing typical tasks with dates. The most common module in Python is datetime, and you can import it like from datetime import datetime, timedelta.

Parsing Dates

In the disasters data frame, Begin Date is stored as a number. We want to convert this number into a datetime object. The most common way to do this is the strptime function, which stands for “string parse time”. Because strptime operates on strings, the number must first be converted to a string. The inputs to strptime are a string that contains the date you wish to parse and a description of how this date was formatted. There are special symbols used to tell Python how your date was formatted; a list of these format “directives” is available in the datetime module's documentation.


Example of using strptime

We can convert the following string “2020-03-20” into a datetime object.


from datetime import datetime, timedelta
date_object = datetime.strptime("2020-03-20", "%Y-%m-%d")

One of the attributes of datetime objects is the attribute year. This will extract the year from our datetime object.

date_object.year
2020
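The Begin Date values in the disasters table (e.g. 19800410) use a different layout than the example above, so the format string changes to "%Y%m%d":

```python
from datetime import datetime

# Begin Date values look like 19800410: four-digit year, two-digit month, two-digit day
date_object = datetime.strptime("19800410", "%Y%m%d")
year = date_object.year  # 1980
```

This is the same pattern Problem six below asks you to wrap in a function.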

Problem six

(a) Create a function that inputs a string with the format “%Y%m%d” and outputs the year. You will need to import the datetime module. Call this function from_str_to_dt.

(b) Apply from_str_to_dt using the assign function in pandas to create a new column in the disasters data frame called “Year”.

(c) Use the assign function to create a new column called “above2000” in the disasters dataframe. This column will equal the value one if the year of the disaster was greater than 2000 and 0 otherwise.

from datetime import datetime

#--Begin Date must be a string before it can be parsed with strptime
disasters['Begin Date'] = disasters['Begin Date'].astype(str)

def from_str_to_dt(date):
    date_obj = datetime.strptime(date, "%Y%m%d")
    return date_obj.year

disasters = disasters.assign(Year = disasters['Begin Date'].apply(from_str_to_dt))

print(disasters[['Begin Date', 'Year']].head())
  Begin Date  Year
0   19800410  1980
1   19800807  1980
2   19800601  1980
3   19810112  1981
4   19810505  1981