Homework 03

The goal for this homework will be to become more skilled at building visual summaries of data. This is an important part of exploratory data analysis, as a graph can both summarize and give details that may be too long to describe textually.

Problem one#

The column titled “target” identifies patients who were diagnosed with (value 1) or without (value 0) heart disease.
Let's use visuals to explore patient attributes that may help identify patients with (or without) heart disease.

Task: Build visuals to explore differences between patients with and without heart disease for the following variables in the heart disease dataframe: (1) age, (2) sex, (3) trestbps, (4) chol, (5) restecg. You are free to select the visual that you think best describes each relationship.

import pandas as pd 

heart_disease = pd.read_csv("heart.csv")
heart_disease

	age	sex	cp	trestbps	chol	fbs	restecg	thalach	exang	oldpeak	slope	ca	thal	target
0	52	1	0	125	212	0	1	168	0	1.0	2	2	3	0
1	53	1	0	140	203	1	0	155	1	3.1	0	0	3	0
2	70	1	0	145	174	0	1	125	1	2.6	0	0	3	0
3	61	1	0	148	203	0	1	161	0	0.0	2	1	3	0
4	62	0	0	138	294	1	1	106	0	1.9	1	3	2	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1020	59	1	1	140	221	0	1	164	1	0.0	2	0	2	1
1021	60	1	0	125	258	0	0	141	1	2.8	1	1	3	0
1022	47	1	0	110	275	0	0	118	1	1.0	1	1	2	0
1023	50	0	0	110	254	0	0	159	0	0.0	2	0	2	1
1024	54	1	0	120	188	0	1	113	0	1.4	1	1	3	0

1025 rows × 14 columns
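
As one possible starting point for Problem one, the sketch below compares two groups on a continuous variable (one box per group) and on a categorical variable (proportions per group). It uses a small made-up dataframe: the names toy, group, measure, and category are hypothetical and are not columns of heart.csv, and other visuals may suit the real data better.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up data for illustration only (not the heart disease dataset).
toy = pd.DataFrame({
    "group":    [0, 0, 0, 0, 1, 1, 1, 1],
    "measure":  [50, 55, 61, 58, 42, 47, 51, 45],
    "category": ["a", "b", "a", "a", "b", "b", "a", "b"],
})

fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(8, 3))

# Continuous variable: draw one box per group.
values_by_group = [
    toy.loc[toy["group"] == g, "measure"].to_numpy() for g in (0, 1)
]
ax_left.boxplot(values_by_group)
ax_left.set_xticks([1, 2])
ax_left.set_xticklabels(["group 0", "group 1"])
ax_left.set_ylabel("measure")

# Categorical variable: proportion of each category within each group.
proportions = pd.crosstab(toy["group"], toy["category"], normalize="index")
proportions.plot(kind="bar", stacked=True, ax=ax_right)
ax_right.set_xlabel("group")
ax_right.set_ylabel("proportion")
```

Normalizing within each group (normalize="index") keeps the comparison fair when the two groups have different sizes.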

Problem two#

H5 bird flu is widespread in wild birds worldwide and is causing outbreaks in poultry and U.S. dairy cows with several recent human cases in U.S. dairy and poultry workers.

While the current public health risk is low, CDC is watching the situation carefully and working with states to monitor people with animal exposures.

CDC is using its flu surveillance systems to monitor for H5 bird flu activity in people.

More information can be found here = H5

The code below downloads the monthly number of H5 cases in humans for each country in the world from 1997 to the present. We will use this dataset to explore the number of cases in the United States.

import pandas as pd
import requests

# Fetch the data.
df = pd.read_csv("https://ourworldindata.org/grapher/h5n1-flu-reported-cases.csv?v=1&csvType=full&useColumnShortNames=true", storage_options = {'User-Agent': 'Our World In Data data fetch/1.0'})

# Fetch the metadata
metadata = requests.get("https://ourworldindata.org/grapher/h5n1-flu-reported-cases.metadata.json?v=1&csvType=full&useColumnShortNames=true").json()

print(df)

       Entity      Code         Day  avian_cases_month
    Africa       NaN  1997-01-01                  0
    Africa       NaN  1997-02-01                  0
    Africa       NaN  1997-03-01                  0
    Africa       NaN  1997-04-01                  0
    Africa       NaN  1997-05-01                  0
...       ...       ...         ...                ...
 World  OWID_WRL  2024-10-01                 30
 World  OWID_WRL  2024-11-01                 14
 World  OWID_WRL  2024-12-01                  9
 World  OWID_WRL  2025-01-01                  2
 World  OWID_WRL  2025-02-01                  4

[10478 rows x 4 columns]

Problem two Task 1#

Subset the dataframe df to H5 cases in the United States on or after “2022-01-01”. You’ll need to use loc. Call this subsetted dataframe us.
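
As a reminder of how loc combines conditions, here is a minimal sketch on a made-up dataframe. The column names mirror the printed output above, but the rows and the choice of country are invented for illustration.

```python
import pandas as pd

# Toy stand-in for df; the values are made up.
toy = pd.DataFrame({
    "Entity": ["Canada", "Canada", "Mexico", "Mexico"],
    "Day": ["2021-12-01", "2022-01-01", "2021-12-01", "2022-02-01"],
    "avian_cases_month": [0, 1, 2, 3],
})

# Parse the date strings so the comparison below is a true date comparison.
toy["Day"] = pd.to_datetime(toy["Day"])

# Each condition is a boolean Series; & keeps the rows where both are True.
is_mexico = toy["Entity"] == "Mexico"
is_recent = toy["Day"] >= pd.Timestamp("2022-01-01")

subset = toy.loc[is_mexico & is_recent]
print(subset)
```

Naming each condition before combining them keeps the loc call short and makes each filter easy to check on its own.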

Problem two Task 2#

Draw a barplot using seaborn such that the horizontal axis describes months and the vertical axis the number of H5 cases. Make sure to label the horizontal and vertical axes.

Problem two Task 3#

The xtick labels are cluttered, making it hard to tell which months we are presenting to the reader. Seaborn doesn’t make it immediately clear how it placed these ticks. Luckily, seaborn is built on matplotlib, so we can inspect and adjust the axis directly.

Get#

Many of the attributes of an axis can be extracted by calling a method whose name is “get”, an underscore, and then the attribute. For example, if we want to see how the xticks and xtick labels were created for the above plot, where ax is the matplotlib axis that seaborn drew on, we can write

xticks      = ax.get_xticks()
xticklabels = ax.get_xticklabels()

Task: Set the xticks such that the first xtick is drawn, then the third, fifth, and so on. Use the keyword rotation to set the xticklabels such that they are rotated 45 degrees.
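
The get and set pattern can be sketched on a small made-up bar chart. This example uses matplotlib directly rather than seaborn, and the months and counts are invented. It assumes the bars were drawn from string categories, in which case the tick positions are 0, 1, 2, and so on.

```python
import matplotlib.pyplot as plt

# Made-up data for illustration only.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
counts = [3, 1, 4, 1, 5, 9]

fig, ax = plt.subplots()
ax.bar(months, counts)

# "get": read the current tick positions (one per category here).
positions = list(ax.get_xticks())

# Keep every other tick, starting with the first, and the matching labels.
kept_positions = positions[::2]
kept_labels = months[::2]

# "set": write the thinned ticks back, rotating the labels 45 degrees.
ax.set_xticks(kept_positions)
rotated = ax.set_xticklabels(kept_labels, rotation=45)

ax.set_xlabel("Month")
ax.set_ylabel("Count")
```

Slicing with [::2] takes the first, third, fifth, ... elements, so the positions and labels stay aligned as long as both are sliced the same way.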

Problem Three#

One way to measure the association between two amount or balance variables is by using the product-moment correlation. The product-moment correlation is computed as

\[\begin{align} \rho(X,Y) = N^{-1} \sum_{i=1}^{N} \frac{(x_{i} - \bar{x})(y_{i} - \bar{y})}{ \text{sd}(X)\text{sd}(Y) } \end{align}\]

Though the above computation looks opaque, the product-moment correlation (also called the correlation coefficient) has an intuitive explanation. A single data point contributes the following term to the correlation coefficient

\[\begin{align} \frac{(x_{i} - \bar{x})(y_{i} - \bar{y})}{ \text{sd}(X)\text{sd}(Y)} \end{align}\]

The first term \(x_{i} - \bar{x}\) measures the value of \(x_{i}\) compared to its mean. If \(x_{i}\) is greater than the mean then this expression is positive. If \(x_{i}\) is less than the mean then this expression is negative. The second term \(y_{i} - \bar{y}\) is interpreted the same way.

The product \((x_{i} - \bar{x})(y_{i} - \bar{y})\) will be large and positive when \(x_{i}\) and \(y_{i}\) both lie far from their means on the same side, and large and negative when they lie far from their means on opposite sides. Otherwise the contribution of this data point, represented as the pair \((x_{i},y_{i})\), will be small.
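
To make the per-point contributions concrete, the sketch below computes them for five made-up pairs and averages them. It assumes that sd is the population standard deviation (dividing by \(N\), which is NumPy's default), since the text does not say which version is intended.

```python
import numpy as np

# Made-up data for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Assumption: population standard deviation (ddof=0, NumPy's default).
sd_x = x.std()
sd_y = y.std()

# One contribution per data point (x_i, y_i).
contributions = (x - x.mean()) * (y - y.mean()) / (sd_x * sd_y)

# The correlation coefficient is the average contribution.
rho = contributions.mean()

print(contributions)  # approximately [1, 1, 0, 0, 2]
print(rho)            # approximately 0.8
```

Points three and four contribute nothing because one of their two coordinates sits exactly at its mean, while the last point, far above both means, contributes the most.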

Problem Three Task 1#

Intuitively, two variables \(X\) and \(Y\) are the “most correlated” they could be when their values are equal. That is, when \(\mathcal{D} = ( (x_{1},x_{1}),(x_{2},x_{2}), \cdots, (x_{N},x_{N}) )\).

Please show what the correlation coefficient equals under this condition.

Problem Three Task 2#

It is also possible that the two variables \(X\) and \(Y\) are perfectly correlated but opposite of one another. In other words, whenever we observe the value \(x\) for \(X\) we observe the value \(-x\) for \(Y\). Show what the correlation coefficient would equal under this condition.

Problem Three Task 3#

Another way to understand the correlation coefficient is to first form the variables

\[\begin{align} z_{x_{i}} &= \frac{x_{i} - \bar{x} }{\text{sd}(X)}\\ z_{y_{i}} &= \frac{y_{i} - \bar{y} }{\text{sd}(Y)} \end{align}\]

Replace \(X\) with \(Z_{X}\) and \(Y\) with \(Z_{Y}\) in the expression \(\rho(X,Y)\).

Problem Three Task 4#

Let's look again at the association between age and cholesterol levels for the heart disease data.

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.scatter(heart_disease["age"], heart_disease["chol"])

ax.set_xlabel("Age")
ax.set_ylabel("Cholesterol")

plt.show()

Task: Use the assign function in pandas to create two new columns:

z_age, which will be each patient's age minus the mean age, divided by the standard deviation of age.
z_chol, which will be each patient's cholesterol level minus the mean cholesterol level, divided by the standard deviation of cholesterol.
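
As a reminder of how assign works, here is a minimal sketch that standardizes a single made-up column. The names toy, height, and z_height are hypothetical. Note that the pandas std method divides by \(N-1\) by default; passing ddof=0 gives the population version instead, and which one you want is an assumption to state in your answer.

```python
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({"height": [150.0, 160.0, 170.0, 180.0, 190.0]})

# assign returns a new dataframe with the extra column added.
# The lambda receives the dataframe, so the column can refer to existing ones.
toy = toy.assign(
    z_height=lambda d: (d["height"] - d["height"].mean()) / d["height"].std()
)

print(toy)
```

After standardizing, the new column has mean 0 and (with the same ddof) standard deviation 1, which is a quick way to check the result.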

Problem Three Task 5#

Use a scatter plot to visualize z_age and z_chol.

Problem Three Task 6#

Compute \(\rho\) for age and cholesterol.