{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "fd8364a5-6fbf-4a69-b004-87d951ac0247",
   "metadata": {},
   "source": [
    "# Homework 03 \n",
    "\n",
    "The goal for this homework will be to become more skilled at building visual summaries of data. \n",
    "This is an important part of exploratory data analysis, as a graph can both summarize and give details that may be too long to describe textually. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "067f1df6-f805-400f-a5f3-30a3ffbedfb3",
   "metadata": {},
   "source": [
    "## Problem one \n",
    "\n",
    "The column titled \"target\" identifies patients who were diagnosed with (value 1) or without (value 0) heart disease.  \n",
    "Let's use visuals to explore patient attributes that may help identify patients with (or without) heart disease. \n",
    "\n",
    "**Task** \n",
    "Build visuals to explore differences among patients with or without heart disease for the following variables in the ```heart_disease``` dataframe: (1) age, (2) sex, (3) trestbps, (4) chol, (5) restecg. \n",
    "You are free to select the visual that you think best describes this relationship. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "id": "ff177962-6a8c-4ac2-83e2-e211a1c783c9",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>age</th>\n",
       "      <th>sex</th>\n",
       "      <th>cp</th>\n",
       "      <th>trestbps</th>\n",
       "      <th>chol</th>\n",
       "      <th>fbs</th>\n",
       "      <th>restecg</th>\n",
       "      <th>thalach</th>\n",
       "      <th>exang</th>\n",
       "      <th>oldpeak</th>\n",
       "      <th>slope</th>\n",
       "      <th>ca</th>\n",
       "      <th>thal</th>\n",
       "      <th>target</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>52</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>125</td>\n",
       "      <td>212</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>168</td>\n",
       "      <td>0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>53</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>140</td>\n",
       "      <td>203</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>155</td>\n",
       "      <td>1</td>\n",
       "      <td>3.1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>70</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>145</td>\n",
       "      <td>174</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>125</td>\n",
       "      <td>1</td>\n",
       "      <td>2.6</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>61</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>148</td>\n",
       "      <td>203</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>161</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>62</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>138</td>\n",
       "      <td>294</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>106</td>\n",
       "      <td>0</td>\n",
       "      <td>1.9</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1020</th>\n",
       "      <td>59</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>140</td>\n",
       "      <td>221</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>164</td>\n",
       "      <td>1</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1021</th>\n",
       "      <td>60</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>125</td>\n",
       "      <td>258</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>141</td>\n",
       "      <td>1</td>\n",
       "      <td>2.8</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1022</th>\n",
       "      <td>47</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>110</td>\n",
       "      <td>275</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>118</td>\n",
       "      <td>1</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1023</th>\n",
       "      <td>50</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>110</td>\n",
       "      <td>254</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>159</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1024</th>\n",
       "      <td>54</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>120</td>\n",
       "      <td>188</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>113</td>\n",
       "      <td>0</td>\n",
       "      <td>1.4</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>1025 rows × 14 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  \\\n",
       "0      52    1   0       125   212    0        1      168      0      1.0   \n",
       "1      53    1   0       140   203    1        0      155      1      3.1   \n",
       "2      70    1   0       145   174    0        1      125      1      2.6   \n",
       "3      61    1   0       148   203    0        1      161      0      0.0   \n",
       "4      62    0   0       138   294    1        1      106      0      1.9   \n",
       "...   ...  ...  ..       ...   ...  ...      ...      ...    ...      ...   \n",
       "1020   59    1   1       140   221    0        1      164      1      0.0   \n",
       "1021   60    1   0       125   258    0        0      141      1      2.8   \n",
       "1022   47    1   0       110   275    0        0      118      1      1.0   \n",
       "1023   50    0   0       110   254    0        0      159      0      0.0   \n",
       "1024   54    1   0       120   188    0        1      113      0      1.4   \n",
       "\n",
       "      slope  ca  thal  target  \n",
       "0         2   2     3       0  \n",
       "1         0   0     3       0  \n",
       "2         0   0     3       0  \n",
       "3         2   1     3       0  \n",
       "4         1   3     2       0  \n",
       "...     ...  ..   ...     ...  \n",
       "1020      2   0     2       1  \n",
       "1021      1   1     3       0  \n",
       "1022      1   1     2       0  \n",
       "1023      2   0     2       1  \n",
       "1024      1   1     3       0  \n",
       "\n",
       "[1025 rows x 14 columns]"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd \n",
    "\n",
    "heart_disease = pd.read_csv(\"heart.csv\")\n",
    "heart_disease"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4f3a2469-396a-48d0-9844-f169286a2311",
   "metadata": {},
   "source": [
    "## Problem two\n",
    "\n",
    "H5 bird flu is widespread in wild birds worldwide and is causing outbreaks in poultry and U.S. dairy cows with several recent human cases in U.S. dairy and poultry workers.\n",
    "\n",
    "While the current public health risk is low, CDC is watching the situation carefully and working with states to monitor people with animal exposures.\n",
    "\n",
    "CDC is using its flu surveillance systems to monitor for H5 bird flu activity in people.\n",
    "\n",
    "More information can be found on the CDC's [H5 monitoring page](https://www.cdc.gov/bird-flu/h5-monitoring/index.html).\n",
    "\n",
    "The code below downloads the monthly number of reported H5 cases in humans, by country, from 1997 to the present.\n",
    "We will use this dataset to explore the number of cases in the United States. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "id": "354928dd-e517-49a4-8ca2-42e67d32be93",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "       Entity      Code         Day  avian_cases_month\n",
      "0      Africa       NaN  1997-01-01                  0\n",
      "1      Africa       NaN  1997-02-01                  0\n",
      "2      Africa       NaN  1997-03-01                  0\n",
      "3      Africa       NaN  1997-04-01                  0\n",
      "4      Africa       NaN  1997-05-01                  0\n",
      "...       ...       ...         ...                ...\n",
      "10349   World  OWID_WRL  2024-06-01                  0\n",
      "10350   World  OWID_WRL  2024-07-01                 13\n",
      "10351   World  OWID_WRL  2024-08-01                  2\n",
      "10352   World  OWID_WRL  2024-09-01                  1\n",
      "10353   World  OWID_WRL  2024-10-01                 13\n",
      "\n",
      "[10354 rows x 4 columns]\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import requests\n",
    "\n",
    "# Fetch the data.\n",
    "df = pd.read_csv(\"https://ourworldindata.org/grapher/h5n1-flu-reported-cases.csv?v=1&csvType=full&useColumnShortNames=true\", storage_options = {'User-Agent': 'Our World In Data data fetch/1.0'})\n",
    "\n",
    "# Fetch the metadata\n",
    "metadata = requests.get(\"https://ourworldindata.org/grapher/h5n1-flu-reported-cases.metadata.json?v=1&csvType=full&useColumnShortNames=true\").json()\n",
    "\n",
    "print(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1da062a1-fa8b-412e-86d5-ff5bdf661e62",
   "metadata": {},
   "source": [
    "### Problem two Task 1\n",
    "\n",
    "Subset the dataframe ```df``` to H5 cases in the United States on or after \"2022-01-01\".\n",
    "You'll need to use ```loc```. \n",
    "Call this subsetted dataframe ```us```."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "753abbda-b025-4d72-8848-bc1b0770b44c",
   "metadata": {},
   "source": [
    "### Problem two Task 2\n",
    "\n",
    "Draw a barplot using seaborn such that the horizontal axis describes months and the vertical axis the number of H5 cases. \n",
    "Make sure to label the horizontal and vertical axes. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a16275e3-d7fa-4154-ab12-60e03ba75537",
   "metadata": {},
   "source": [
    "### Problem two Task 3\n",
    "\n",
    "The xtick labels are cluttered, making it hard to tell which months we are presenting to the reader. \n",
    "Seaborn does not make it obvious how it placed these ticks.\n",
    "Luckily, seaborn is built on top of matplotlib, so we can inspect the plot with matplotlib's own methods. \n",
    "\n",
    "#### Get \n",
    "\n",
    "Many of the attributes of an axis can be extracted by calling a method whose name is \"get\", an underscore, and then the attribute. \n",
    "For example, if we want to see how the xticks and xtick labels were created for the above plot, we can write"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "id": "4f88f29e-b2d1-437c-b509-5ddac44d23e6",
   "metadata": {},
   "outputs": [],
   "source": [
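    "# Assumes `ax` is the Axes object returned by your seaborn barplot in Task 2.\n",
    "xticks      = ax.get_xticks()       # tick positions along the horizontal axis\n",
    "xticklabels = ax.get_xticklabels()  # Text objects holding the tick labels"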
   ]
  },
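  {
   "cell_type": "markdown",
   "id": "c2f4a6b8-1d3e-4f50-8a7b-9c0d1e2f3a4b",
   "metadata": {},
   "source": [
    "The getter pattern above can be tried on a small, self-contained plot. \n",
    "The sketch below is only an illustration: the month names and counts are made up, and it uses a plain matplotlib bar chart rather than the seaborn plot from Task 2. \n",
    "\n",
    "```python\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# Toy data, made up for illustration only.\n",
    "months = ['Jan', 'Feb', 'Mar']\n",
    "counts = [3, 1, 2]\n",
    "\n",
    "fig, ax = plt.subplots()\n",
    "ax.bar(months, counts)\n",
    "ax.set_xlabel('Month')\n",
    "\n",
    "# Draw the figure so that the tick labels are filled in before we read them.\n",
    "fig.canvas.draw()\n",
    "\n",
    "xticks = ax.get_xticks()                               # tick positions\n",
    "labels = [t.get_text() for t in ax.get_xticklabels()]  # tick label text\n",
    "xlabel = ax.get_xlabel()                               # axis label\n",
    "\n",
    "print(xticks, labels, xlabel)\n",
    "```\n",
    "\n",
    "The getters used here have matching setters (```ax.set_xticks```, ```ax.set_xticklabels```, ```ax.set_xlabel```), which is how the values they report can be changed. "
   ]
  },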
  {
   "cell_type": "markdown",
   "id": "5e1223f9-c8b5-497a-8123-19464df88ce8",
   "metadata": {},
   "source": [
    "**Task** \n",
    "Set the xticks such that the first xtick is drawn, then the third, fifth, and so on. \n",
    "Use the keyword ```rotation``` to set the xticklabels so that they are rotated 45 degrees. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fd34dda0-d950-494b-9ca7-79d03fa5697f",
   "metadata": {},
   "source": [
    "## Problem Three\n",
    "\n",
    "One way to measure the association between two amount or balance variables is by using the **product-moment correlation**.\n",
    "The product-moment correlation is computed as \n",
    "\n",
    "\\begin{align}\n",
    "    \\rho(X,Y) = N^{-1} \\sum_{i=1}^{N} \\frac{(x_{i} - \\bar{x})(y_{i} - \\bar{y})}{ \\text{sd}(X)\\text{sd}(Y)  }\n",
    "\\end{align}\n",
    "\n",
    "Though the above computation looks opaque, the **product-moment correlation** (also called the correlation coefficient) has an intuitive explanation.\n",
    "A single data point $(x_{i},y_{i})$ contributes the following term to the correlation coefficient \n",
    "\n",
    "\\begin{align}\n",
    "    \\frac{(x_{i} - \\bar{x})(y_{i} - \\bar{y})}{ \\text{sd}(X)\\text{sd}(Y)}\n",
    "\\end{align}\n",
    "\n",
    "The first term $x_{i} - \\bar{x}$ measures the value of $x_{i}$ compared to its mean. \n",
    "If $x_{i}$ is greater than the mean, then this expression is positive. \n",
    "If $x_{i}$ is less than the mean, then this expression is negative. \n",
    "\n",
    "The product $(x_{i} - \\bar{x})(y_{i} - \\bar{y})$ will be large and positive when $(x_{i} - \\bar{x})$ and $(y_{i} - \\bar{y})$ are both large in magnitude and have the same sign, and large and negative when they are both large in magnitude but have opposite signs. If either deviation is close to zero, the contribution of this data point, represented as the pair $(x_{i},y_{i})$, will be small. "
   ]
  },
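  {
   "cell_type": "markdown",
   "id": "d3a5b7c9-2e4f-4061-9b8c-0d1e2f3a4b5c",
   "metadata": {},
   "source": [
    "To make the per-point contributions concrete, here is a minimal sketch on a small made-up dataset (the values of ```x``` and ```y``` are not from the homework data). \n",
    "It assumes that $\\text{sd}$ is the population standard deviation, which divides by $N$ and so pairs with the $N^{-1}$ in the formula above. \n",
    "\n",
    "```python\n",
    "from statistics import mean, pstdev\n",
    "\n",
    "# Toy data, made up for illustration only.\n",
    "x = [1.0, 2.0, 3.0, 4.0, 5.0]\n",
    "y = [2.0, 1.0, 4.0, 3.0, 5.0]\n",
    "\n",
    "x_bar, y_bar = mean(x), mean(y)\n",
    "# Assumption: sd is the population standard deviation (divides by N).\n",
    "sd_x, sd_y = pstdev(x), pstdev(y)\n",
    "\n",
    "# Contribution of each data point (x_i, y_i) to the correlation coefficient.\n",
    "contributions = [\n",
    "    (xi - x_bar) * (yi - y_bar) / (sd_x * sd_y)\n",
    "    for xi, yi in zip(x, y)\n",
    "]\n",
    "\n",
    "# The correlation coefficient is the average contribution.\n",
    "rho = sum(contributions) / len(x)\n",
    "\n",
    "print(contributions)\n",
    "print(rho)\n",
    "```\n",
    "\n",
    "Points far from both means, such as $(5, 5)$, contribute the most, while a point that sits at the mean of either variable contributes nothing. "
   ]
  },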
  {
   "cell_type": "markdown",
   "id": "e5ae4a98-f245-4d6c-8f87-6c58917fb7c5",
   "metadata": {},
   "source": [
    "### Problem Three Task 1\n",
    "\n",
    "Intuitively, two variables $X$ and $Y$ are \"most correlated\" when their values are equal.\n",
    "That is, when $\\mathcal{D} = ( (x_{1},x_{1}),(x_{2},x_{2}), \\cdots, (x_{N},x_{N}) )$. \n",
    "\n",
    "Please show what the correlation coefficient equals under this condition. \n",
    "\n",
    "### Problem Three Task 2\n",
    "\n",
    "It is also possible that the two variables $X$ and $Y$ are perfectly correlated but opposite of one another. \n",
    "In other words, whenever we observe the value $x$ for $X$, we observe the value $-x$ for $Y$.\n",
    "Show what the correlation coefficient would equal under this condition. \n",
    "\n",
    "### Problem Three Task 3\n",
    "\n",
    "Another way to understand the correlation coefficient is to first form the variables \n",
    "\n",
    "\\begin{align}\n",
    "    z_{x_{i}} &= \\frac{x_{i} - \\bar{x} }{\\text{sd}(X)}\\\\\n",
    "    z_{y_{i}} &= \\frac{y_{i} - \\bar{y} }{\\text{sd}(Y)}\n",
    "\\end{align}\n",
    "\n",
    "Replace $X$ with $Z_{X}$ and $Y$ with $Z_{Y}$ in the expression $\\rho(X,Y)$.\n",
    "\n",
    "### Problem Three Task 4\n",
    "\n",
    "Let's look again at the association between age and cholesterol levels for the heart disease data. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "37970238-caf3-4406-831d-f9d815f6cf64",
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "\n",
    "fig, ax = plt.subplots()\n",
    "ax.scatter(heart_disease[\"age\"], heart_disease[\"chol\"])\n",
    "\n",
    "ax.set_xlabel(\"Age\")\n",
    "ax.set_ylabel(\"Cholesterol\")\n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4458c280-c0cc-4117-9715-5686c8e494bd",
   "metadata": {},
   "source": [
    "**Task**: Use the ```assign``` function in pandas to create two new columns:\n",
    "1. z_age, which is each patient's age minus the mean age, divided by the standard deviation of age.\n",
    "2. z_chol, which is each patient's cholesterol level minus the mean cholesterol level, divided by the standard deviation of cholesterol level."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a0c1eb70-18d6-454d-8956-b1c1e098f8a3",
   "metadata": {},
   "source": [
    "### Problem Three Task 5\n",
    "Use a scatter plot to visualize z_age and z_chol. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bff278b4-7010-4cfb-b80d-a1c41d130baf",
   "metadata": {},
   "source": [
    "### Problem Three Task 6\n",
    "Compute $\\rho$ for age and cholesterol. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4188dd52-f090-4102-b050-3928153997df",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.16"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}