{ "cells": [ { "cell_type": "markdown", "id": "fd8364a5-6fbf-4a69-b004-87d951ac0247", "metadata": {}, "source": [ "# Homework 03 \n", "\n", "The goal for this homework will be to become more skilled at building visual summaries of data. \n", "This is an important part of exploratory data analysis, as a graph can both summarize the data and convey details that would take too long to describe in text. " ] }, { "cell_type": "markdown", "id": "067f1df6-f805-400f-a5f3-30a3ffbedfb3", "metadata": {}, "source": [ "## Problem one \n", "\n", "The column titled \"target\" identifies patients who were diagnosed with (value 1) or without (value 0) heart disease. \n", "Let's use visuals to explore patient attributes that may help identify patients with (or without) heart disease. \n", "\n", "**Task** \n", "Build visuals to explore differences among patients with or without heart disease for the following variables in the ```heart_disease``` dataframe: (1) age, (2) sex, (3) trestbps, (4) chol, (5) restecg. \n", "You are free to select the visual that you think best describes each relationship. " ] }, { "cell_type": "code", "execution_count": 36, "id": "ff177962-6a8c-4ac2-83e2-e211a1c783c9", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>age</th><th>sex</th><th>cp</th><th>trestbps</th><th>chol</th><th>fbs</th><th>restecg</th><th>thalach</th><th>exang</th><th>oldpeak</th><th>slope</th><th>ca</th><th>thal</th><th>target</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>52</td><td>1</td><td>0</td><td>125</td><td>212</td><td>0</td><td>1</td><td>168</td><td>0</td><td>1.0</td><td>2</td><td>2</td><td>3</td><td>0</td></tr>\n",
"    <tr><th>1</th><td>53</td><td>1</td><td>0</td><td>140</td><td>203</td><td>1</td><td>0</td><td>155</td><td>1</td><td>3.1</td><td>0</td><td>0</td><td>3</td><td>0</td></tr>\n",
"    <tr><th>2</th><td>70</td><td>1</td><td>0</td><td>145</td><td>174</td><td>0</td><td>1</td><td>125</td><td>1</td><td>2.6</td><td>0</td><td>0</td><td>3</td><td>0</td></tr>\n",
"    <tr><th>3</th><td>61</td><td>1</td><td>0</td><td>148</td><td>203</td><td>0</td><td>1</td><td>161</td><td>0</td><td>0.0</td><td>2</td><td>1</td><td>3</td><td>0</td></tr>\n",
"    <tr><th>4</th><td>62</td><td>0</td><td>0</td><td>138</td><td>294</td><td>1</td><td>1</td><td>106</td><td>0</td><td>1.9</td><td>1</td><td>3</td><td>2</td><td>0</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>1020</th><td>59</td><td>1</td><td>1</td><td>140</td><td>221</td><td>0</td><td>1</td><td>164</td><td>1</td><td>0.0</td><td>2</td><td>0</td><td>2</td><td>1</td></tr>\n",
"    <tr><th>1021</th><td>60</td><td>1</td><td>0</td><td>125</td><td>258</td><td>0</td><td>0</td><td>141</td><td>1</td><td>2.8</td><td>1</td><td>1</td><td>3</td><td>0</td></tr>\n",
"    <tr><th>1022</th><td>47</td><td>1</td><td>0</td><td>110</td><td>275</td><td>0</td><td>0</td><td>118</td><td>1</td><td>1.0</td><td>1</td><td>1</td><td>2</td><td>0</td></tr>\n",
"    <tr><th>1023</th><td>50</td><td>0</td><td>0</td><td>110</td><td>254</td><td>0</td><td>0</td><td>159</td><td>0</td><td>0.0</td><td>2</td><td>0</td><td>2</td><td>1</td></tr>\n",
"    <tr><th>1024</th><td>54</td><td>1</td><td>0</td><td>120</td><td>188</td><td>0</td><td>1</td><td>113</td><td>0</td><td>1.4</td><td>1</td><td>1</td><td>3</td><td>0</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>1025 rows × 14 columns</p>\n",
"</div>
" ], "text/plain": [ " age sex cp trestbps chol fbs restecg thalach exang oldpeak \\\n", "0 52 1 0 125 212 0 1 168 0 1.0 \n", "1 53 1 0 140 203 1 0 155 1 3.1 \n", "2 70 1 0 145 174 0 1 125 1 2.6 \n", "3 61 1 0 148 203 0 1 161 0 0.0 \n", "4 62 0 0 138 294 1 1 106 0 1.9 \n", "... ... ... .. ... ... ... ... ... ... ... \n", "1020 59 1 1 140 221 0 1 164 1 0.0 \n", "1021 60 1 0 125 258 0 0 141 1 2.8 \n", "1022 47 1 0 110 275 0 0 118 1 1.0 \n", "1023 50 0 0 110 254 0 0 159 0 0.0 \n", "1024 54 1 0 120 188 0 1 113 0 1.4 \n", "\n", " slope ca thal target \n", "0 2 2 3 0 \n", "1 0 0 3 0 \n", "2 0 0 3 0 \n", "3 2 1 3 0 \n", "4 1 3 2 0 \n", "... ... .. ... ... \n", "1020 2 0 2 1 \n", "1021 1 1 3 0 \n", "1022 1 1 2 0 \n", "1023 2 0 2 1 \n", "1024 1 1 3 0 \n", "\n", "[1025 rows x 14 columns]" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd \n", "\n", "heart_disease = pd.read_csv(\"heart.csv\")\n", "heart_disease" ] }, { "cell_type": "markdown", "id": "4f3a2469-396a-48d0-9844-f169286a2311", "metadata": {}, "source": [ "## Problem two\n", "\n", "H5 bird flu is widespread in wild birds worldwide and is causing outbreaks in poultry and U.S. dairy cows with several recent human cases in U.S. dairy and poultry workers.\n", "\n", "While the current public health risk is low, CDC is watching the situation carefully and working with states to monitor people with animal exposures.\n", "\n", "CDC is using its flu surveillance systems to monitor for H5 bird flu activity in people.\n", "\n", "More information can be found here = [H5](https://www.cdc.gov/bird-flu/h5-monitoring/index.html)\n", "\n", "The below code imports from the internet the number of monthly cases of H5 in humans for each country in the world from 1997 to present.\n", "We will use this dataset to explore the number of cases in the United States. 
" ] }, { "cell_type": "code", "execution_count": 37, "id": "354928dd-e517-49a4-8ca2-42e67d32be93", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Entity Code Day avian_cases_month\n", "0 Africa NaN 1997-01-01 0\n", "1 Africa NaN 1997-02-01 0\n", "2 Africa NaN 1997-03-01 0\n", "3 Africa NaN 1997-04-01 0\n", "4 Africa NaN 1997-05-01 0\n", "... ... ... ... ...\n", "10349 World OWID_WRL 2024-06-01 0\n", "10350 World OWID_WRL 2024-07-01 13\n", "10351 World OWID_WRL 2024-08-01 2\n", "10352 World OWID_WRL 2024-09-01 1\n", "10353 World OWID_WRL 2024-10-01 13\n", "\n", "[10354 rows x 4 columns]\n" ] } ], "source": [ "import pandas as pd\n", "import requests\n", "\n", "# Fetch the data.\n", "df = pd.read_csv(\"https://ourworldindata.org/grapher/h5n1-flu-reported-cases.csv?v=1&csvType=full&useColumnShortNames=true\", storage_options = {'User-Agent': 'Our World In Data data fetch/1.0'})\n", "\n", "# Fetch the metadata\n", "metadata = requests.get(\"https://ourworldindata.org/grapher/h5n1-flu-reported-cases.metadata.json?v=1&csvType=full&useColumnShortNames=true\").json()\n", "\n", "print(df)" ] }, { "cell_type": "markdown", "id": "1da062a1-fa8b-412e-86d5-ff5bdf661e62", "metadata": {}, "source": [ "### Problem two Task 1\n", "\n", "Subset the dataframe ```df``` to H5 cases in the United states on or after \"2022-01-01\".\n", "You'll need to use ```loc```. \n", "Call this subsetted dataframe ```us```." ] }, { "cell_type": "markdown", "id": "753abbda-b025-4d72-8848-bc1b0770b44c", "metadata": {}, "source": [ "### Problem two Task 2\n", "\n", "Draw a barplot using seaborn such that the horizontal axis describes months and the vertical axis the number of H5 cases. \n", "Make sure to label the horizontal and vertical axes. 
" ] }, { "cell_type": "markdown", "id": "a16275e3-d7fa-4154-ab12-60e03ba75537", "metadata": {}, "source": [ "### Problem two Task 3\n", "\n", "The xtick labels are cluttered, making it hard to tell which months we are presenting to the reader. \n", "Seaborn doesn't make it clear, right away, how it plotted this data.\n", "Luckily, seaborn uses as its foundation matplotlib. \n", "\n", "#### Get \n", "\n", "Many of the attributes of an axis can be extarcted by appending a \"get\", an underscore, and then the attribute. \n", "For example, if we want to see how the xticks and xtick labels were created for the above plot we can write" ] }, { "cell_type": "code", "execution_count": 38, "id": "4f88f29e-b2d1-437c-b509-5ddac44d23e6", "metadata": {}, "outputs": [], "source": [ "xticks = ax.get_xticks()\n", "xticklabels = ax.get_xticklabels()" ] }, { "cell_type": "markdown", "id": "5e1223f9-c8b5-497a-8123-19464df88ce8", "metadata": {}, "source": [ "**Task** \n", "Set the xticks such that the first xtick is drawn, then the third, fifth, and so on. \n", "Use the keyword ```rotation``` to set the xticklabels such that the are rotated 45 degrees. 
" ] }, { "cell_type": "markdown", "id": "fd34dda0-d950-494b-9ca7-79d03fa5697f", "metadata": {}, "source": [ "## Problem Three\n", "\n", "One way to measure the association between two amount or balance variables is by using the **product-moment correlation**.\n", "The product-moment correlation is computed as \n", "\n", "\\begin{align}\n", " \\rho(X,Y) = N^{-1} \\sum_{i=1}^{N} \\frac{(x_{i} - \\bar{x})(y_{i} - \\bar{y})}{ \\text{sd}(X)\\text{sd}(Y) }\n", "\\end{align}\n", "\n", "Though the above computations looks opaque, the **product-moment correlation** (also called the correlation coefficient) has an intuitive explanation.\n", "A single data point contributes to the correlation coefficient \n", "\n", "\\begin{align}\n", " \\frac{(x_{i} - \\bar{x})(y_{i} - \\bar{y})}{ \\text{sd}(X)\\text{sd}(Y)}\n", "\\end{align}\n", "\n", "The first term $x_{i} - \\bar{x}$ measures the value of $x_{i}$ compared to its mean. \n", "If $x_{i}$ is greater than the mean than this expression is positive. \n", "If $x_{i}$ is less than the mean that this expression is negative. \n", "\n", "The product $(x_{i} - \\bar{x})(y_{i} - \\bar{y})$ will be large when both $(x_{i} - \\bar{x})$ and $(y_{i} - \\bar{y})$ are large. Otherwise the contribution of this data point---represented as the pair $(x_{i},y_{i})$---will be small. " ] }, { "cell_type": "markdown", "id": "e5ae4a98-f245-4d6c-8f87-6c58917fb7c5", "metadata": {}, "source": [ "### Problem Three Task 1\n", "\n", "Intuitively, the \"most correlated\" two variables $X$ and $Y$ could be is if their values are equal.\n", "That is, if $\\mathcal{D} = ( (x_{1},x_{1}),(x_{2},x_{2}), \\cdots, (x_{n},x_{n}) )$. \n", "\n", "Please show the value of the correlation coefficient equals under this condition. \n", "\n", "### Problem Three Task 2\n", "\n", "It is also possible that the two variables $X$ $Y$ can be perfectly correlated but opposite of one another. 
\n", "In other words, whenever we observe the value $x$ for $X$ we observe the value $-x$ for $Y$.\n", "Show what the correlation coefficient would equal under this condition. \n", "\n", "### Problem Three Task 3\n", "\n", "Another way to understand the correlation coefficient is to first form the variables \n", "\n", "\\begin{align}\n", " z_{x_{i}} &= \\frac{x_{i} - \\bar{x} }{\\text{sd}(X)}\\\\\n", " z_{y_{i}} &= \\frac{y_{i} - \\bar{y} }{\\text{sd}(Y)}\n", "\\end{align}\n", "\n", "Substitute $Z_{X}$ for $X$ and $Z_{Y}$ for $Y$ in the expression $\\rho(X,Y)$.\n", "\n", "### Problem Three Task 4\n", "\n", "Let's look again at the association between age and cholesterol levels for the heart disease data. " ] }, { "cell_type": "code", "execution_count": null, "id": "37970238-caf3-4406-831d-f9d815f6cf64", "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", "fig, ax = plt.subplots()\n", "ax.scatter(heart_disease[\"age\"], heart_disease[\"chol\"])\n", "\n", "ax.set_xlabel(\"Age\")\n", "ax.set_ylabel(\"Cholesterol\")\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "4458c280-c0cc-4117-9715-5686c8e494bd", "metadata": {}, "source": [ "**Task**: Use the ```assign``` function in pandas to create two new columns:\n", "1. z_age, which will be each patient's age minus the mean age, divided by the standard deviation of age.\n", "2. z_chol, which will be each patient's cholesterol level minus the mean cholesterol level, divided by the standard deviation of cholesterol." ] }, { "cell_type": "markdown", "id": "a0c1eb70-18d6-454d-8956-b1c1e098f8a3", "metadata": {}, "source": [ "### Problem Three Task 5\n", "Use a scatter plot to visualize z_age and z_chol. " ] }, { "cell_type": "markdown", "id": "bff278b4-7010-4cfb-b80d-a1c41d130baf", "metadata": {}, "source": [ "### Problem Three Task 6\n", "Compute $\\rho$ for age and cholesterol. 
" ] }, { "cell_type": "code", "execution_count": null, "id": "4188dd52-f090-4102-b050-3928153997df", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.16" } }, "nbformat": 4, "nbformat_minor": 5 }