{ "cells": [ { "cell_type": "markdown", "id": "edcd85ee-f98b-498d-b420-5c8a9b203612", "metadata": {}, "source": [ "# Chapter 4: Split-Apply-Combine and Merging \n", "\n", "```{contents} Table of Contents\n", ":depth: 3\n", "```" ] }, { "cell_type": "markdown", "id": "4631a0fc-5f24-4a12-92a9-5387b6a77b44", "metadata": {}, "source": [ "## Split - Apply - Combine\n", "\n", "A common task in statistical programming is to apply the same function to many groups of observations and return the results. \n", "Because individual outcomes/observations correspond to rows in a dataframe, we often need an additional column (or variable) that identifies which row belongs to which group. \n", "This additional column is almost always some set of fixed possibilities. \n", "\n", "--- \n", "\n", "*Example Australian Doctor visits :* \n", "The [dataset](https://vincentarelbundock.github.io/Rdatasets/doc/AER/DoctorVisits.html) that we will explore is a cross section of information about Australian doctor visits between the years 1977 and 1978. The reported 5,190 observations correspond to patients treated within this decade and the columns refer to information like the number of visits to a physician in the past two weeks (visits), age in years divided by 100 (age), the annual income of the patient in tens of thousands, etc. \n", "\n", "---" ] }, { "cell_type": "code", "execution_count": 2, "id": "a4b1a13e-1fd1-40b9-a5ce-83810478c201", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
rownamesvisitsgenderageincomeillnessreducedhealthprivatefreepoorfreerepatnchroniclchronic
011female0.190.55141yesnononono
121female0.190.45121yesnononono
231male0.190.90300nonononono
341male0.190.15100nonononono
451male0.190.45251nononoyesno
..........................................
518551860female0.220.55000nonononono
518651870male0.271.30001nonononono
518751880female0.370.25101nonoyesnono
518851890female0.520.65000nonononono
518951900male0.720.25000nonoyesnono
\n", "

5190 rows × 13 columns

\n", "
" ], "text/plain": [ " rownames visits gender age income illness reduced health \\\n", "0 1 1 female 0.19 0.55 1 4 1 \n", "1 2 1 female 0.19 0.45 1 2 1 \n", "2 3 1 male 0.19 0.90 3 0 0 \n", "3 4 1 male 0.19 0.15 1 0 0 \n", "4 5 1 male 0.19 0.45 2 5 1 \n", "... ... ... ... ... ... ... ... ... \n", "5185 5186 0 female 0.22 0.55 0 0 0 \n", "5186 5187 0 male 0.27 1.30 0 0 1 \n", "5187 5188 0 female 0.37 0.25 1 0 1 \n", "5188 5189 0 female 0.52 0.65 0 0 0 \n", "5189 5190 0 male 0.72 0.25 0 0 0 \n", "\n", " private freepoor freerepat nchronic lchronic \n", "0 yes no no no no \n", "1 yes no no no no \n", "2 no no no no no \n", "3 no no no no no \n", "4 no no no yes no \n", "... ... ... ... ... ... \n", "5185 no no no no no \n", "5186 no no no no no \n", "5187 no no yes no no \n", "5188 no no no no no \n", "5189 no no yes no no \n", "\n", "[5190 rows x 13 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "import numpy as np \n", "\n", "doctorVisits = pd.read_csv(\"https://vincentarelbundock.github.io/Rdatasets/csv/AER/DoctorVisits.csv\")\n", "doctorVisits" ] }, { "cell_type": "markdown", "id": "e6ea89de-c4db-479a-8f13-53ec32478c6b", "metadata": {}, "source": [ "We can see that there exist specific column that can group together rows. \n", "For example, the variable ```private``` has values \"yes\" and \"no\" that correspond to whether the patient does or does not have private health insurance. \n", "This means that we can split pour dataframe into those who have private insurance and those who do not. " ] }, { "cell_type": "code", "execution_count": 3, "id": "4b4f53ba-741d-49c7-a5bd-8ecda1123f18", "metadata": {}, "outputs": [], "source": [ "private_insurance = doctorVisits.loc[doctorVisits.private ==\"yes\"]\n", "no_private_insurance = doctorVisits.loc[doctorVisits[\"private\"]==\"no\"]" ] }, { "cell_type": "markdown", "id": "9285d47c-7c1c-4993-8557-a02732a2fdb3", "metadata": {}, "source": [ "**Note:** the two different ways to refer to a variable above. \n", "We have discussed before that you can refer to a variable using square brackets and the name of the variable enclosed in quotes. \n", "However, when a dataframe is created every column name is added as an attribute. \n", "That means we can refer to the column ```private``` using either ```doctorVisits[\"private\"]``` or ```doctorVisits.private```. " ] }, { "cell_type": "markdown", "id": "fba22f53-e3a6-49ca-8488-35dc05e3fa87", "metadata": {}, "source": [ "Now that our observations have been split into groups, we may want to apply a function that computes the mean and standard deviation of the number of visits. \n", "We will create a function called ```summarize_visits``` that inputs a dataframe and outputs a dictionary with those two summary metrics. " ] }, { "cell_type": "code", "execution_count": 4, "id": "f726f47e-6789-46f0-a563-9486d9fad3de", "metadata": {}, "outputs": [], "source": [ "def summarize_visits(d,label):\n", " import numpy as np \n", " mean_visits = np.mean(d.visits)\n", " sd_visits = np.std(d.visits)\n", " return pd.Series({\"group\": label, \"mean_visits\": mean_visits, \"sd_visits\":sd_visits})\n", "\n", "#--lets apply the above function to both data frames above\n", "summary__private = summarize_visits(private_insurance ,\"yes\")\n", "summary__no_private = summarize_visits(no_private_insurance,\"no\")" ] }, { "cell_type": "markdown", "id": "e9f66c54-6120-4cee-87f1-37168c00944a", "metadata": {}, "source": [ "Lets look at one of these Series to get a feel for what they look like. " ] }, { "cell_type": "code", "execution_count": 5, "id": "af2296e8-b3da-4539-834a-603facecc35d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "group yes\n", "mean_visits 0.294604\n", "sd_visits 0.771995\n", "dtype: object" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "summary__private" ] }, { "cell_type": "markdown", "id": "956ffe0b-80b7-4cfe-b556-38e54a483fbe", "metadata": {}, "source": [ "The final step would be to combine our results from the different groups. " ] }, { "cell_type": "code", "execution_count": 6, "id": "35a69598-7ef4-443f-bbfa-3b4151fd90c0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
groupmean_visitssd_visits
0yes0.2946040.771995
1no0.3074000.818130
\n", "
" ], "text/plain": [ " group mean_visits sd_visits\n", "0 yes 0.294604 0.771995\n", "1 no 0.307400 0.818130" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "combine = pd.DataFrame([summary__private, summary__no_private])\n", "combine" ] }, { "cell_type": "markdown", "id": "9344fb21-8b54-4734-a086-0271a8002de4", "metadata": {}, "source": [ "### Groupby as a (much easier) implementation of the Split-Apply-Combine paradigm\n", "\n", "The above procedure has been used so many times that computer scientists decided to optimize this process.\n", "The code to implement the procedure is easier and will compute faster than if we implemented the above three steps. \n", "For Python, we use **groupby**. \n", "\n", "Given a dataframe, $\\mathcal{F}$, **Groupby** takes two inputs: (1) a list of column names for how to define a group. \n", "For a list of column names $[c_{1},c_{2},\\cdots,c_{n}]$, define the set of unique values in column $c_{1}$ as $C_{1}$, the unique values in $c_{2}$ as $C_{2}$, etc. \n", "Then a group is an element of $C_{1} \\times C_{2} \\times \\cdots \\times C_{n}$.\n", "Where $A \\times B = \\{ (x,y): x \\in A \\text{ and } y \\in B \\}$ is called the Cartesian product of $A$ and $B$. \n", "\n", "Input (2) is a function to apply to each group.\n", "This function should input a dataframe and return any Python object.\n", "\n", "Lets try to use thie groupby technique to return the mean and standard deviation of the number of visist, stratified by private insurance. " ] }, { "cell_type": "code", "execution_count": 7, "id": "7012f9f1-9788-4b52-b63a-8d6cb637a23f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
mean_visitssd_visits
private
no0.3074000.818130
yes0.2946040.771995
\n", "
" ], "text/plain": [ " mean_visits sd_visits\n", "private \n", "no 0.307400 0.818130\n", "yes 0.294604 0.771995" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def summarize_visits(d):\n", " import numpy as np \n", " mean_visits = np.mean(d.visits)\n", " sd_visits = np.std(d.visits)\n", " return pd.Series({\"mean_visits\": mean_visits, \"sd_visits\":sd_visits})\n", "\n", "doctorVisits.groupby([\"private\"]).apply( summarize_visits,include_groups=False )" ] }, { "cell_type": "markdown", "id": "770d5ba6-5d69-46be-81cb-d740704e44f7", "metadata": {}, "source": [ "## Merging\n", "\n", "WHen you want to combine two or more datasets this is called merging. \n", "There are several types of merges, but they all rely on the same process. \n", "Given a dataset $D_{1}$ and a second dataset $D_{2}$, we want to combine them into a dataset $D$ such that \n", "1. The same observation in $D_{1}$ is matched with $D_{2}$\n", "2. The columns in $D_{1}$ are retained and the columns in $D_{2}$ are retained. \n", "\n", "The two most common types of merges are 1-1 merging and 1-manymerging. \n", "\n", "## 1-1 merging\n", "\n", "A 1-1 merge assumes that there is the same, unique observation present in both $D_{1}$ and $D_{2}$.\n", "For our example we'll use a, non-health, but instructive, database called [MovieLens](https://grouplens.org/datasets/movielens/).\n", "\n", "The MovieLens database contains four datasets: movies, links, ratings, and tags.\n", "Lets look at the movies and links datasets. " ] }, { "cell_type": "code", "execution_count": 12, "id": "3b3589d9-171e-4aa0-814b-3d6e69e781e7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
movieIdtitlegenres
01Toy Story (1995)Adventure|Animation|Children|Comedy|Fantasy
12Jumanji (1995)Adventure|Children|Fantasy
23Grumpier Old Men (1995)Comedy|Romance
34Waiting to Exhale (1995)Comedy|Drama|Romance
45Father of the Bride Part II (1995)Comedy
............
87580292731The Monroy Affaire (2022)Drama
87581292737Shelter in Solitude (2023)Comedy|Drama
87582292753Orca (2023)Drama
87583292755The Angry Breed (1968)Drama
87584292757Race to the Summit (2023)Action|Adventure|Documentary
\n", "

87585 rows × 3 columns

\n", "
" ], "text/plain": [ " movieId title \\\n", "0 1 Toy Story (1995) \n", "1 2 Jumanji (1995) \n", "2 3 Grumpier Old Men (1995) \n", "3 4 Waiting to Exhale (1995) \n", "4 5 Father of the Bride Part II (1995) \n", "... ... ... \n", "87580 292731 The Monroy Affaire (2022) \n", "87581 292737 Shelter in Solitude (2023) \n", "87582 292753 Orca (2023) \n", "87583 292755 The Angry Breed (1968) \n", "87584 292757 Race to the Summit (2023) \n", "\n", " genres \n", "0 Adventure|Animation|Children|Comedy|Fantasy \n", "1 Adventure|Children|Fantasy \n", "2 Comedy|Romance \n", "3 Comedy|Drama|Romance \n", "4 Comedy \n", "... ... \n", "87580 Drama \n", "87581 Comedy|Drama \n", "87582 Drama \n", "87583 Drama \n", "87584 Action|Adventure|Documentary \n", "\n", "[87585 rows x 3 columns]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "movies = pd.read_csv(\"./movies.csv\")\n", "movies" ] }, { "cell_type": "code", "execution_count": 13, "id": "7edb9099-c132-4b97-bf9c-c73dfe74dde1", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
movieIdimdbIdtmdbId
01114709862.0
121134978844.0
2311322815602.0
3411488531357.0
4511304111862.0
............
87580292731268125101032473.0
8758129273714907358986674.0
8758229275312388280948139.0
8758329275564027182776.0
87584292757289955661174725.0
\n", "

87585 rows × 3 columns

\n", "
" ], "text/plain": [ " movieId imdbId tmdbId\n", "0 1 114709 862.0\n", "1 2 113497 8844.0\n", "2 3 113228 15602.0\n", "3 4 114885 31357.0\n", "4 5 113041 11862.0\n", "... ... ... ...\n", "87580 292731 26812510 1032473.0\n", "87581 292737 14907358 986674.0\n", "87582 292753 12388280 948139.0\n", "87583 292755 64027 182776.0\n", "87584 292757 28995566 1174725.0\n", "\n", "[87585 rows x 3 columns]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "links = pd.read_csv(\"./links.csv\")\n", "links" ] }, { "cell_type": "markdown", "id": "ae4e6551-eaaf-4c05-bfca-ddaa0a552206", "metadata": {}, "source": [ "Suppose that we want to \"add in \" the columns from the links dataset with the movies dataset. \n", "That is, the experiment here is the production of a move and a single \"observation\" in this experiment is one movie. \n", "\n", "Pandas has a command called `merge` that can combine these datasets.\n", "To merge, we will need to provide dataset one, dataset two, and the columns that are used in both datasets to identify on, unique observation. " ] }, { "cell_type": "code", "execution_count": 14, "id": "e16a1c26-79ae-48b3-9e87-38159d69bdec", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
movieIdtitlegenresimdbIdtmdbId
01Toy Story (1995)Adventure|Animation|Children|Comedy|Fantasy114709862.0
12Jumanji (1995)Adventure|Children|Fantasy1134978844.0
23Grumpier Old Men (1995)Comedy|Romance11322815602.0
34Waiting to Exhale (1995)Comedy|Drama|Romance11488531357.0
45Father of the Bride Part II (1995)Comedy11304111862.0
..................
87580292731The Monroy Affaire (2022)Drama268125101032473.0
87581292737Shelter in Solitude (2023)Comedy|Drama14907358986674.0
87582292753Orca (2023)Drama12388280948139.0
87583292755The Angry Breed (1968)Drama64027182776.0
87584292757Race to the Summit (2023)Action|Adventure|Documentary289955661174725.0
\n", "

87585 rows × 5 columns

\n", "
" ], "text/plain": [ " movieId title \\\n", "0 1 Toy Story (1995) \n", "1 2 Jumanji (1995) \n", "2 3 Grumpier Old Men (1995) \n", "3 4 Waiting to Exhale (1995) \n", "4 5 Father of the Bride Part II (1995) \n", "... ... ... \n", "87580 292731 The Monroy Affaire (2022) \n", "87581 292737 Shelter in Solitude (2023) \n", "87582 292753 Orca (2023) \n", "87583 292755 The Angry Breed (1968) \n", "87584 292757 Race to the Summit (2023) \n", "\n", " genres imdbId tmdbId \n", "0 Adventure|Animation|Children|Comedy|Fantasy 114709 862.0 \n", "1 Adventure|Children|Fantasy 113497 8844.0 \n", "2 Comedy|Romance 113228 15602.0 \n", "3 Comedy|Drama|Romance 114885 31357.0 \n", "4 Comedy 113041 11862.0 \n", "... ... ... ... \n", "87580 Drama 26812510 1032473.0 \n", "87581 Comedy|Drama 14907358 986674.0 \n", "87582 Drama 12388280 948139.0 \n", "87583 Drama 64027 182776.0 \n", "87584 Action|Adventure|Documentary 28995566 1174725.0 \n", "\n", "[87585 rows x 5 columns]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#d1----------d2-----how to define a unique obs\n", "movies.merge(links , on = [\"movieId\"] )" ] }, { "cell_type": "markdown", "id": "aef1224c-67a2-4d25-ac80-9026ac21ae98", "metadata": {}, "source": [ "We can see that our merge produced a new, combined, dataset: one with all the columns in `movies` plus all the columns in `links`. **Important:** it is crucial to check that the number of rows in the original and final datasets match what you would expect. Here we expected that there would exist a match in each dataset. What happens if that isnt the case? \n", "\n", "## Matching (inner, left, right)\n", "Lets remove all but 600 movies from the links dataset and see what happens when we merge this time. " ] }, { "cell_type": "code", "execution_count": 15, "id": "8c4abc03-1e92-496b-86c6-9c5a0c1e30f0", "metadata": {}, "outputs": [], "source": [ "import numpy as np \n", "pick_600_rows = np.random.choice( len(links), 600 ) #<-- seleect 600 rows at random\n", "_600_links = links.iloc[ pick_600_rows ] #<--use iloc to pick 600" ] }, { "cell_type": "code", "execution_count": 16, "id": "00e16c2c-5544-4e54-91f0-c16e874c1065", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
movieIdtitlegenresimdbIdtmdbId
095Broken Arrow (1996)Action|Adventure|Thriller1157599208.0
1202Total Eclipse (1995)Drama|Romance11470236834.0
2220Castle Freak (1995)Horror11264318256.0
3308Three Colors: White (Trzy kolory: Bialy) (1994)Comedy|Drama111507109.0
4529Searching for Bobby Fischer (1993)Drama10806514291.0
..................
595290313Turtle Vision (1991)Drama|Horror103142254874.0
596290615The Ones You Didn’t Burn (2022)Horror14080416974264.0
597290783The Princesses and the Mysterious Guest (2023)Adventure|Animation267612661078843.0
598291142Born Again (2015)Drama4940528420085.0
599291310Moosie (1999)Adventure|Children165885304696.0
\n", "

600 rows × 5 columns

\n", "
" ], "text/plain": [ " movieId title \\\n", "0 95 Broken Arrow (1996) \n", "1 202 Total Eclipse (1995) \n", "2 220 Castle Freak (1995) \n", "3 308 Three Colors: White (Trzy kolory: Bialy) (1994) \n", "4 529 Searching for Bobby Fischer (1993) \n", ".. ... ... \n", "595 290313 Turtle Vision (1991) \n", "596 290615 The Ones You Didn’t Burn (2022) \n", "597 290783 The Princesses and the Mysterious Guest (2023) \n", "598 291142 Born Again (2015) \n", "599 291310 Moosie (1999) \n", "\n", " genres imdbId tmdbId \n", "0 Action|Adventure|Thriller 115759 9208.0 \n", "1 Drama|Romance 114702 36834.0 \n", "2 Horror 112643 18256.0 \n", "3 Comedy|Drama 111507 109.0 \n", "4 Drama 108065 14291.0 \n", ".. ... ... ... \n", "595 Drama|Horror 103142 254874.0 \n", "596 Horror 14080416 974264.0 \n", "597 Adventure|Animation 26761266 1078843.0 \n", "598 Drama 4940528 420085.0 \n", "599 Adventure|Children 165885 304696.0 \n", "\n", "[600 rows x 5 columns]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "movies.merge(_600_links , on = [\"movieId\"] )" ] }, { "cell_type": "markdown", "id": "3df3dc78-65ac-4903-beb2-27c8653bf199", "metadata": {}, "source": [ "What happened? Even though the movies dataset had 87,585 movies, only 600 appear in our merge!\n", "Where are our movies? \n", "The default behavior for merge is to run what is called an \"inner merge\".\n", "\n", "An inner merge looks at the unique observations in $D_{1}$, finds those observations that match in $D_{2}$ and the nonly returns those observations that appear in **both** datasets. " ] }, { "cell_type": "markdown", "id": "dd3c84db-d496-4a3b-a986-d05df5ee36db", "metadata": {}, "source": [ "### Left Merge\n", "\n", "A \"left\" merge will keep all observations from $D_{1}$ and add-in matching observations from $D_{2}$.\n", "Columns from $D_{2}$ will be combined with columns in $D_{1}$. \n", "If there is no matching observation in $D_{2}$ then the columns that correspond to $D_{2}$ will be filled with the value ```nan```. \n", "\n", "An example is useful to illustrate the impact of a left merge.\n", "To run a left merge in Python we use the same merge command and add the keyword ```how=\"left\"```." ] }, { "cell_type": "code", "execution_count": 17, "id": "96138142-3fb9-43cd-acde-f228b49025bd", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
movieIdtitlegenresimdbIdtmdbId
315319Shallow Grave (1994)Comedy|Drama|ThrillerNaNNaN
316320Suture (1993)Film-Noir|ThrillerNaNNaN
317321Strawberry and Chocolate (Fresa y chocolate) (...DramaNaNNaN
318322Swimming with Sharks (1995)Comedy|DramaNaNNaN
319324Sum of Us, The (1994)Comedy|DramaNaNNaN
320325National Lampoon's Senior Trip (1995)ComedyNaNNaN
321326To Live (Huozhe) (1994)DramaNaNNaN
322327Tank Girl (1995)Action|Comedy|Sci-FiNaNNaN
323328Tales from the Crypt Presents: Demon Knight (1...Horror|ThrillerNaNNaN
324329Star Trek: Generations (1994)Adventure|Drama|Sci-FiNaNNaN
\n", "
" ], "text/plain": [ " movieId title \\\n", "315 319 Shallow Grave (1994) \n", "316 320 Suture (1993) \n", "317 321 Strawberry and Chocolate (Fresa y chocolate) (... \n", "318 322 Swimming with Sharks (1995) \n", "319 324 Sum of Us, The (1994) \n", "320 325 National Lampoon's Senior Trip (1995) \n", "321 326 To Live (Huozhe) (1994) \n", "322 327 Tank Girl (1995) \n", "323 328 Tales from the Crypt Presents: Demon Knight (1... \n", "324 329 Star Trek: Generations (1994) \n", "\n", " genres imdbId tmdbId \n", "315 Comedy|Drama|Thriller NaN NaN \n", "316 Film-Noir|Thriller NaN NaN \n", "317 Drama NaN NaN \n", "318 Comedy|Drama NaN NaN \n", "319 Comedy|Drama NaN NaN \n", "320 Comedy NaN NaN \n", "321 Drama NaN NaN \n", "322 Action|Comedy|Sci-Fi NaN NaN \n", "323 Horror|Thriller NaN NaN \n", "324 Adventure|Drama|Sci-Fi NaN NaN " ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "left_merged_movies = movies.merge(_600_links , on = [\"movieId\"], how = \"left\" )\n", "left_merged_movies.iloc[315:325]" ] }, { "cell_type": "markdown", "id": "14fb5dec-f6bd-420f-ac59-7da63fef9105", "metadata": {}, "source": [ "In the above merge, $D_{1}$ contains all of the observations with movie ids 319,320,...329. \n", "However, the dataset $D_{2}$ only contains the observation corresponding to movie id 320. \n", "Since there is a match for movie id 320 the columns in $D_{2}$ (imdbId\ttmdbId) are combined with $D_{1}$.\n", "Since, for example, there is no movie id 319 in $D_{2}$ the columns imdbId\ttmdbId are filled with ```nan```.\n", "\n", "### Right merge\n", "A right merge behaves exactly the same as a left merge with on exception.\n", "The second dataset in the merge ($D_{2}$) is the dataset that contains all observations and any observatins in $D_{1}$ that match $D_{2}$ will be added to $D_{2}$." ] }, { "cell_type": "code", "execution_count": 18, "id": "1820c72c-71d1-446a-baa7-da9152de4339", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
movieIdtitlegenresimdbIdtmdbId
31548231Taxidermia (2006)Comedy|Drama|Horror41073015742.0
316225924The Call (2020)(no genres listed)12971924746817.0
31795567People Like Us (2012)Drama171677798548.0
318157085O Coronel e o Lobisomem (2005)Comedy45611729009.0
31989924I Want Candy (2007)Comedy79130913687.0
320187179Kyrsyä: Tuftland (2018)Horror|Mystery5104330484237.0
321279192Simchas and Sorrows (2022)Comedy|Drama14871354872201.0
322286839Personality Crisis: One Night Only (2022)Documentary126678381013574.0
3233499Misery (1990)Drama|Horror|Thriller1001571700.0
324220370The Dalai Lama: ScientistDocumentary7749892689200.0
\n", "
" ], "text/plain": [ " movieId title \\\n", "315 48231 Taxidermia (2006) \n", "316 225924 The Call (2020) \n", "317 95567 People Like Us (2012) \n", "318 157085 O Coronel e o Lobisomem (2005) \n", "319 89924 I Want Candy (2007) \n", "320 187179 Kyrsyä: Tuftland (2018) \n", "321 279192 Simchas and Sorrows (2022) \n", "322 286839 Personality Crisis: One Night Only (2022) \n", "323 3499 Misery (1990) \n", "324 220370 The Dalai Lama: Scientist \n", "\n", " genres imdbId tmdbId \n", "315 Comedy|Drama|Horror 410730 15742.0 \n", "316 (no genres listed) 12971924 746817.0 \n", "317 Drama 1716777 98548.0 \n", "318 Comedy 456117 29009.0 \n", "319 Comedy 791309 13687.0 \n", "320 Horror|Mystery 5104330 484237.0 \n", "321 Comedy|Drama 14871354 872201.0 \n", "322 Documentary 12667838 1013574.0 \n", "323 Drama|Horror|Thriller 100157 1700.0 \n", "324 Documentary 7749892 689200.0 " ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "right_merged_movies = movies.merge(_600_links , on = [\"movieId\"], how = \"right\" )\n", "right_merged_movies.iloc[315:325]" ] }, { "cell_type": "markdown", "id": "b0a19fa9-4c9a-46b9-a489-540caf58b52f", "metadata": {}, "source": [ "## In class excercise and exploration" ] }, { "cell_type": "code", "execution_count": 19, "id": "b8092fa6-4997-4c81-ba3b-49adc53c27af", "metadata": {}, "outputs": [], "source": [ "covid= pd.read_csv(\"https://data.cdc.gov/resource/pwn4-m3yp.csv?$limit=500000\")\n", "\n", "#--NOT NEEDED FOR LESSON, this code matches weeks between covid and flu---------------------------------\n", "def from_day_to_closest_saturday(x):\n", " from datetime import datetime, timedelta\n", " day = datetime.strptime(x.split(\"T\")[0], \"%Y-%m-%d\")\n", " while day.weekday()!=5:\n", " day = day + timedelta(days=1)\n", " return day.strftime(\"%Y-%m-%d\")\n", "end_of_weeks = [from_day_to_closest_saturday(x) for x in covid.end_date]\n", "\n", "covid[\"end_of_weeks\"] = end_of_weeks\n", "#------------------------------------------------------------\n", "\n", "flu = pd.read_csv(\"https://raw.githubusercontent.com/cdcepi/FluSight-forecast-hub/refs/heads/main/target-data/target-hospital-admissions.csv\")" ] }, { "cell_type": "markdown", "id": "c5ebe1fe-bd58-4b85-b9d6-8afd6422de92", "metadata": {}, "source": [ "## Exercises\n", "1. Subset the COVID dataset to one state (your choice)\n", "2. Subset the Flu dataset to one state (same state as the COVID dataset)\n", "3. Plot the number of new: cases and deaths due to COVID in a single plot with two \"axes\"\n", "4. Plot the number of incident hospitalizations (value column) per week due to Flu\n", "5. Merge together the COVID and Flu data based on week. This will be the ```end_of_weeks``` variable in the COVID dataset and ```date``` variable in Flu\n", "6. Plot the number of new cases due to COVID and the number of new cases due to Flu on a single plot.\n", "7. Realize that this isnt very helpful\n", "8. Read this [documentation](https://matplotlib.org/stable/gallery/subplots_axes_and_figures/two_scales.html#sphx-glr-gallery-subplots-axes-and-figures-two-scales-py) and twin your axes\n", "9. Plot COVID on one axis and Flu on the twin" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.8" } }, "nbformat": 4, "nbformat_minor": 5 }