{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "edcd85ee-f98b-498d-b420-5c8a9b203612",
   "metadata": {},
   "source": [
    "# Chapter 4: Split-Apply-Combine and Merging \n",
    "\n",
    "```{contents} Table of Contents\n",
    ":depth: 3\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4631a0fc-5f24-4a12-92a9-5387b6a77b44",
   "metadata": {},
   "source": [
    "## Split - Apply - Combine\n",
    "\n",
    "A common task in statistical programming is to apply the same function to many groups of observations and return the results. \n",
    "Because individual outcomes/observations correspond to rows in a dataframe, we often need an additional column (or variable) that identifies which row belongs to which group. \n",
    "This additional column is almost always some set of fixed possibilities. \n",
    "\n",
    "--- \n",
    "\n",
    "*Example Australian Doctor visits :* \n",
    "The [dataset](https://vincentarelbundock.github.io/Rdatasets/doc/AER/DoctorVisits.html) that we will explore is a cross section of information about Australian doctor visits between the years 1977 and 1978. The reported 5,190 observations correspond to patients treated within this decade and the columns refer to information like the number of visits to a physician in the past two weeks (visits), age in years divided by 100 (age), the annual income of the patient in tens of thousands, etc. \n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "a4b1a13e-1fd1-40b9-a5ce-83810478c201",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>rownames</th>\n",
       "      <th>visits</th>\n",
       "      <th>gender</th>\n",
       "      <th>age</th>\n",
       "      <th>income</th>\n",
       "      <th>illness</th>\n",
       "      <th>reduced</th>\n",
       "      <th>health</th>\n",
       "      <th>private</th>\n",
       "      <th>freepoor</th>\n",
       "      <th>freerepat</th>\n",
       "      <th>nchronic</th>\n",
       "      <th>lchronic</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>female</td>\n",
       "      <td>0.19</td>\n",
       "      <td>0.55</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>yes</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>female</td>\n",
       "      <td>0.19</td>\n",
       "      <td>0.45</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>yes</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>male</td>\n",
       "      <td>0.19</td>\n",
       "      <td>0.90</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>male</td>\n",
       "      <td>0.19</td>\n",
       "      <td>0.15</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>1</td>\n",
       "      <td>male</td>\n",
       "      <td>0.19</td>\n",
       "      <td>0.45</td>\n",
       "      <td>2</td>\n",
       "      <td>5</td>\n",
       "      <td>1</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>yes</td>\n",
       "      <td>no</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5185</th>\n",
       "      <td>5186</td>\n",
       "      <td>0</td>\n",
       "      <td>female</td>\n",
       "      <td>0.22</td>\n",
       "      <td>0.55</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5186</th>\n",
       "      <td>5187</td>\n",
       "      <td>0</td>\n",
       "      <td>male</td>\n",
       "      <td>0.27</td>\n",
       "      <td>1.30</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5187</th>\n",
       "      <td>5188</td>\n",
       "      <td>0</td>\n",
       "      <td>female</td>\n",
       "      <td>0.37</td>\n",
       "      <td>0.25</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>yes</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5188</th>\n",
       "      <td>5189</td>\n",
       "      <td>0</td>\n",
       "      <td>female</td>\n",
       "      <td>0.52</td>\n",
       "      <td>0.65</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5189</th>\n",
       "      <td>5190</td>\n",
       "      <td>0</td>\n",
       "      <td>male</td>\n",
       "      <td>0.72</td>\n",
       "      <td>0.25</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "      <td>yes</td>\n",
       "      <td>no</td>\n",
       "      <td>no</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5190 rows × 13 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      rownames  visits  gender   age  income  illness  reduced  health  \\\n",
       "0            1       1  female  0.19    0.55        1        4       1   \n",
       "1            2       1  female  0.19    0.45        1        2       1   \n",
       "2            3       1    male  0.19    0.90        3        0       0   \n",
       "3            4       1    male  0.19    0.15        1        0       0   \n",
       "4            5       1    male  0.19    0.45        2        5       1   \n",
       "...        ...     ...     ...   ...     ...      ...      ...     ...   \n",
       "5185      5186       0  female  0.22    0.55        0        0       0   \n",
       "5186      5187       0    male  0.27    1.30        0        0       1   \n",
       "5187      5188       0  female  0.37    0.25        1        0       1   \n",
       "5188      5189       0  female  0.52    0.65        0        0       0   \n",
       "5189      5190       0    male  0.72    0.25        0        0       0   \n",
       "\n",
       "     private freepoor freerepat nchronic lchronic  \n",
       "0        yes       no        no       no       no  \n",
       "1        yes       no        no       no       no  \n",
       "2         no       no        no       no       no  \n",
       "3         no       no        no       no       no  \n",
       "4         no       no        no      yes       no  \n",
       "...      ...      ...       ...      ...      ...  \n",
       "5185      no       no        no       no       no  \n",
       "5186      no       no        no       no       no  \n",
       "5187      no       no       yes       no       no  \n",
       "5188      no       no        no       no       no  \n",
       "5189      no       no       yes       no       no  \n",
       "\n",
       "[5190 rows x 13 columns]"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import numpy as np \n",
    "\n",
    "doctorVisits = pd.read_csv(\"https://vincentarelbundock.github.io/Rdatasets/csv/AER/DoctorVisits.csv\")\n",
    "doctorVisits"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e6ea89de-c4db-479a-8f13-53ec32478c6b",
   "metadata": {},
   "source": [
    "We can see that there exist specific column that can group together rows. \n",
    "For example, the variable ```private``` has values \"yes\" and \"no\" that correspond to whether the patient does or does not have private health insurance. \n",
    "This means that we can split pour dataframe into those who have private insurance and those who do not. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "4b4f53ba-741d-49c7-a5bd-8ecda1123f18",
   "metadata": {},
   "outputs": [],
   "source": [
    "private_insurance    = doctorVisits.loc[doctorVisits.private   ==\"yes\"]\n",
    "no_private_insurance = doctorVisits.loc[doctorVisits[\"private\"]==\"no\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9285d47c-7c1c-4993-8557-a02732a2fdb3",
   "metadata": {},
   "source": [
    "**Note:** the two different ways to refer to a variable above. \n",
    "We have discussed before that you can refer to a variable using square brackets and the name of the variable enclosed in quotes. \n",
    "However, when a dataframe is created every column name is added as an attribute. \n",
    "That means we can refer to the column ```private``` using either ```doctorVisits[\"private\"]``` or ```doctorVisits.private```. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fba22f53-e3a6-49ca-8488-35dc05e3fa87",
   "metadata": {},
   "source": [
    "Now that our observations have been split into groups, we may want to apply a function that computes the mean and standard deviation of the number of visits. \n",
    "We will create a function called ```summarize_visits``` that inputs a dataframe and outputs a dictionary with those two summary metrics. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "f726f47e-6789-46f0-a563-9486d9fad3de",
   "metadata": {},
   "outputs": [],
   "source": [
    "def summarize_visits(d,label):\n",
    "    import numpy as np \n",
    "    mean_visits = np.mean(d.visits)\n",
    "    sd_visits   = np.std(d.visits)\n",
    "    return pd.Series({\"group\": label,  \"mean_visits\": mean_visits, \"sd_visits\":sd_visits})\n",
    "\n",
    "#--lets apply the above function to both data frames above\n",
    "summary__private    = summarize_visits(private_insurance   ,\"yes\")\n",
    "summary__no_private = summarize_visits(no_private_insurance,\"no\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e9f66c54-6120-4cee-87f1-37168c00944a",
   "metadata": {},
   "source": [
    "Lets look at one of these Series to get a feel for what they look like. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "af2296e8-b3da-4539-834a-603facecc35d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "group               yes\n",
       "mean_visits    0.294604\n",
       "sd_visits      0.771995\n",
       "dtype: object"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "summary__private"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "956ffe0b-80b7-4cfe-b556-38e54a483fbe",
   "metadata": {},
   "source": [
    "The final step would be to combine our results from the different groups. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "35a69598-7ef4-443f-bbfa-3b4151fd90c0",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>group</th>\n",
       "      <th>mean_visits</th>\n",
       "      <th>sd_visits</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>yes</td>\n",
       "      <td>0.294604</td>\n",
       "      <td>0.771995</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>no</td>\n",
       "      <td>0.307400</td>\n",
       "      <td>0.818130</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  group  mean_visits  sd_visits\n",
       "0   yes     0.294604   0.771995\n",
       "1    no     0.307400   0.818130"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "combine = pd.DataFrame([summary__private, summary__no_private])\n",
    "combine"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9344fb21-8b54-4734-a086-0271a8002de4",
   "metadata": {},
   "source": [
    "### Groupby as a (much easier) implementation of the Split-Apply-Combine paradigm\n",
    "\n",
    "The above procedure has been used so many times that computer scientists decided to optimize this process.\n",
    "The code to implement the procedure is easier and will compute faster than if we implemented the above three steps. \n",
    "For Python, we use **groupby**. \n",
    "\n",
    "Given a dataframe, $\\mathcal{F}$, **Groupby** takes two inputs: (1) a list of column names for how to define a group. \n",
    "For a list of column names $[c_{1},c_{2},\\cdots,c_{n}]$, define the set of unique values in column $c_{1}$ as $C_{1}$, the unique values in $c_{2}$ as $C_{2}$, etc. \n",
    "Then a group is an element of $C_{1} \\times C_{2} \\times \\cdots \\times C_{n}$.\n",
    "Where $A \\times B = \\{ (x,y): x \\in A \\text{ and } y \\in B \\}$ is called the Cartesian product of $A$ and $B$. \n",
    "\n",
    "Input (2) is a function to apply to each group.\n",
    "This function should input a dataframe and return any Python object.\n",
    "\n",
    "Lets try to use thie groupby technique to return the mean and standard deviation of the number of visist, stratified by private insurance. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "7012f9f1-9788-4b52-b63a-8d6cb637a23f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>mean_visits</th>\n",
       "      <th>sd_visits</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>private</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>no</th>\n",
       "      <td>0.307400</td>\n",
       "      <td>0.818130</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>yes</th>\n",
       "      <td>0.294604</td>\n",
       "      <td>0.771995</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         mean_visits  sd_visits\n",
       "private                        \n",
       "no          0.307400   0.818130\n",
       "yes         0.294604   0.771995"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def summarize_visits(d):\n",
    "    import numpy as np \n",
    "    mean_visits = np.mean(d.visits)\n",
    "    sd_visits   = np.std(d.visits)\n",
    "    return pd.Series({\"mean_visits\": mean_visits, \"sd_visits\":sd_visits})\n",
    "\n",
    "doctorVisits.groupby([\"private\"]).apply( summarize_visits,include_groups=False )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "770d5ba6-5d69-46be-81cb-d740704e44f7",
   "metadata": {},
   "source": [
    "## Merging\n",
    "\n",
    "WHen you want to combine two or more datasets this is called merging. \n",
    "There are several types of merges, but they all rely on the same process. \n",
    "Given a dataset $D_{1}$ and a second dataset $D_{2}$, we want to combine them into a dataset $D$ such that \n",
    "1. The same observation in $D_{1}$ is matched with $D_{2}$\n",
    "2. The columns in $D_{1}$ are retained and the columns in $D_{2}$ are retained. \n",
    "\n",
    "The two most common types of merges are 1-1 merging and 1-manymerging. \n",
    "\n",
    "## 1-1 merging\n",
    "\n",
    "A 1-1 merge assumes that there is the same, unique observation present in both $D_{1}$ and $D_{2}$.\n",
    "For our example we'll use a, non-health, but instructive, database called [MovieLens](https://grouplens.org/datasets/movielens/).\n",
    "\n",
    "The MovieLens database contains four datasets: movies, links, ratings, and tags.\n",
    "Lets look at the movies and links datasets. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "3b3589d9-171e-4aa0-814b-3d6e69e781e7",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>movieId</th>\n",
       "      <th>title</th>\n",
       "      <th>genres</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>Toy Story (1995)</td>\n",
       "      <td>Adventure|Animation|Children|Comedy|Fantasy</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>Jumanji (1995)</td>\n",
       "      <td>Adventure|Children|Fantasy</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>Grumpier Old Men (1995)</td>\n",
       "      <td>Comedy|Romance</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>Waiting to Exhale (1995)</td>\n",
       "      <td>Comedy|Drama|Romance</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>Father of the Bride Part II (1995)</td>\n",
       "      <td>Comedy</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>87580</th>\n",
       "      <td>292731</td>\n",
       "      <td>The Monroy Affaire (2022)</td>\n",
       "      <td>Drama</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>87581</th>\n",
       "      <td>292737</td>\n",
       "      <td>Shelter in Solitude (2023)</td>\n",
       "      <td>Comedy|Drama</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>87582</th>\n",
       "      <td>292753</td>\n",
       "      <td>Orca (2023)</td>\n",
       "      <td>Drama</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>87583</th>\n",
       "      <td>292755</td>\n",
       "      <td>The Angry Breed (1968)</td>\n",
       "      <td>Drama</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>87584</th>\n",
       "      <td>292757</td>\n",
       "      <td>Race to the Summit (2023)</td>\n",
       "      <td>Action|Adventure|Documentary</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>87585 rows × 3 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       movieId                               title  \\\n",
       "0            1                    Toy Story (1995)   \n",
       "1            2                      Jumanji (1995)   \n",
       "2            3             Grumpier Old Men (1995)   \n",
       "3            4            Waiting to Exhale (1995)   \n",
       "4            5  Father of the Bride Part II (1995)   \n",
       "...        ...                                 ...   \n",
       "87580   292731           The Monroy Affaire (2022)   \n",
       "87581   292737          Shelter in Solitude (2023)   \n",
       "87582   292753                         Orca (2023)   \n",
       "87583   292755              The Angry Breed (1968)   \n",
       "87584   292757           Race to the Summit (2023)   \n",
       "\n",
       "                                            genres  \n",
       "0      Adventure|Animation|Children|Comedy|Fantasy  \n",
       "1                       Adventure|Children|Fantasy  \n",
       "2                                   Comedy|Romance  \n",
       "3                             Comedy|Drama|Romance  \n",
       "4                                           Comedy  \n",
       "...                                            ...  \n",
       "87580                                        Drama  \n",
       "87581                                 Comedy|Drama  \n",
       "87582                                        Drama  \n",
       "87583                                        Drama  \n",
       "87584                 Action|Adventure|Documentary  \n",
       "\n",
       "[87585 rows x 3 columns]"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "movies = pd.read_csv(\"./movies.csv\")\n",
    "movies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "7edb9099-c132-4b97-bf9c-c73dfe74dde1",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>movieId</th>\n",
       "      <th>imdbId</th>\n",
       "      <th>tmdbId</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>114709</td>\n",
       "      <td>862.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>113497</td>\n",
       "      <td>8844.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>113228</td>\n",
       "      <td>15602.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>114885</td>\n",
       "      <td>31357.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>113041</td>\n",
       "      <td>11862.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>87580</th>\n",
       "      <td>292731</td>\n",
       "      <td>26812510</td>\n",
       "      <td>1032473.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>87581</th>\n",
       "      <td>292737</td>\n",
       "      <td>14907358</td>\n",
       "      <td>986674.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>87582</th>\n",
       "      <td>292753</td>\n",
       "      <td>12388280</td>\n",
       "      <td>948139.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>87583</th>\n",
       "      <td>292755</td>\n",
       "      <td>64027</td>\n",
       "      <td>182776.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>87584</th>\n",
       "      <td>292757</td>\n",
       "      <td>28995566</td>\n",
       "      <td>1174725.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>87585 rows × 3 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       movieId    imdbId     tmdbId\n",
       "0            1    114709      862.0\n",
       "1            2    113497     8844.0\n",
       "2            3    113228    15602.0\n",
       "3            4    114885    31357.0\n",
       "4            5    113041    11862.0\n",
       "...        ...       ...        ...\n",
       "87580   292731  26812510  1032473.0\n",
       "87581   292737  14907358   986674.0\n",
       "87582   292753  12388280   948139.0\n",
       "87583   292755     64027   182776.0\n",
       "87584   292757  28995566  1174725.0\n",
       "\n",
       "[87585 rows x 3 columns]"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "links  = pd.read_csv(\"./links.csv\")\n",
    "links"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ae4e6551-eaaf-4c05-bfca-ddaa0a552206",
   "metadata": {},
   "source": [
    "Suppose that we want to \"add in \" the columns from the links dataset with the movies dataset. \n",
    "That is, the experiment here is the production of a move and a single \"observation\" in this experiment is one movie. \n",
    "\n",
    "Pandas has a command called `merge` that can combine these datasets.\n",
    "To merge, we will need to provide dataset one, dataset two, and the columns that are used in both datasets to identify on, unique observation. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "e16a1c26-79ae-48b3-9e87-38159d69bdec",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>movieId</th>\n",
       "      <th>title</th>\n",
       "      <th>genres</th>\n",
       "      <th>imdbId</th>\n",
       "      <th>tmdbId</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>Toy Story (1995)</td>\n",
       "      <td>Adventure|Animation|Children|Comedy|Fantasy</td>\n",
       "      <td>114709</td>\n",
       "      <td>862.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>Jumanji (1995)</td>\n",
       "      <td>Adventure|Children|Fantasy</td>\n",
       "      <td>113497</td>\n",
       "      <td>8844.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>Grumpier Old Men (1995)</td>\n",
       "      <td>Comedy|Romance</td>\n",
       "      <td>113228</td>\n",
       "      <td>15602.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>Waiting to Exhale (1995)</td>\n",
       "      <td>Comedy|Drama|Romance</td>\n",
       "      <td>114885</td>\n",
       "      <td>31357.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>Father of the Bride Part II (1995)</td>\n",
       "      <td>Comedy</td>\n",
       "      <td>113041</td>\n",
       "      <td>11862.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>87580</th>\n",
       "      <td>292731</td>\n",
       "      <td>The Monroy Affaire (2022)</td>\n",
       "      <td>Drama</td>\n",
       "      <td>26812510</td>\n",
       "      <td>1032473.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>87581</th>\n",
       "      <td>292737</td>\n",
       "      <td>Shelter in Solitude (2023)</td>\n",
       "      <td>Comedy|Drama</td>\n",
       "      <td>14907358</td>\n",
       "      <td>986674.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>87582</th>\n",
       "      <td>292753</td>\n",
       "      <td>Orca (2023)</td>\n",
       "      <td>Drama</td>\n",
       "      <td>12388280</td>\n",
       "      <td>948139.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>87583</th>\n",
       "      <td>292755</td>\n",
       "      <td>The Angry Breed (1968)</td>\n",
       "      <td>Drama</td>\n",
       "      <td>64027</td>\n",
       "      <td>182776.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>87584</th>\n",
       "      <td>292757</td>\n",
       "      <td>Race to the Summit (2023)</td>\n",
       "      <td>Action|Adventure|Documentary</td>\n",
       "      <td>28995566</td>\n",
       "      <td>1174725.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>87585 rows × 5 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       movieId                               title  \\\n",
       "0            1                    Toy Story (1995)   \n",
       "1            2                      Jumanji (1995)   \n",
       "2            3             Grumpier Old Men (1995)   \n",
       "3            4            Waiting to Exhale (1995)   \n",
       "4            5  Father of the Bride Part II (1995)   \n",
       "...        ...                                 ...   \n",
       "87580   292731           The Monroy Affaire (2022)   \n",
       "87581   292737          Shelter in Solitude (2023)   \n",
       "87582   292753                         Orca (2023)   \n",
       "87583   292755              The Angry Breed (1968)   \n",
       "87584   292757           Race to the Summit (2023)   \n",
       "\n",
       "                                            genres    imdbId     tmdbId  \n",
       "0      Adventure|Animation|Children|Comedy|Fantasy    114709      862.0  \n",
       "1                       Adventure|Children|Fantasy    113497     8844.0  \n",
       "2                                   Comedy|Romance    113228    15602.0  \n",
       "3                             Comedy|Drama|Romance    114885    31357.0  \n",
       "4                                           Comedy    113041    11862.0  \n",
       "...                                            ...       ...        ...  \n",
       "87580                                        Drama  26812510  1032473.0  \n",
       "87581                                 Comedy|Drama  14907358   986674.0  \n",
       "87582                                        Drama  12388280   948139.0  \n",
       "87583                                        Drama     64027   182776.0  \n",
       "87584                 Action|Adventure|Documentary  28995566  1174725.0  \n",
       "\n",
       "[87585 rows x 5 columns]"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#d1----------d2-----how to define a unique obs\n",
    "movies.merge(links  , on = [\"movieId\"]        )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aef1224c-67a2-4d25-ac80-9026ac21ae98",
   "metadata": {},
   "source": [
    "We can see that our merge produced a new, combined, dataset: one with all the columns in `movies` plus all the columns in `links`. **Important:** it is crucial to check that the number of rows in the original and final datasets match what you would expect. Here we expected that there would exist a match in each dataset. What happens if that isnt the case? \n",
    "\n",
    "## Matching (inner, left, right)\n",
    "Lets remove all but 600 movies from the links dataset and see what happens when we merge this time. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "8c4abc03-1e92-496b-86c6-9c5a0c1e30f0",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy  as np \n",
    "pick_600_rows = np.random.choice( len(links), 600 ) #<-- seleect 600 rows at random\n",
    "_600_links = links.iloc[ pick_600_rows ]            #<--use iloc to pick 600"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "00e16c2c-5544-4e54-91f0-c16e874c1065",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>movieId</th>\n",
       "      <th>title</th>\n",
       "      <th>genres</th>\n",
       "      <th>imdbId</th>\n",
       "      <th>tmdbId</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>95</td>\n",
       "      <td>Broken Arrow (1996)</td>\n",
       "      <td>Action|Adventure|Thriller</td>\n",
       "      <td>115759</td>\n",
       "      <td>9208.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>202</td>\n",
       "      <td>Total Eclipse (1995)</td>\n",
       "      <td>Drama|Romance</td>\n",
       "      <td>114702</td>\n",
       "      <td>36834.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>220</td>\n",
       "      <td>Castle Freak (1995)</td>\n",
       "      <td>Horror</td>\n",
       "      <td>112643</td>\n",
       "      <td>18256.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>308</td>\n",
       "      <td>Three Colors: White (Trzy kolory: Bialy) (1994)</td>\n",
       "      <td>Comedy|Drama</td>\n",
       "      <td>111507</td>\n",
       "      <td>109.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>529</td>\n",
       "      <td>Searching for Bobby Fischer (1993)</td>\n",
       "      <td>Drama</td>\n",
       "      <td>108065</td>\n",
       "      <td>14291.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>595</th>\n",
       "      <td>290313</td>\n",
       "      <td>Turtle Vision (1991)</td>\n",
       "      <td>Drama|Horror</td>\n",
       "      <td>103142</td>\n",
       "      <td>254874.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>596</th>\n",
       "      <td>290615</td>\n",
       "      <td>The Ones You Didn’t Burn (2022)</td>\n",
       "      <td>Horror</td>\n",
       "      <td>14080416</td>\n",
       "      <td>974264.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>597</th>\n",
       "      <td>290783</td>\n",
       "      <td>The Princesses and the Mysterious Guest (2023)</td>\n",
       "      <td>Adventure|Animation</td>\n",
       "      <td>26761266</td>\n",
       "      <td>1078843.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>598</th>\n",
       "      <td>291142</td>\n",
       "      <td>Born Again (2015)</td>\n",
       "      <td>Drama</td>\n",
       "      <td>4940528</td>\n",
       "      <td>420085.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>599</th>\n",
       "      <td>291310</td>\n",
       "      <td>Moosie (1999)</td>\n",
       "      <td>Adventure|Children</td>\n",
       "      <td>165885</td>\n",
       "      <td>304696.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>600 rows × 5 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "     movieId                                            title  \\\n",
       "0         95                              Broken Arrow (1996)   \n",
       "1        202                             Total Eclipse (1995)   \n",
       "2        220                              Castle Freak (1995)   \n",
       "3        308  Three Colors: White (Trzy kolory: Bialy) (1994)   \n",
       "4        529               Searching for Bobby Fischer (1993)   \n",
       "..       ...                                              ...   \n",
       "595   290313                             Turtle Vision (1991)   \n",
       "596   290615                  The Ones You Didn’t Burn (2022)   \n",
       "597   290783   The Princesses and the Mysterious Guest (2023)   \n",
       "598   291142                                Born Again (2015)   \n",
       "599   291310                                    Moosie (1999)   \n",
       "\n",
       "                        genres    imdbId     tmdbId  \n",
       "0    Action|Adventure|Thriller    115759     9208.0  \n",
       "1                Drama|Romance    114702    36834.0  \n",
       "2                       Horror    112643    18256.0  \n",
       "3                 Comedy|Drama    111507      109.0  \n",
       "4                        Drama    108065    14291.0  \n",
       "..                         ...       ...        ...  \n",
       "595               Drama|Horror    103142   254874.0  \n",
       "596                     Horror  14080416   974264.0  \n",
       "597        Adventure|Animation  26761266  1078843.0  \n",
       "598                      Drama   4940528   420085.0  \n",
       "599         Adventure|Children    165885   304696.0  \n",
       "\n",
       "[600 rows x 5 columns]"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "movies.merge(_600_links  , on = [\"movieId\"]        )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3df3dc78-65ac-4903-beb2-27c8653bf199",
   "metadata": {},
   "source": [
    "What happened? Even though the movies dataset had 87,585 movies, only 600 appear in our merge!\n",
    "Where are our movies? \n",
    "The default behavior for merge is to run what is called an \"inner merge\".\n",
    "\n",
    "An inner merge looks at the unique observations in $D_{1}$, finds those observations that match in $D_{2}$ and the nonly returns those observations that appear in **both** datasets. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dd3c84db-d496-4a3b-a986-d05df5ee36db",
   "metadata": {},
   "source": [
    "### Left Merge\n",
    "\n",
    "A \"left\" merge will keep all observations from $D_{1}$ and add-in matching observations from $D_{2}$.\n",
    "Columns from $D_{2}$ will be combined with columns in $D_{1}$. \n",
    "If there is no matching observation in $D_{2}$ then the columns that correspond to $D_{2}$ will be filled with the value ```nan```. \n",
    "\n",
    "An example is useful to illustrate the impact of a left merge.\n",
    "To run a left merge in Python we use the same merge command and add the keyword ```how=\"left\"```."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "96138142-3fb9-43cd-acde-f228b49025bd",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>movieId</th>\n",
       "      <th>title</th>\n",
       "      <th>genres</th>\n",
       "      <th>imdbId</th>\n",
       "      <th>tmdbId</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>315</th>\n",
       "      <td>319</td>\n",
       "      <td>Shallow Grave (1994)</td>\n",
       "      <td>Comedy|Drama|Thriller</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>316</th>\n",
       "      <td>320</td>\n",
       "      <td>Suture (1993)</td>\n",
       "      <td>Film-Noir|Thriller</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>317</th>\n",
       "      <td>321</td>\n",
       "      <td>Strawberry and Chocolate (Fresa y chocolate) (...</td>\n",
       "      <td>Drama</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>318</th>\n",
       "      <td>322</td>\n",
       "      <td>Swimming with Sharks (1995)</td>\n",
       "      <td>Comedy|Drama</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>319</th>\n",
       "      <td>324</td>\n",
       "      <td>Sum of Us, The (1994)</td>\n",
       "      <td>Comedy|Drama</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>320</th>\n",
       "      <td>325</td>\n",
       "      <td>National Lampoon's Senior Trip (1995)</td>\n",
       "      <td>Comedy</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>321</th>\n",
       "      <td>326</td>\n",
       "      <td>To Live (Huozhe) (1994)</td>\n",
       "      <td>Drama</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>322</th>\n",
       "      <td>327</td>\n",
       "      <td>Tank Girl (1995)</td>\n",
       "      <td>Action|Comedy|Sci-Fi</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>323</th>\n",
       "      <td>328</td>\n",
       "      <td>Tales from the Crypt Presents: Demon Knight (1...</td>\n",
       "      <td>Horror|Thriller</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>324</th>\n",
       "      <td>329</td>\n",
       "      <td>Star Trek: Generations (1994)</td>\n",
       "      <td>Adventure|Drama|Sci-Fi</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     movieId                                              title  \\\n",
       "315      319                               Shallow Grave (1994)   \n",
       "316      320                                      Suture (1993)   \n",
       "317      321  Strawberry and Chocolate (Fresa y chocolate) (...   \n",
       "318      322                        Swimming with Sharks (1995)   \n",
       "319      324                              Sum of Us, The (1994)   \n",
       "320      325              National Lampoon's Senior Trip (1995)   \n",
       "321      326                            To Live (Huozhe) (1994)   \n",
       "322      327                                   Tank Girl (1995)   \n",
       "323      328  Tales from the Crypt Presents: Demon Knight (1...   \n",
       "324      329                      Star Trek: Generations (1994)   \n",
       "\n",
       "                     genres  imdbId  tmdbId  \n",
       "315   Comedy|Drama|Thriller     NaN     NaN  \n",
       "316      Film-Noir|Thriller     NaN     NaN  \n",
       "317                   Drama     NaN     NaN  \n",
       "318            Comedy|Drama     NaN     NaN  \n",
       "319            Comedy|Drama     NaN     NaN  \n",
       "320                  Comedy     NaN     NaN  \n",
       "321                   Drama     NaN     NaN  \n",
       "322    Action|Comedy|Sci-Fi     NaN     NaN  \n",
       "323         Horror|Thriller     NaN     NaN  \n",
       "324  Adventure|Drama|Sci-Fi     NaN     NaN  "
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "left_merged_movies = movies.merge(_600_links  , on = [\"movieId\"], how = \"left\"        )\n",
    "left_merged_movies.iloc[315:325]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "14fb5dec-f6bd-420f-ac59-7da63fef9105",
   "metadata": {},
   "source": [
    "In the above merge, $D_{1}$ contains all of the observations with movie ids 319,320,...329. \n",
    "However, the dataset $D_{2}$ only contains the observation corresponding to movie id 320. \n",
    "Since there is a match for movie id 320 the columns in $D_{2}$ (imdbId\ttmdbId) are combined with $D_{1}$.\n",
    "Since, for example, there is no movie id 319 in $D_{2}$ the columns imdbId\ttmdbId are filled with ```nan```.\n",
    "\n",
    "### Right merge\n",
    "A right merge behaves exactly the same as a left merge with on exception.\n",
    "The second dataset in the merge ($D_{2}$) is the dataset that contains all observations and any observatins in $D_{1}$ that match $D_{2}$ will be added to $D_{2}$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "1820c72c-71d1-446a-baa7-da9152de4339",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>movieId</th>\n",
       "      <th>title</th>\n",
       "      <th>genres</th>\n",
       "      <th>imdbId</th>\n",
       "      <th>tmdbId</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>315</th>\n",
       "      <td>48231</td>\n",
       "      <td>Taxidermia (2006)</td>\n",
       "      <td>Comedy|Drama|Horror</td>\n",
       "      <td>410730</td>\n",
       "      <td>15742.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>316</th>\n",
       "      <td>225924</td>\n",
       "      <td>The Call (2020)</td>\n",
       "      <td>(no genres listed)</td>\n",
       "      <td>12971924</td>\n",
       "      <td>746817.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>317</th>\n",
       "      <td>95567</td>\n",
       "      <td>People Like Us (2012)</td>\n",
       "      <td>Drama</td>\n",
       "      <td>1716777</td>\n",
       "      <td>98548.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>318</th>\n",
       "      <td>157085</td>\n",
       "      <td>O Coronel e o Lobisomem (2005)</td>\n",
       "      <td>Comedy</td>\n",
       "      <td>456117</td>\n",
       "      <td>29009.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>319</th>\n",
       "      <td>89924</td>\n",
       "      <td>I Want Candy (2007)</td>\n",
       "      <td>Comedy</td>\n",
       "      <td>791309</td>\n",
       "      <td>13687.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>320</th>\n",
       "      <td>187179</td>\n",
       "      <td>Kyrsyä: Tuftland (2018)</td>\n",
       "      <td>Horror|Mystery</td>\n",
       "      <td>5104330</td>\n",
       "      <td>484237.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>321</th>\n",
       "      <td>279192</td>\n",
       "      <td>Simchas and Sorrows (2022)</td>\n",
       "      <td>Comedy|Drama</td>\n",
       "      <td>14871354</td>\n",
       "      <td>872201.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>322</th>\n",
       "      <td>286839</td>\n",
       "      <td>Personality Crisis: One Night Only (2022)</td>\n",
       "      <td>Documentary</td>\n",
       "      <td>12667838</td>\n",
       "      <td>1013574.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>323</th>\n",
       "      <td>3499</td>\n",
       "      <td>Misery (1990)</td>\n",
       "      <td>Drama|Horror|Thriller</td>\n",
       "      <td>100157</td>\n",
       "      <td>1700.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>324</th>\n",
       "      <td>220370</td>\n",
       "      <td>The Dalai Lama: Scientist</td>\n",
       "      <td>Documentary</td>\n",
       "      <td>7749892</td>\n",
       "      <td>689200.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     movieId                                      title  \\\n",
       "315    48231                          Taxidermia (2006)   \n",
       "316   225924                            The Call (2020)   \n",
       "317    95567                      People Like Us (2012)   \n",
       "318   157085             O Coronel e o Lobisomem (2005)   \n",
       "319    89924                        I Want Candy (2007)   \n",
       "320   187179                    Kyrsyä: Tuftland (2018)   \n",
       "321   279192                 Simchas and Sorrows (2022)   \n",
       "322   286839  Personality Crisis: One Night Only (2022)   \n",
       "323     3499                              Misery (1990)   \n",
       "324   220370                  The Dalai Lama: Scientist   \n",
       "\n",
       "                    genres    imdbId     tmdbId  \n",
       "315    Comedy|Drama|Horror    410730    15742.0  \n",
       "316     (no genres listed)  12971924   746817.0  \n",
       "317                  Drama   1716777    98548.0  \n",
       "318                 Comedy    456117    29009.0  \n",
       "319                 Comedy    791309    13687.0  \n",
       "320         Horror|Mystery   5104330   484237.0  \n",
       "321           Comedy|Drama  14871354   872201.0  \n",
       "322            Documentary  12667838  1013574.0  \n",
       "323  Drama|Horror|Thriller    100157     1700.0  \n",
       "324            Documentary   7749892   689200.0  "
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "right_merged_movies = movies.merge(_600_links  , on = [\"movieId\"], how = \"right\"        )\n",
    "right_merged_movies.iloc[315:325]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b0a19fa9-4c9a-46b9-a489-540caf58b52f",
   "metadata": {},
   "source": [
    "## In class excercise and exploration"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "b8092fa6-4997-4c81-ba3b-49adc53c27af",
   "metadata": {},
   "outputs": [],
   "source": [
    "covid= pd.read_csv(\"https://data.cdc.gov/resource/pwn4-m3yp.csv?$limit=500000\")\n",
    "\n",
    "#--NOT NEEDED FOR LESSON, this code matches weeks between covid and flu---------------------------------\n",
    "def from_day_to_closest_saturday(x):\n",
    "    from datetime import datetime, timedelta\n",
    "    day =  datetime.strptime(x.split(\"T\")[0], \"%Y-%m-%d\")\n",
    "    while day.weekday()!=5:\n",
    "        day = day + timedelta(days=1)\n",
    "    return day.strftime(\"%Y-%m-%d\")\n",
    "end_of_weeks = [from_day_to_closest_saturday(x) for x in covid.end_date]\n",
    "\n",
    "covid[\"end_of_weeks\"] = end_of_weeks\n",
    "#------------------------------------------------------------\n",
    "\n",
    "flu = pd.read_csv(\"https://raw.githubusercontent.com/cdcepi/FluSight-forecast-hub/refs/heads/main/target-data/target-hospital-admissions.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c5ebe1fe-bd58-4b85-b9d6-8afd6422de92",
   "metadata": {},
   "source": [
    "## Exercises\n",
    "1. Subset the COVID dataset to one state (your choice)\n",
    "2. Subset the Flu dataset to one state (same state as the COVID dataset)\n",
    "3. Plot the number of new: cases and deaths due to COVID in a single plot with two \"axes\"\n",
    "4. Plot the number of incident hospitalizations (value column) per week due to Flu\n",
    "5. Merge together the COVID and Flu data based on week. This will be the ```end_of_weeks``` variable in the COVID dataset and ```date``` variable in Flu\n",
    "6. Plot the number of new cases due to COVID and the number of new cases due to Flu on a single plot.\n",
    "7. Realize that this isnt very helpful\n",
    "8. Read this [documentation](https://matplotlib.org/stable/gallery/subplots_axes_and_figures/two_scales.html#sphx-glr-gallery-subplots-axes-and-figures-two-scales-py) and twin your axes\n",
    "9. Plot COVID on one axis and Flu on the twin"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}