Chapter 03: Split-Apply-Combine and Merging#

Split - Apply - Combine#

A common task in statistical programming is to apply the same function to many groups of observations and return the results. Because individual outcomes/observations correspond to rows in a dataframe, we often need an additional column (or variable) that identifies which row belongs to which group. This additional column almost always takes its values from a small, fixed set of possibilities.


Example (Australian doctor visits): The dataset that we will explore is a cross section of information about Australian doctor visits between the years 1977 and 1978. The 5,190 observations correspond to patients treated during this period, and the columns contain information like the number of visits to a physician in the past two weeks (visits), age in years divided by 100 (age), the annual income of the patient in tens of thousands of dollars (income), etc.


import pandas as pd
import numpy as np 

doctorVisits = pd.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/AER/DoctorVisits.csv")
doctorVisits
rownames visits gender age income illness reduced health private freepoor freerepat nchronic lchronic
0 1 1 female 0.19 0.55 1 4 1 yes no no no no
1 2 1 female 0.19 0.45 1 2 1 yes no no no no
2 3 1 male 0.19 0.90 3 0 0 no no no no no
3 4 1 male 0.19 0.15 1 0 0 no no no no no
4 5 1 male 0.19 0.45 2 5 1 no no no yes no
... ... ... ... ... ... ... ... ... ... ... ... ... ...
5185 5186 0 female 0.22 0.55 0 0 0 no no no no no
5186 5187 0 male 0.27 1.30 0 0 1 no no no no no
5187 5188 0 female 0.37 0.25 1 0 1 no no yes no no
5188 5189 0 female 0.52 0.65 0 0 0 no no no no no
5189 5190 0 male 0.72 0.25 0 0 0 no no yes no no

5190 rows × 13 columns

We can see that there exist specific columns that can group together rows. For example, the variable private has values “yes” and “no” that correspond to whether the patient does or does not have private health insurance. This means that we can split our dataframe into those who have private insurance and those who do not.

private_insurance    = doctorVisits.loc[doctorVisits.private   =="yes"]
no_private_insurance = doctorVisits.loc[doctorVisits["private"]=="no"]

Note the two different ways to refer to a variable above. We have discussed before that you can refer to a variable using square brackets and the name of the variable enclosed in quotes. However, when a dataframe is created, every column name is also added as an attribute. That means we can refer to the column private using either doctorVisits["private"] or doctorVisits.private.
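A quick sketch (using a tiny made-up dataframe, not doctorVisits) showing that the two access styles return the same Series. One caveat: attribute access only works when the column name is a valid Python identifier and does not collide with an existing DataFrame attribute (e.g., a column named size), so bracket access is the safer default.

```python
import pandas as pd

# Tiny made-up dataframe to compare the two access styles
df = pd.DataFrame({"private": ["yes", "no"], "visits": [1, 0]})

# Bracket access and attribute access return the same Series
print((df["private"] == df.private).all())  # True
```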

Now that our observations have been split into groups, we may want to apply a function that computes the mean and standard deviation of the number of visits. We will create a function called summarize_visits that inputs a dataframe and outputs a pandas Series with those two summary metrics.

def summarize_visits(d, label):
    mean_visits = np.mean(d.visits)
    sd_visits   = np.std(d.visits)
    return pd.Series({"group": label, "mean_visits": mean_visits, "sd_visits": sd_visits})

#--let's apply the above function to both dataframes
summary__private    = summarize_visits(private_insurance   ,"yes")
summary__no_private = summarize_visits(no_private_insurance,"no")

Let's look at one of these Series to get a feel for what they look like.

summary__private
group               yes
mean_visits    0.294604
sd_visits      0.771995
dtype: object

The final step would be to combine our results from the different groups.

combine = pd.DataFrame([summary__private, summary__no_private])
combine
group mean_visits sd_visits
0 yes 0.294604 0.771995
1 no 0.307400 0.818130
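The same three steps generalize to a grouping variable with any number of values. As a sketch (using a small made-up dataframe in place of doctorVisits), we can loop over the unique group labels, apply our summary, and stack the results:

```python
import pandas as pd
import numpy as np

# Made-up stand-in for doctorVisits
d = pd.DataFrame({"private": ["yes", "no", "yes", "no"],
                  "visits":  [1, 0, 2, 1]})

rows = []
for label in d["private"].unique():            # split
    group = d.loc[d["private"] == label]
    rows.append({"group": label,               # apply
                 "mean_visits": np.mean(group["visits"]),
                 "sd_visits": np.std(group["visits"])})
combine = pd.DataFrame(rows)                   # combine
print(combine)
```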

Groupby as a (much easier) implementation of the Split-Apply-Combine paradigm#

The above procedure is used so often that computer scientists decided to optimize it. The optimized code is easier to write and will run faster than our manual implementation of the three steps above. In Python, we use groupby.

Given a dataframe, \(\mathcal{F}\), groupby takes two inputs: (1) a list of column names that defines the groups. For a list of column names \([c_{1},c_{2},\cdots,c_{n}]\), define the set of unique values in column \(c_{1}\) as \(C_{1}\), the unique values in \(c_{2}\) as \(C_{2}\), etc. Then a group is an element of \(C_{1} \times C_{2} \times \cdots \times C_{n}\), where \(A \times B = \{ (x,y): x \in A \text{ and } y \in B \}\) is the Cartesian product of \(A\) and \(B\).

Input (2) is a function to apply to each group. This function should input a dataframe and return any Python object.
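As a small sketch (with made-up data), grouping by two columns produces one group per pair in \(C_{1} \times C_{2}\). Note that for ordinary (non-categorical) columns pandas only forms the combinations that actually occur in the data, not the full Cartesian product:

```python
import pandas as pd

# Made-up data: grouping by two columns yields one group per observed
# (private, gender) pair
d = pd.DataFrame({"private": ["yes", "yes", "no", "no"],
                  "gender":  ["female", "male", "female", "male"],
                  "visits":  [1, 2, 0, 3]})

# size() counts the rows in each group
print(d.groupby(["private", "gender"]).size())
```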

Let's use this groupby technique to return the mean and standard deviation of the number of visits, stratified by private insurance.

def summarize_visits(d):
    mean_visits = np.mean(d.visits)
    sd_visits   = np.std(d.visits)
    return pd.Series({"mean_visits": mean_visits, "sd_visits": sd_visits})

doctorVisits.groupby(["private"]).apply( summarize_visits,include_groups=False )
mean_visits sd_visits
private
no 0.307400 0.818130
yes 0.294604 0.771995

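For simple summaries like these, pandas also offers agg as a shortcut, so you do not need a custom function at all. A sketch with made-up data; the lambda uses np.std so the result matches the custom function above (np.std uses the population formula, ddof=0, while the string "std" would use ddof=1):

```python
import pandas as pd
import numpy as np

# Made-up stand-in for doctorVisits
d = pd.DataFrame({"private": ["yes", "no", "yes", "no"],
                  "visits":  [1, 0, 2, 1]})

# Named aggregation: new_column = (source column, function)
summary = d.groupby("private").agg(
    mean_visits=("visits", "mean"),
    sd_visits=("visits", lambda v: np.std(v)),  # ddof=0, like summarize_visits
)
print(summary)
```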
Merging#

When you want to combine two or more datasets, this is called merging. There are several types of merges, but they all rely on the same process. Given a dataset \(D_{1}\) and a second dataset \(D_{2}\), we want to combine them into a dataset \(D\) such that

  1. Each observation in \(D_{1}\) is matched with the corresponding observation in \(D_{2}\), and

  2. The columns in \(D_{1}\) and the columns in \(D_{2}\) are both retained.

The two most common types of merges are 1-1 merging and 1-many merging.

1-1 merging#

A 1-1 merge assumes that the same, unique observation is present in both \(D_{1}\) and \(D_{2}\). For our example we’ll use a non-health, but instructive, database called MovieLens.

The MovieLens database contains four datasets: movies, links, ratings, and tags. Let's look at the movies and links datasets.

import pandas as pd
movies = pd.read_csv("./movies.csv")
movies
movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance
4 5 Father of the Bride Part II (1995) Comedy
... ... ... ...
87580 292731 The Monroy Affaire (2022) Drama
87581 292737 Shelter in Solitude (2023) Comedy|Drama
87582 292753 Orca (2023) Drama
87583 292755 The Angry Breed (1968) Drama
87584 292757 Race to the Summit (2023) Action|Adventure|Documentary

87585 rows × 3 columns

links  = pd.read_csv("./links.csv")
links
movieId imdbId tmdbId
0 1 114709 862.0
1 2 113497 8844.0
2 3 113228 15602.0
3 4 114885 31357.0
4 5 113041 11862.0
... ... ... ...
87580 292731 26812510 1032473.0
87581 292737 14907358 986674.0
87582 292753 12388280 948139.0
87583 292755 64027 182776.0
87584 292757 28995566 1174725.0

87585 rows × 3 columns

Suppose that we want to “add in” the columns from the links dataset to the movies dataset. That is, the experiment here is the production of a movie, and a single “observation” in this experiment is one movie.

Pandas has a command called merge that can combine these datasets. To merge, we will need to provide dataset one, dataset two, and the columns that are used in both datasets to identify one unique observation.

#d1----------d2-----how to define a unique obs
movies.merge(links  , on = ["movieId"]        )
movieId title genres imdbId tmdbId
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy 114709 862.0
1 2 Jumanji (1995) Adventure|Children|Fantasy 113497 8844.0
2 3 Grumpier Old Men (1995) Comedy|Romance 113228 15602.0
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance 114885 31357.0
4 5 Father of the Bride Part II (1995) Comedy 113041 11862.0
... ... ... ... ... ...
87580 292731 The Monroy Affaire (2022) Drama 26812510 1032473.0
87581 292737 Shelter in Solitude (2023) Comedy|Drama 14907358 986674.0
87582 292753 Orca (2023) Drama 12388280 948139.0
87583 292755 The Angry Breed (1968) Drama 64027 182776.0
87584 292757 Race to the Summit (2023) Action|Adventure|Documentary 28995566 1174725.0

87585 rows × 5 columns

We can see that our merge produced a new, combined dataset: one with all the columns in movies plus all the columns in links. Important: it is crucial to check that the number of rows in the original and final datasets match what you would expect. Here we expected that every observation would have a match in each dataset. What happens if that isn't the case?

Matching (inner, left, right)#

Let's remove all but 600 movies from the links dataset and see what happens when we merge this time.

import numpy as np
pick_600_rows = np.random.choice( len(links), 600, replace=False ) #<-- select 600 distinct rows at random
_600_links = links.iloc[ pick_600_rows ]                           #<-- use iloc to pick the 600 rows
movies.merge(_600_links  , on = ["movieId"]        )
movieId title genres imdbId tmdbId
0 81 Things to Do in Denver When You're Dead (1995) Crime|Drama|Romance 114660 400.0
1 491 Man Without a Face, The (1993) Drama 107501 10502.0
2 631 All Dogs Go to Heaven 2 (1996) Adventure|Animation|Children|Fantasy|Musical|R... 115509 19042.0
3 736 Twister (1996) Action|Adventure|Romance|Thriller 117998 664.0
4 1278 Young Frankenstein (1974) Comedy|Fantasy 72431 3034.0
... ... ... ... ... ...
595 288779 Don Camillo: Monsignor (1961) Comedy 54814 11580.0
596 289639 Mob Land (2023) Action|Crime|Thriller 20424130 979275.0
597 289937 The Gold Machine (2022) Documentary 14065482 1012073.0
598 291829 Love Is in the Air (2023) Comedy|Drama|Romance 28073548 1167725.0
599 292049 Odayaka (2012) Drama 2299491 153789.0

600 rows × 5 columns

What happened? Even though the movies dataset had 87,585 movies, only 600 appear in our merge! Where are our movies? The default behavior for merge is to run what is called an “inner merge”.

An inner merge looks at the unique observations in \(D_{1}\), finds those observations that match in \(D_{2}\), and then only returns those observations that appear in both datasets.
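A toy sketch of the difference (with made-up movieIds): an inner merge only returns the matching rows, while a left merge keeps every row of \(D_{1}\):

```python
import pandas as pd

# Made-up miniature versions of movies (d1) and the sampled links (d2)
d1 = pd.DataFrame({"movieId": [1, 2, 3, 4]})
d2 = pd.DataFrame({"movieId": [2, 4], "imdbId": [20, 40]})

inner = d1.merge(d2, on="movieId")              # default: how="inner"
left  = d1.merge(d2, on="movieId", how="left")  # keeps all rows of d1
print(len(inner), len(left))  # 2 4
```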

Left Merge#

A “left” merge will keep all observations from \(D_{1}\) and add in matching observations from \(D_{2}\). Columns from \(D_{2}\) will be combined with columns in \(D_{1}\). If there is no matching observation in \(D_{2}\), then the columns that correspond to \(D_{2}\) will be filled with the value nan.

An example is useful to illustrate the impact of a left merge. To run a left merge in Python we use the same merge command and add the keyword how="left".

left_merged_movies = movies.merge(_600_links  , on = ["movieId"], how = "left"        )
left_merged_movies.iloc[315:325]
movieId title genres imdbId tmdbId
315 319 Shallow Grave (1994) Comedy|Drama|Thriller NaN NaN
316 320 Suture (1993) Film-Noir|Thriller NaN NaN
317 321 Strawberry and Chocolate (Fresa y chocolate) (... Drama NaN NaN
318 322 Swimming with Sharks (1995) Comedy|Drama NaN NaN
319 324 Sum of Us, The (1994) Comedy|Drama NaN NaN
320 325 National Lampoon's Senior Trip (1995) Comedy NaN NaN
321 326 To Live (Huozhe) (1994) Drama NaN NaN
322 327 Tank Girl (1995) Action|Comedy|Sci-Fi NaN NaN
323 328 Tales from the Crypt Presents: Demon Knight (1... Horror|Thriller NaN NaN
324 329 Star Trek: Generations (1994) Adventure|Drama|Sci-Fi NaN NaN

In the above merge, \(D_{1}\) contains all of the observations with movie ids 319, 320, …, 329. For this random sample, none of these movie ids made it into \(D_{2}\), so for each of these rows the columns that come from \(D_{2}\) (imdbId and tmdbId) are filled with nan. Rows whose movie id does have a match in \(D_{2}\) keep their imdbId and tmdbId values instead.
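When checking merges like this, merge's indicator keyword is handy: it adds a _merge column recording, for every row, whether the match came from both datasets or only one side. A sketch with made-up data:

```python
import pandas as pd

# Made-up data: only movieId 320 appears in both datasets
d1 = pd.DataFrame({"movieId": [319, 320, 321], "title": ["a", "b", "c"]})
d2 = pd.DataFrame({"movieId": [320], "imdbId": [107501]})

merged = d1.merge(d2, on="movieId", how="left", indicator=True)
print(merged["_merge"].tolist())  # ['left_only', 'both', 'left_only']
```

Relatedly, merge accepts validate="one_to_one", which raises an error if the merge keys are not unique on both sides, i.e., if the merge is not a true 1-1 merge.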

Right merge#

A right merge behaves exactly the same as a left merge with one exception: the second dataset in the merge (\(D_{2}\)) is the dataset that keeps all of its observations, and any observations in \(D_{1}\) that match \(D_{2}\) will be added to \(D_{2}\).
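In other words, a right merge is just a left merge with the roles of the two datasets swapped; only the column order differs. A sketch with made-up data:

```python
import pandas as pd

d1 = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
d2 = pd.DataFrame({"movieId": [2, 9], "imdbId": [20, 90]})

right = d1.merge(d2, on="movieId", how="right")
left_swapped = d2.merge(d1, on="movieId", how="left")

# After aligning the column order, the two results are identical
print(right.sort_index(axis=1).equals(left_swapped.sort_index(axis=1)))
```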

right_merged_movies = movies.merge(_600_links  , on = ["movieId"], how = "right"        )
right_merged_movies.iloc[315:325]
movieId title genres imdbId tmdbId
315 277880 Les Segpa (2022) Comedy 19733052 922705.0
316 77444 Best of Times, The (Mei li shi guang) (2001) Crime|Drama|Thriller 333902 121036.0
317 172527 Hooked on the Game (2009) Action|Sci-Fi 1532382 37656.0
318 219025 The Cabinet of Jan Švankmajer (1984) Animation 87016 59355.0
319 154749 The Tournament (1974) Action|Drama 72444 90567.0
320 190625 The Unicorn (2018) Comedy 7149336 483351.0
321 130668 Mako: The Jaws of Death (1976) Horror|Thriller 74845 97559.0
322 199105 Unveiled (2005) Drama 428672 56823.0
323 166812 Seeing Red: Stories of American Communists (1983) (no genres listed) 86273 136058.0
324 212100 100 Miles To 40 (2010) Documentary 1591584 631164.0

In-class exercise and exploration#

covid= pd.read_csv("https://data.cdc.gov/resource/pwn4-m3yp.csv?$limit=500000")

#--NOT NEEDED FOR LESSON, this code matches weeks between covid and flu---------------------------------
def from_day_to_closest_saturday(x):
    from datetime import datetime, timedelta
    day =  datetime.strptime(x.split("T")[0], "%Y-%m-%d")
    while day.weekday()!=5:
        day = day + timedelta(days=1)
    return day.strftime("%Y-%m-%d")
end_of_weeks = [from_day_to_closest_saturday(x) for x in covid.end_date]

covid["end_of_weeks"] = end_of_weeks
#------------------------------------------------------------
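The helper above loops in plain Python over every date. The same "round up to the next Saturday" step can also be done vectorized with pandas datetime tools; a sketch on two made-up date strings in the same format as the CDC's end_date column:

```python
import pandas as pd

# Hypothetical end_date strings in the CDC's "YYYY-MM-DDThh:mm:ss" format
end_date = pd.Series(["2023-01-04T00:00:00.000", "2023-01-07T00:00:00.000"])

days = pd.to_datetime(end_date.str.split("T").str[0])
# weekday(): Monday=0 ... Saturday=5; shift each date forward to its Saturday
offset = (5 - days.dt.weekday) % 7
end_of_weeks = (days + pd.to_timedelta(offset, unit="D")).dt.strftime("%Y-%m-%d")
print(end_of_weeks.tolist())  # ['2023-01-07', '2023-01-07']
```

A date that already falls on a Saturday gets an offset of 0, matching the while-loop version.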

flu = pd.read_csv("https://raw.githubusercontent.com/cdcepi/FluSight-forecast-hub/refs/heads/main/target-data/target-hospital-admissions.csv")

Exercises#

  1. Subset the COVID dataset to one state (your choice)

  2. Subset the Flu dataset to one state (same state as the COVID dataset)

  3. Plot the number of new: cases and deaths due to COVID in a single plot with two “axes”

  4. Plot the number of incident hospitalizations (value column) per week due to Flu

  5. Merge together the COVID and Flu data based on week. This will be the end_of_weeks variable in the COVID dataset and date variable in Flu

  6. Plot the number of new cases due to COVID and the number of new cases due to Flu on a single plot.

  7. Realize that this isn't very helpful

  8. Read this documentation and twin your axes

  9. Plot COVID on one axis and Flu on the twin