Chapter 4: Split-Apply-Combine and Merging#
Split - Apply - Combine#
A common task in statistical programming is to apply the same function to many groups of observations and collect the results. Because individual outcomes/observations correspond to rows in a dataframe, we often need an additional column (or variable) that identifies which row belongs to which group. This additional column almost always takes values from a small, fixed set of possibilities.
Example, Australian doctor visits: The dataset that we will explore is a cross section of information about Australian doctor visits collected in 1977 and 1978. The 5,190 observations correspond to patients surveyed during this period, and the columns record information like the number of visits to a physician in the past two weeks (visits), age in years divided by 100 (age), annual income in tens of thousands of dollars (income), etc.
import pandas as pd
import numpy as np
doctorVisits = pd.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/AER/DoctorVisits.csv")
doctorVisits
| | rownames | visits | gender | age | income | illness | reduced | health | private | freepoor | freerepat | nchronic | lchronic |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | female | 0.19 | 0.55 | 1 | 4 | 1 | yes | no | no | no | no |
| 1 | 2 | 1 | female | 0.19 | 0.45 | 1 | 2 | 1 | yes | no | no | no | no |
| 2 | 3 | 1 | male | 0.19 | 0.90 | 3 | 0 | 0 | no | no | no | no | no |
| 3 | 4 | 1 | male | 0.19 | 0.15 | 1 | 0 | 0 | no | no | no | no | no |
| 4 | 5 | 1 | male | 0.19 | 0.45 | 2 | 5 | 1 | no | no | no | yes | no |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5185 | 5186 | 0 | female | 0.22 | 0.55 | 0 | 0 | 0 | no | no | no | no | no |
| 5186 | 5187 | 0 | male | 0.27 | 1.30 | 0 | 0 | 1 | no | no | no | no | no |
| 5187 | 5188 | 0 | female | 0.37 | 0.25 | 1 | 0 | 1 | no | no | yes | no | no |
| 5188 | 5189 | 0 | female | 0.52 | 0.65 | 0 | 0 | 0 | no | no | no | no | no |
| 5189 | 5190 | 0 | male | 0.72 | 0.25 | 0 | 0 | 0 | no | no | yes | no | no |

5190 rows × 13 columns
We can see that there exist specific columns that can group rows together. For example, the variable private has values “yes” and “no” that correspond to whether the patient does or does not have private health insurance. This means that we can split our dataframe into those who have private insurance and those who do not.
private_insurance = doctorVisits.loc[doctorVisits.private == "yes"]
no_private_insurance = doctorVisits.loc[doctorVisits["private"] == "no"]
Note the two different ways to refer to a variable above. We have discussed before that you can refer to a variable using square brackets and the name of the variable enclosed in quotes. In addition, when a dataframe is created, every column name is also added as an attribute. That means we can refer to the column private using either doctorVisits["private"] or doctorVisits.private.
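As a quick illustration, both access styles return the same column, but attribute access only works when the column name is a valid Python identifier and does not clash with an existing DataFrame method; bracket notation always works. A minimal sketch:

```python
# Both expressions return the same pandas Series
same = (doctorVisits["private"] == doctorVisits.private).all()
print(same)  # True

# Bracket notation would be required for a name that is not a valid attribute,
# e.g. a hypothetical column called "visit count" with a space in it.
```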
Now that our observations have been split into groups, we may want to apply a function that computes the mean and standard deviation of the number of visits. We will create a function called summarize_visits that inputs a dataframe and outputs a pandas Series with a group label and those two summary metrics.
def summarize_visits(d, label):
    """Return the mean and standard deviation of visits in the dataframe d, tagged with a group label."""
    mean_visits = np.mean(d.visits)
    sd_visits = np.std(d.visits)  # np.std computes the population (ddof=0) standard deviation
    return pd.Series({"group": label, "mean_visits": mean_visits, "sd_visits": sd_visits})
#--apply the above function to both dataframes
summary_private = summarize_visits(private_insurance, "yes")
summary_no_private = summarize_visits(no_private_insurance, "no")
Let's look at one of these Series to get a feel for what they look like.
summary_private
group yes
mean_visits 0.294604
sd_visits 0.771995
dtype: object
The final step is to combine our results from the different groups.
combine = pd.DataFrame([summary_private, summary_no_private])
combine
| | group | mean_visits | sd_visits |
|---|---|---|---|
| 0 | yes | 0.294604 | 0.771995 |
| 1 | no | 0.307400 | 0.818130 |
Groupby as a (much easier) implementation of the Split-Apply-Combine paradigm#
The above procedure is used so often that it has been optimized: the code is shorter and will typically compute faster than if we implemented the three steps above ourselves. In pandas, we use groupby.
Given a dataframe \(\mathcal{F}\), groupby takes two inputs: (1) a list of column names that defines the groups. For a list of column names \([c_{1},c_{2},\cdots,c_{n}]\), define the set of unique values in column \(c_{1}\) as \(C_{1}\), the unique values in \(c_{2}\) as \(C_{2}\), etc. Then a group is an element of \(C_{1} \times C_{2} \times \cdots \times C_{n}\), where \(A \times B = \{ (x,y): x \in A \text{ and } y \in B \}\) is the Cartesian product of \(A\) and \(B\).
Input (2) is a function to apply to each group. This function should input a dataframe and return any Python object.
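For example, grouping by both private and gender produces one group for each element of the Cartesian product {no, yes} × {female, male}. A minimal sketch using this dataset:

```python
# Each (private, gender) pair that appears in the data is one group;
# .size() counts how many rows fall into each group.
doctorVisits.groupby(["private", "gender"]).size()
```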
Let's use this groupby technique to return the mean and standard deviation of the number of visits, stratified by private insurance.
def summarize_visits(d):
    """Return the mean and standard deviation of visits in the dataframe d."""
    mean_visits = np.mean(d.visits)
    sd_visits = np.std(d.visits)
    return pd.Series({"mean_visits": mean_visits, "sd_visits": sd_visits})

doctorVisits.groupby(["private"]).apply(summarize_visits, include_groups=False)
| private | mean_visits | sd_visits |
|---|---|---|
| no | 0.307400 | 0.818130 |
| yes | 0.294604 | 0.771995 |
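When the per-group summary is a standard aggregation such as a mean or standard deviation, the same table can be produced with pandas' built-in agg method instead of a custom function. A sketch (note that pandas' std defaults to the sample standard deviation, ddof=1, whereas np.std above uses ddof=0, so the sd column will differ slightly):

```python
# Built-in aggregations: one row per group, one column per statistic
doctorVisits.groupby("private")["visits"].agg(["mean", "std"])
```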
Merging#
When you want to combine two or more datasets, this is called merging. There are several types of merges, but they all rely on the same process. Given a dataset \(D_{1}\) and a second dataset \(D_{2}\), we want to combine them into a dataset \(D\) such that:

- each observation in \(D_{1}\) is matched with the corresponding observation in \(D_{2}\), and
- the columns in \(D_{1}\) are retained and the columns in \(D_{2}\) are retained.

The two most common types of merges are 1-1 merging and 1-many merging.
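The 1-many case is not worked through below, but a small hedged sketch with made-up dataframes (patients and visit_log are invented purely for illustration) shows the idea: each row on the "one" side is repeated once for every matching row on the "many" side.

```python
import pandas as pd

# Toy data: one row per patient (the "one" side) ...
patients = pd.DataFrame({"patient_id": [1, 2], "gender": ["female", "male"]})
# ... and several visit records per patient (the "many" side)
visit_log = pd.DataFrame({"patient_id": [1, 1, 2, 2, 2], "visit_length": [10, 15, 5, 20, 30]})

# 1-many merge: each patient's gender is copied onto every matching visit record
patients.merge(visit_log, on=["patient_id"])
```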
1-1 merging#
A 1-1 merge assumes that the same, unique observation is present in both \(D_{1}\) and \(D_{2}\). For our example we'll use a non-health, but instructive, database called MovieLens.
The MovieLens database contains four datasets: movies, links, ratings, and tags. Let's look at the movies and links datasets.
import pandas as pd
movies = pd.read_csv("./movies.csv")  # assumes the MovieLens movies.csv file is in the working directory
movies
links = pd.read_csv("./links.csv")
links
| | movieId | imdbId | tmdbId |
|---|---|---|---|
| 0 | 1 | 114709 | 862.0 |
| 1 | 2 | 113497 | 8844.0 |
| 2 | 3 | 113228 | 15602.0 |
| 3 | 4 | 114885 | 31357.0 |
| 4 | 5 | 113041 | 11862.0 |
| ... | ... | ... | ... |
| 87580 | 292731 | 26812510 | 1032473.0 |
| 87581 | 292737 | 14907358 | 986674.0 |
| 87582 | 292753 | 12388280 | 948139.0 |
| 87583 | 292755 | 64027 | 182776.0 |
| 87584 | 292757 | 28995566 | 1174725.0 |

87585 rows × 3 columns
Suppose that we want to “add in” the columns from the links dataset to the movies dataset. That is, the experiment here is the production of a movie, and a single “observation” in this experiment is one movie.
Pandas has a command called merge that can combine these datasets. To merge, we will need to provide dataset one, dataset two, and the column(s) used in both datasets to identify one unique observation.
#--d1-------d2-----how to define a unique observation
movies.merge(links, on=["movieId"])
| | movieId | title | genres | imdbId | tmdbId |
|---|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 114709 | 862.0 |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy | 113497 | 8844.0 |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance | 113228 | 15602.0 |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance | 114885 | 31357.0 |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy | 113041 | 11862.0 |
| ... | ... | ... | ... | ... | ... |
| 87580 | 292731 | The Monroy Affaire (2022) | Drama | 26812510 | 1032473.0 |
| 87581 | 292737 | Shelter in Solitude (2023) | Comedy\|Drama | 14907358 | 986674.0 |
| 87582 | 292753 | Orca (2023) | Drama | 12388280 | 948139.0 |
| 87583 | 292755 | The Angry Breed (1968) | Drama | 64027 | 182776.0 |
| 87584 | 292757 | Race to the Summit (2023) | Action\|Adventure\|Documentary | 28995566 | 1174725.0 |

87585 rows × 5 columns
We can see that our merge produced a new, combined dataset: one with all the columns in movies plus all the columns in links. Important: it is crucial to check that the number of rows in the original and final datasets match what you would expect. Here each movie had exactly one match in links, so the merged dataset has the same 87,585 rows. What happens if that isn't the case?
Matching (inner, left, right)#
Let's remove all but 600 movies from the links dataset and see what happens when we merge this time.
import numpy as np
pick_600_rows = np.random.choice(len(links), 600, replace=False)  # <-- select 600 distinct rows at random
_600_links = links.iloc[pick_600_rows]                            # <-- use iloc to pick those 600 rows
movies.merge(_600_links, on=["movieId"])
| | movieId | title | genres | imdbId | tmdbId |
|---|---|---|---|---|---|
| 0 | 95 | Broken Arrow (1996) | Action\|Adventure\|Thriller | 115759 | 9208.0 |
| 1 | 202 | Total Eclipse (1995) | Drama\|Romance | 114702 | 36834.0 |
| 2 | 220 | Castle Freak (1995) | Horror | 112643 | 18256.0 |
| 3 | 308 | Three Colors: White (Trzy kolory: Bialy) (1994) | Comedy\|Drama | 111507 | 109.0 |
| 4 | 529 | Searching for Bobby Fischer (1993) | Drama | 108065 | 14291.0 |
| ... | ... | ... | ... | ... | ... |
| 595 | 290313 | Turtle Vision (1991) | Drama\|Horror | 103142 | 254874.0 |
| 596 | 290615 | The Ones You Didn’t Burn (2022) | Horror | 14080416 | 974264.0 |
| 597 | 290783 | The Princesses and the Mysterious Guest (2023) | Adventure\|Animation | 26761266 | 1078843.0 |
| 598 | 291142 | Born Again (2015) | Drama | 4940528 | 420085.0 |
| 599 | 291310 | Moosie (1999) | Adventure\|Children | 165885 | 304696.0 |

600 rows × 5 columns
What happened? Even though the movies dataset had 87,585 movies, only 600 appear in our merge! Where are our movies? The default behavior for merge is to run what is called an “inner merge”.
An inner merge looks at the unique observations in \(D_{1}\), finds those observations that match in \(D_{2}\), and then only returns those observations that appear in both datasets.
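Because “inner” is the default, the merge above is equivalent to spelling out the how argument explicitly. A sketch:

```python
# Identical to movies.merge(_600_links, on=["movieId"])
movies.merge(_600_links, on=["movieId"], how="inner")
```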
Left Merge#
A “left” merge will keep all observations from \(D_{1}\) and add in matching observations from \(D_{2}\). Columns from \(D_{2}\) will be combined with columns in \(D_{1}\). If there is no matching observation in \(D_{2}\), then the columns that correspond to \(D_{2}\) will be filled with the value nan.
An example is useful to illustrate the impact of a left merge. To run a left merge in pandas we use the same merge command and add the keyword argument how="left".
left_merged_movies = movies.merge(_600_links, on=["movieId"], how="left")
left_merged_movies.iloc[315:325]
| | movieId | title | genres | imdbId | tmdbId |
|---|---|---|---|---|---|
| 315 | 319 | Shallow Grave (1994) | Comedy\|Drama\|Thriller | NaN | NaN |
| 316 | 320 | Suture (1993) | Film-Noir\|Thriller | NaN | NaN |
| 317 | 321 | Strawberry and Chocolate (Fresa y chocolate) (... | Drama | NaN | NaN |
| 318 | 322 | Swimming with Sharks (1995) | Comedy\|Drama | NaN | NaN |
| 319 | 324 | Sum of Us, The (1994) | Comedy\|Drama | NaN | NaN |
| 320 | 325 | National Lampoon's Senior Trip (1995) | Comedy | NaN | NaN |
| 321 | 326 | To Live (Huozhe) (1994) | Drama | NaN | NaN |
| 322 | 327 | Tank Girl (1995) | Action\|Comedy\|Sci-Fi | NaN | NaN |
| 323 | 328 | Tales from the Crypt Presents: Demon Knight (1... | Horror\|Thriller | NaN | NaN |
| 324 | 329 | Star Trek: Generations (1994) | Adventure\|Drama\|Sci-Fi | NaN | NaN |
In the above merge, \(D_{1}\) contains all of the observations with movie ids 319, 320, …, 329, and the left merge keeps every one of them. None of these movie ids happened to be among the 600 rows we randomly sampled into \(D_{2}\), so there is no match for them and the columns that come from \(D_{2}\) (imdbId and tmdbId) are filled with nan. For any movie id that does appear in \(D_{2}\), those columns would instead contain the matching values.
Right merge#
A right merge behaves exactly the same as a left merge with one exception: the second dataset in the merge (\(D_{2}\)) is the dataset that keeps all of its observations, and any observations in \(D_{1}\) that match \(D_{2}\) will be added to \(D_{2}\).
right_merged_movies = movies.merge(_600_links, on=["movieId"], how="right")
right_merged_movies.iloc[315:325]
| | movieId | title | genres | imdbId | tmdbId |
|---|---|---|---|---|---|
| 315 | 48231 | Taxidermia (2006) | Comedy\|Drama\|Horror | 410730 | 15742.0 |
| 316 | 225924 | The Call (2020) | (no genres listed) | 12971924 | 746817.0 |
| 317 | 95567 | People Like Us (2012) | Drama | 1716777 | 98548.0 |
| 318 | 157085 | O Coronel e o Lobisomem (2005) | Comedy | 456117 | 29009.0 |
| 319 | 89924 | I Want Candy (2007) | Comedy | 791309 | 13687.0 |
| 320 | 187179 | Kyrsyä: Tuftland (2018) | Horror\|Mystery | 5104330 | 484237.0 |
| 321 | 279192 | Simchas and Sorrows (2022) | Comedy\|Drama | 14871354 | 872201.0 |
| 322 | 286839 | Personality Crisis: One Night Only (2022) | Documentary | 12667838 | 1013574.0 |
| 323 | 3499 | Misery (1990) | Drama\|Horror\|Thriller | 100157 | 1700.0 |
| 324 | 220370 | The Dalai Lama: Scientist | Documentary | 7749892 | 689200.0 |
In-class exercise and exploration#
covid = pd.read_csv("https://data.cdc.gov/resource/pwn4-m3yp.csv?$limit=500000")

#--NOT NEEDED FOR LESSON: this code aligns COVID weeks with flu weeks-----------
def from_day_to_closest_saturday(x):
    from datetime import datetime, timedelta
    day = datetime.strptime(x.split("T")[0], "%Y-%m-%d")  # strip the time portion and parse the date
    while day.weekday() != 5:                              # 5 = Saturday
        day = day + timedelta(days=1)
    return day.strftime("%Y-%m-%d")

end_of_weeks = [from_day_to_closest_saturday(x) for x in covid.end_date]
covid["end_of_weeks"] = end_of_weeks
#------------------------------------------------------------
flu = pd.read_csv("https://raw.githubusercontent.com/cdcepi/FluSight-forecast-hub/refs/heads/main/target-data/target-hospital-admissions.csv")
Exercises#
1. Subset the COVID dataset to one state (your choice).
2. Subset the Flu dataset to one state (same state as the COVID dataset).
3. Plot the number of new cases and deaths due to COVID in a single plot with two “axes”.
4. Plot the number of incident hospitalizations (value column) per week due to Flu.
5. Merge together the COVID and Flu data based on week. This will be the end_of_weeks variable in the COVID dataset and the date variable in the Flu dataset.
6. Plot the number of new cases due to COVID and the number of new cases due to Flu on a single plot.
7. Realize that this isn't very helpful.
8. Read this documentation and twin your axes (see the sketch after this list for the general pattern).
9. Plot COVID on one axis and Flu on the twin.
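As a starting point for the last few steps, here is a minimal, hedged sketch of the twin-axes pattern with matplotlib. The column name new_case is an assumption about the CDC COVID dataset, value comes from the FluSight target data, and covid_state / flu_state stand in for the state-level subsets you build above.

```python
import matplotlib.pyplot as plt

# Hypothetical state-level subsets created in the exercises above:
#   covid_state: columns end_of_weeks and new_case (assumed name)
#   flu_state:   columns date and value

# Merge on week, even though the key columns have different names in each dataset
merged = covid_state.merge(flu_state, left_on="end_of_weeks", right_on="date", how="inner")

fig, ax = plt.subplots()
ax.plot(merged["end_of_weeks"], merged["new_case"], color="blue", label="COVID cases")
ax.set_ylabel("COVID cases")

ax2 = ax.twinx()  # second y-axis that shares the same x-axis
ax2.plot(merged["end_of_weeks"], merged["value"], color="red", label="Flu hospitalizations")
ax2.set_ylabel("Flu hospitalizations")

plt.show()
```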