Chapter 02: Data, Dataframes, and Pandas#
Experiments, Outcomes, Datapoints, and Dataframes#
We define a set as a collection of items, sometimes called elements. A set is typically given a capital letter (for example \(A\)) and the elements are included inside curly braces.
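For example, \(A = \{1, 3, 5\}\) is a set whose elements are the numbers 1, 3, and 5.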
We use \(\mathcal{G}\) to define the set of all possible outcomes from an experiment and call this set the sample space. The term “experiment” has a broad meaning. An experiment can mean everything from a randomized controlled trial to an observational study. Experiments are what produce observations that we collect and try to characterize. Here an experiment is the process that generates outcomes.
An outcome is defined as an element of the sample space, some result of the experiment that is recorded. An outcome is a single observation from an experiment, and we define an event as a set, or group, of outcomes. Most often we use \(o_{i}\) to denote an outcome and \(E_{i}\) to denote an event.
The sample space, events, and outcomes all describe potential results from an experiment. When we conduct an experiment we will generate an outcome from our sample space and call this realized outcome a data point.
Example Flipping a coin:
Consider the experiment of flipping a coin and recording whether the coin lands heads or tails side up.
We can define a sample space \(\mathcal{G} = \{H,T\}\) where \(H\) is an outcome that represents the coin landing heads up and \(T\) represents tails up.
The sample space includes all the possible outcomes we wish to record: either the coin lands heads or it lands tails.
Up until this point we have structured our experiment, but we have not generated data. We flip the coin and the coin lands tails side up. Now we have performed an experiment and generated the data point \(T\). We flip again and record a heads. \(H\) is the second data point, and so on.
Now suppose that we conduct an experiment with the same sample space \((\mathcal{G})\) a number \(N\) times and with each experiment we record a data point. A tuple (i.e. ordered list) of data points \(d\) is called a data set \(\mathcal{D} = (d_{1}, d_{2}, d_{3}, \cdots, d_{N})\) where \(d_{i}\) is the data point generated from the \(i^\text{th}\) experiment. We say that we have drawn or that we have sampled a data set \(\mathcal{D}\). Further, data points \((d)\) are often called realized outcomes because they are no longer in a set of potential possibilities but are now determined items.
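For example, suppose we flip the coin from the example above \(N = 4\) times and record tails, heads, heads, tails. The data set is then \(\mathcal{D} = (T, H, H, T)\), where \(d_{1} = T\), \(d_{2} = H\), and so on.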
A data set \(\mathcal{D}\) can be unwieldy depending on the number of data points, the complexity of the sample space, or both. A data frame is one way to organize a data set. A data frame \(\mathcal{F}\) is a table where each data point \(d\) in a data set \(\mathcal{D}\) is represented as a row in the table. Each data point may contain multiple pieces of information. That is, each data point may itself be a tuple. In this case, a separate column is created for each position in the tuple.
Example Human predictions of infectious diseases:
Suppose we design an experiment to collect from humans predictions of the number of incident COVID-19 cases and incident deaths at the US national level, two weeks ahead of the time of our experiment (Source).
We decide to collect from each human whether they are an expert in the modeling of infectious disease, a prediction of incident cases, and a prediction of incident deaths.
We draw a data set \(\mathcal{D}\) of 50 human judgment predictions.
We can organize this data set into a data frame:
Expert | Prediction of cases | Prediction of deaths
---|---|---
Yes | 145 | 52
No | 215 | 34
Yes | 524 | 48
Yes | 265 | 95
No | 354 | 35
Table: Example data frame \(\mathcal{F}\) built from a data set \(\mathcal{D}\) that contains 5 data points where each data point is a tuple of length three.
Above, the first data point is \((\text{Yes},145,52)\), the second data point is \((\text{No}, 215, 34)\), and so on until the last data point \((\text{No}, 354, 35)\). A data frame can also include helpful information for others, such as labels for each column.
Pandas#
The pandas module in Python is, by far, the most widely used set of tools for interacting with data, data points, and data frames. The documentation for pandas is available online.
Pandas provides a structured way to import, organize, access, and compute with data frames. But first, we need to discuss the fundamental object in pandas: the Series.
Series#
A Series is (1) a list of items plus (2) an index, a list of labels (often strings) that are associated with each item in (1).
The typical way to define a Series is by calling pd.Series and passing in a list and an index.
For example, the first two data points of our coin flip experiment above were tails and then heads.
Let's assign "tails" the value 0 and "heads" the value 1 to make this more numerically friendly.
import pandas as pd #<--Import pandas (only needed once)
coin_flips = pd.Series([0,1], index=["flip1", "flip2"]) #<--Create a Series
print(coin_flips) # Print this out so we can see what this object looks like
flip1 0
flip2 1
dtype: int64
We see that a Series object was created where all the values are integers. This is a rule for Series: every Series has a single dtype, so we cannot freely mix types such as strings and integers or integers and floats (decimal values); pandas will coerce mixed values to a common type, often the generic object dtype.
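For example, here is a minimal sketch (the variable name mixed is our own) of what happens when we try to mix a string and an integer:

mixed = pd.Series(["tails", 1], index=["flip1", "flip2"]) #<--one string value and one integer value
print(mixed.dtype) #<--prints "object"; pandas coerced the values to the generic object dtype

Returning to the coin_flips series above: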
The index for our series is displayed on the left side. We can access items in a series using the index like this.
coin_flips.get("flip1")
np.int64(0)
or like this
coin_flips["flip1"]
np.int64(0)
A Series behaves much like a Python dictionary. The "keys" of the dictionary are the index values and the "values" of the dictionary are the items in the list. In fact, we can build a Series from a dictionary.
coin_flips = pd.Series({"flip1":0,"flip2":1})
coin_flips
flip1 0
flip2 1
dtype: int64
Finally, it should be noted that pandas Series objects can, for the most part, be treated the same as numpy arrays. Series support vectorized operations. That is, adding/multiplying/dividing/subtracting a constant \(c\) applies to all items in the series.
coin_flips+2 #<--vectorized addition
flip1 2
flip2 3
dtype: int64
Adding/multiplying/dividing/subtracting two Series is computed elementwise (items are matched by their index)
coin_flips + (2* coin_flips-1)
flip1 -1
flip2 2
dtype: int64
and applying a numpy function to the series applies that function to each item in the series.
import numpy as np
np.sqrt(np.exp(coin_flips+1))
flip1 1.648721
flip2 2.718282
dtype: float64
Data Frames#
In pandas, data frames are organized the same as our above definition for a data frame. Rows denote observations and each column corresponds to one piece of information about each outcome/observation.
To build a data frame in Python from scratch, you can (1) define a dictionary such that each key is a string that corresponds to a column name and each value is a list where the first item in the list corresponds to information about the first outcome (or observation), the second item in the list corresponds to information about the second observation, and so on, and then (2) pass this dictionary to pd.DataFrame. An example is helpful here.
Let's build the above data frame of human judgment predictions of COVID-19. We need three columns: expert, cases, deaths. The expert column has the information (Yes, No, Yes, Yes, No), the cases column has the information (145, 215, 524, 265, 354), and so on.
import pandas as pd
human_judgment_predictions = {"expert" :["Yes","No","Yes","Yes","No"]
, "cases" :[145 ,215 ,524 ,265 ,354]
, "deaths":[52 ,34 ,48 ,95 ,35]}
human_judgment_predictions_data_frame = pd.DataFrame(human_judgment_predictions) #<--This converts our dict to a pandas data frame
print(human_judgment_predictions_data_frame)
expert cases deaths
0 Yes 145 52
1 No 215 34
2 Yes 524 48
3 Yes 265 95
4 No 354 35
In many instances we will import a csv, excel, or similar type of tabular data into a python pandas data frame.
To load in a pandas data frame from a csv file, we can use pd.read_csv. For an excel file, we can use pd.read_excel. There are many “read” options in pandas.
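As a sketch (the filenames here are hypothetical), reading other formats looks very similar:

predictions = pd.read_excel("predictions.xlsx") #<--read a hypothetical excel workbook
records = pd.read_json("records.json") #<--read a hypothetical json file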
Example Billion-Dollar Weather and Climate Disasters: The US National Centers for Environmental Information maintains an inventory of the most costly such disasters in the US.
To load in the data called "events-US-1980-2024.csv", which documents these disasters from 1980 to present, we can write
disasters = pd.read_csv("events-US-1980-2024.csv")
disasters
#--Not important for this lesson. (Let's replace the TBD values in these cells with NaN.)
import numpy as np
disasters = disasters.replace("TBD",np.nan)
disasters["Unadjusted Cost"] = disasters["Unadjusted Cost"].astype(float)
Selecting columns#
To select a column from a dataframe, we call the dataframe and use brackets containing the name of the column we want to select, like this.
disasters["Disaster"]
0 Flooding
1 Tropical Cyclone
2 Drought
3 Freeze
4 Severe Storm
...
395 Tropical Cyclone
396 Severe Storm
397 Tropical Cyclone
398 Tropical Cyclone
399 Tropical Cyclone
Name: Disaster, Length: 400, dtype: object
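A selected column is itself a Series, so everything from the Series section applies to it. For example (a sketch; output not shown):

deaths = disasters["Deaths"] #<--a single column is a Series
print(type(deaths)) #<--prints <class 'pandas.core.series.Series'>
print(deaths.mean()) #<--vectorized operations such as the mean work on a column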
Note that the column name is wrapped in quotes; it's a string. We can select multiple columns by providing a list of strings.
Below we will select the two columns “Disaster” and “Name” (the name of the disaster).
disasters[["Disaster","Name"]]
 | Disaster | Name |
---|---|---|
0 | Flooding | Southern Severe Storms and Flooding (April 1980) |
1 | Tropical Cyclone | Hurricane Allen (August 1980) |
2 | Drought | Central/Eastern Drought/Heat Wave (Summer-Fall... |
3 | Freeze | Florida Freeze (January 1981) |
4 | Severe Storm | Severe Storms, Flash Floods, Hail, Tornadoes (... |
... | ... | ... |
395 | Tropical Cyclone | Hurricane Beryl (July 2024) |
396 | Severe Storm | Central and Eastern Tornado Outbreak and Sever... |
397 | Tropical Cyclone | Hurricane Debby (August 2024) |
398 | Tropical Cyclone | Hurricane Helene (September 2024) |
399 | Tropical Cyclone | Hurricane Milton (August 2024) |
400 rows × 2 columns
Selecting rows#
The first way to select rows is by slicing. We can slice rows by providing a range of rows to select, beginning:end. For example, if we want to select the rows at positions 3, 4, and so on up to 9 (the end of the slice, 10, is not included), we can write
disasters[3:10]
 | Name | Disaster | Begin Date | End Date | CPI-Adjusted Cost | Unadjusted Cost | Deaths |
---|---|---|---|---|---|---|---|
3 | Florida Freeze (January 1981) | Freeze | 19810112 | 19810114 | 2070.6 | 572.0 | 0 |
4 | Severe Storms, Flash Floods, Hail, Tornadoes (... | Severe Storm | 19810505 | 19810510 | 1405.2 | 401.4 | 20 |
5 | Midwest/Southeast/Northeast Winter Storm, Cold... | Winter Storm | 19820108 | 19820116 | 2211.1 | 662.0 | 85 |
6 | Midwest/Plains/Southeast Tornadoes (April 1982) | Severe Storm | 19820402 | 19820404 | 1599.5 | 483.2 | 33 |
7 | Severe Storms (June 1982) | Severe Storm | 19820531 | 19820610 | 1574.2 | 479.9 | 30 |
8 | Gulf States Storms and Flooding (December 1982... | Flooding | 19821201 | 19830115 | 4930.7 | 1536.1 | 45 |
9 | Western Storms and Flooding (December 1982-Mar... | Flooding | 19821213 | 19830331 | 4813.8 | 1499.6 | 50 |
More often than not, you will use the .loc attribute to select rows.
You can select rows based on their index name
disasters.loc[4]
Name Severe Storms, Flash Floods, Hail, Tornadoes (...
Disaster Severe Storm
Begin Date 19810505
End Date 19810510
CPI-Adjusted Cost 1405.2
Unadjusted Cost 401.4
Deaths 20
Name: 4, dtype: object
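We can also give .loc a list of index names to select several rows at once. A sketch (output not shown):

disasters.loc[[0, 2, 4]] #<--select the rows with index names 0, 2, and 4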
Even more important, you can select rows via boolean indexing. Boolean indexing supplies the data frame with a list (or Series) of True and False values that has the same length as the number of rows in the data frame. The rows corresponding to True values are returned.
Creating boolean lists based on our data frame is easy. For example, suppose we want to identify (with True and False) the disasters that resulted in more than 20 deaths. We can write the following expression
disasters["Deaths"] > 20
0 False
1 False
2 True
3 False
4 False
...
395 True
396 False
397 False
398 True
399 True
Name: Deaths, Length: 400, dtype: bool
This Boolean list can be used to select all the rows with more than 20 deaths, like this
disasters.loc[disasters["Deaths"] > 20]
 | Name | Disaster | Begin Date | End Date | CPI-Adjusted Cost | Unadjusted Cost | Deaths |
---|---|---|---|---|---|---|---|
2 | Central/Eastern Drought/Heat Wave (Summer-Fall... | Drought | 19800601 | 19801130 | 40480.8 | 10020.0 | 1260 |
5 | Midwest/Southeast/Northeast Winter Storm, Cold... | Winter Storm | 19820108 | 19820116 | 2211.1 | 662.0 | 85 |
6 | Midwest/Plains/Southeast Tornadoes (April 1982) | Severe Storm | 19820402 | 19820404 | 1599.5 | 483.2 | 33 |
7 | Severe Storms (June 1982) | Severe Storm | 19820531 | 19820610 | 1574.2 | 479.9 | 30 |
8 | Gulf States Storms and Flooding (December 1982... | Flooding | 19821201 | 19830115 | 4930.7 | 1536.1 | 45 |
... | ... | ... | ... | ... | ... | ... | ... |
374 | Southern/Midwestern Drought and Heatwave (Spri... | Drought | 20230401 | 20230930 | 14786.7 | 14082.0 | 247 |
378 | Central, Southern, Northeastern Winter Storm a... | Winter Storm | 20240114 | 20240118 | 1948.3 | 1910.0 | 41 |
395 | Hurricane Beryl (July 2024) | Tropical Cyclone | 20240708 | 20240708 | 7219 | 7219.0 | 45 |
398 | Hurricane Helene (September 2024) | Tropical Cyclone | 20240924 | 20240929 | NaN | NaN | 225 |
399 | Hurricane Milton (August 2024) | Tropical Cyclone | 20241009 | 20241010 | NaN | NaN | 24 |
115 rows × 7 columns
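Boolean conditions can be combined with & (and) and | (or), with each condition wrapped in parentheses. For example, a sketch that selects flooding disasters with more than 20 deaths (output not shown):

disasters.loc[(disasters["Deaths"] > 20) & (disasters["Disaster"] == "Flooding")]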
We can select both rows and columns using the .loc attribute as well. The syntax is df.loc[ <row selection>, <column selection> ].
For example, let's select rows for disasters with an unadjusted cost of more than 20,000 (the costs in this file are recorded in millions of dollars, so this is more than $20 billion), and for columns we'll select the name and type of disaster.
disasters.loc[disasters["Unadjusted Cost"]>20000 , ["Name","Disaster"]]
 | Name | Disaster |
---|---|---|
26 | U.S. Drought/Heat Wave (Summer 1988) | Drought |
44 | Hurricane Andrew (August 1992) | Tropical Cyclone |
50 | Midwest Flooding (Summer 1993) | Flooding |
115 | Hurricane Ivan (September 2004) | Tropical Cyclone |
119 | Hurricane Katrina (August 2005) | Tropical Cyclone |
145 | Hurricane Ike (September 2008) | Tropical Cyclone |
190 | Hurricane Sandy (October 2012) | Tropical Cyclone |
192 | U.S. Drought/Heat Wave (2012) | Drought |
253 | Hurricane Harvey (August 2017) | Tropical Cyclone |
254 | Hurricane Irma (September 2017) | Tropical Cyclone |
255 | Hurricane Maria (September 2017) | Tropical Cyclone |
270 | Hurricane Florence (September 2018) | Tropical Cyclone |
271 | Hurricane Michael (October 2018) | Tropical Cyclone |
273 | Western Wildfires, California Firestorm (Summe... | Wildfire |
303 | Hurricane Laura (August 2020) | Tropical Cyclone |
311 | Northwest, Central, Eastern Winter Storm and C... | Winter Storm |
324 | Hurricane Ida (August 2021) | Tropical Cyclone |
343 | Hurricane Ian (September 2022) | Tropical Cyclone |
347 | Western/Central Drought and Heat Wave (2022) | Drought |
Adding new columns#
There are several different ways to add columns to a pandas data frame. The most straightforward way is to call the dataframe with a bracketed string containing the new column name and assign the column values to it.
For example, suppose we want to add a column to the above disasters data frame that holds the number of deaths minus the mean number of deaths across all disasters.
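A minimal sketch of this bracket-assignment approach (the column name "Deaths vs Mean" is our own choice, and the assignment is left commented out so the running example is unchanged):

deaths_vs_mean = disasters["Deaths"] - disasters["Deaths"].mean() #<--vectorized: subtract the column mean from every value
# disasters["Deaths vs Mean"] = deaths_vs_mean #<--bracket assignment would add this as a new column

A second way to add a column is the assign method. Below we use assign to add a Boolean column, Flood, that records whether the word "Flood" appears in the type of each disaster.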
disasters.assign( Flood = lambda x: x["Disaster"].str.contains("Flood") )
 | Name | Disaster | Begin Date | End Date | CPI-Adjusted Cost | Unadjusted Cost | Deaths | Flood |
---|---|---|---|---|---|---|---|---|
0 | Southern Severe Storms and Flooding (April 1980) | Flooding | 19800410 | 19800417 | 2742.3 | 706.8 | 7 | True |
1 | Hurricane Allen (August 1980) | Tropical Cyclone | 19800807 | 19800811 | 2230.2 | 590.0 | 13 | False |
2 | Central/Eastern Drought/Heat Wave (Summer-Fall... | Drought | 19800601 | 19801130 | 40480.8 | 10020.0 | 1260 | False |
3 | Florida Freeze (January 1981) | Freeze | 19810112 | 19810114 | 2070.6 | 572.0 | 0 | False |
4 | Severe Storms, Flash Floods, Hail, Tornadoes (... | Severe Storm | 19810505 | 19810510 | 1405.2 | 401.4 | 20 | False |
... | ... | ... | ... | ... | ... | ... | ... | ... |
395 | Hurricane Beryl (July 2024) | Tropical Cyclone | 20240708 | 20240708 | 7219 | 7219.0 | 45 | False |
396 | Central and Eastern Tornado Outbreak and Sever... | Severe Storm | 20240713 | 20240716 | 2435 | 2435.0 | 2 | False |
397 | Hurricane Debby (August 2024) | Tropical Cyclone | 20240805 | 20240809 | 2476 | 2476.0 | 10 | False |
398 | Hurricane Helene (September 2024) | Tropical Cyclone | 20240924 | 20240929 | NaN | NaN | 225 | False |
399 | Hurricane Milton (August 2024) | Tropical Cyclone | 20241009 | 20241010 | NaN | NaN | 24 | False |
400 rows × 8 columns
It's important to note that the assign function returns a new dataframe rather than modifying the original in place.
This means that if we want to add the column "Flood" to our original dataframe, we need to write
disasters = disasters.assign( Flood = lambda x: x["Disaster"].str.contains("Flood") )
disasters
 | Name | Disaster | Begin Date | End Date | CPI-Adjusted Cost | Unadjusted Cost | Deaths | Flood |
---|---|---|---|---|---|---|---|---|
0 | Southern Severe Storms and Flooding (April 1980) | Flooding | 19800410 | 19800417 | 2742.3 | 706.8 | 7 | True |
1 | Hurricane Allen (August 1980) | Tropical Cyclone | 19800807 | 19800811 | 2230.2 | 590.0 | 13 | False |
2 | Central/Eastern Drought/Heat Wave (Summer-Fall... | Drought | 19800601 | 19801130 | 40480.8 | 10020.0 | 1260 | False |
3 | Florida Freeze (January 1981) | Freeze | 19810112 | 19810114 | 2070.6 | 572.0 | 0 | False |
4 | Severe Storms, Flash Floods, Hail, Tornadoes (... | Severe Storm | 19810505 | 19810510 | 1405.2 | 401.4 | 20 | False |
... | ... | ... | ... | ... | ... | ... | ... | ... |
395 | Hurricane Beryl (July 2024) | Tropical Cyclone | 20240708 | 20240708 | 7219 | 7219.0 | 45 | False |
396 | Central and Eastern Tornado Outbreak and Sever... | Severe Storm | 20240713 | 20240716 | 2435 | 2435.0 | 2 | False |
397 | Hurricane Debby (August 2024) | Tropical Cyclone | 20240805 | 20240809 | 2476 | 2476.0 | 10 | False |
398 | Hurricane Helene (September 2024) | Tropical Cyclone | 20240924 | 20240929 | NaN | NaN | 225 | False |
399 | Hurricane Milton (August 2024) | Tropical Cyclone | 20241009 | 20241010 | NaN | NaN | 24 | False |
400 rows × 8 columns
In the above we used a special pandas attribute, str.
We’ll learn more about how to deal with strings in next week’s lesson.
Renaming columns#
To rename a column in our pandas dataframe, we can use the following syntax:
d = d.rename(columns = {"old name":"new name"})
where we pass the rename function a dictionary whose "keys" are the old column names we want to rename and whose "values" are the new names associated with each old name.
For example, if we wanted to change the column name “Begin Date” to “Begin” we could write
disasters = disasters.rename(columns = {"Begin Date":"Begin"})
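The dictionary can contain several old/new pairs, so multiple columns can be renamed in a single call. A sketch (the new names here are our own choices, and the result is not assigned back to disasters):

disasters.rename(columns = {"End Date":"End", "CPI-Adjusted Cost":"Adjusted Cost"}) #<--returns a renamed copy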
List out all column names#
We can list out all of the column names using the columns attribute.
disasters.columns
Index(['Name', 'Disaster', 'Begin', 'End Date', 'CPI-Adjusted Cost',
'Unadjusted Cost', 'Deaths', 'Flood'],
dtype='object')
Listing out all the columns gives the researcher a quick way to find out what pieces of information were collected for each observation in the dataset.
Applying a function over rows#
For many tasks, you may want to apply a function to each observation. For example, for our disasters dataframe, we may want to remove the parenthetical from each of the disaster names (the Name column).
In Pandas, to iterate over rows we have two options.
Option one is to use the iterrows() method.
Using iterrows()#
for index, row in disasters.iterrows():
    #<---------code to operate on each row>
    break #<-- not needed, only to illustrate what is returned from iterrows()

print("Index------")
print(index)
print("Row--------")
print(row)
Index------
0
Row--------
Name Southern Severe Storms and Flooding (April 1980)
Disaster Flooding
Begin 19800410
End Date 19800417
CPI-Adjusted Cost 2742.3
Unadjusted Cost 706.8
Deaths 7
Flood True
Name: 0, dtype: object
We see that when we call iterrows in a for loop, this method returns, for each row in the dataframe, the index name and the row information. The row is a Series object.
For example, we can access the Name of the disaster as
row.Name
'Southern Severe Storms and Flooding (April 1980)'
To remove the parenthetical from each name, we can use split to split the string on the symbol "(". This will break our string into two pieces; we will need the first one (i.e. the name).
row.Name.split("(")
['Southern Severe Storms and Flooding ', 'April 1980)']
name,parenthetical = row.Name.split("(")
name = name.strip() #<--this command removes any white space before or after a string
Now that we have arrived at our answer for one row, let's compute this for every row and store the results in a list.
names = [] #empty list
for index, row in disasters.iterrows():
    name,parenthetical = row.Name.split("(")
    name = name.strip()
    names.append(name)
Now that we have a list of all the names, let's add it to our dataframe.
disasters["name"] = names
disasters
 | Name | Disaster | Begin | End Date | CPI-Adjusted Cost | Unadjusted Cost | Deaths | Flood | name |
---|---|---|---|---|---|---|---|---|---|
0 | Southern Severe Storms and Flooding (April 1980) | Flooding | 19800410 | 19800417 | 2742.3 | 706.8 | 7 | True | Southern Severe Storms and Flooding |
1 | Hurricane Allen (August 1980) | Tropical Cyclone | 19800807 | 19800811 | 2230.2 | 590.0 | 13 | False | Hurricane Allen |
2 | Central/Eastern Drought/Heat Wave (Summer-Fall... | Drought | 19800601 | 19801130 | 40480.8 | 10020.0 | 1260 | False | Central/Eastern Drought/Heat Wave |
3 | Florida Freeze (January 1981) | Freeze | 19810112 | 19810114 | 2070.6 | 572.0 | 0 | False | Florida Freeze |
4 | Severe Storms, Flash Floods, Hail, Tornadoes (... | Severe Storm | 19810505 | 19810510 | 1405.2 | 401.4 | 20 | False | Severe Storms, Flash Floods, Hail, Tornadoes |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
395 | Hurricane Beryl (July 2024) | Tropical Cyclone | 20240708 | 20240708 | 7219 | 7219.0 | 45 | False | Hurricane Beryl |
396 | Central and Eastern Tornado Outbreak and Sever... | Severe Storm | 20240713 | 20240716 | 2435 | 2435.0 | 2 | False | Central and Eastern Tornado Outbreak and Sever... |
397 | Hurricane Debby (August 2024) | Tropical Cyclone | 20240805 | 20240809 | 2476 | 2476.0 | 10 | False | Hurricane Debby |
398 | Hurricane Helene (September 2024) | Tropical Cyclone | 20240924 | 20240929 | NaN | NaN | 225 | False | Hurricane Helene |
399 | Hurricane Milton (August 2024) | Tropical Cyclone | 20241009 | 20241010 | NaN | NaN | 24 | False | Hurricane Milton |
400 rows × 9 columns
Using apply#
Another option is to define a function (func) that takes as input a row and outputs a python object (anything, really). Then, use the apply method to pass each row of our dataframe to func and collect the desired result.
For example, we could have decided to write a function that inputs a row and returns a string representing the name of the disaster without the parenthetical.
def from_fullname_2_name(row):
    name,parenthetical = row.Name.split("(")
    name = name.strip()
    return name
disasters.apply(from_fullname_2_name, 1) #<-- note the second argument, 1 (the axis). This tells apply to pass each row to the function as input. The value 0 would pass each column instead.
0 Southern Severe Storms and Flooding
1 Hurricane Allen
2 Central/Eastern Drought/Heat Wave
3 Florida Freeze
4 Severe Storms, Flash Floods, Hail, Tornadoes
...
395 Hurricane Beryl
396 Central and Eastern Tornado Outbreak and Sever...
397 Hurricane Debby
398 Hurricane Helene
399 Hurricane Milton
Length: 400, dtype: object
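As an aside, the same result can be produced without looping at all by using the vectorized str methods we saw earlier (a sketch, not the approach used above):

disasters["Name"].str.split("(").str[0].str.strip() #<--split each name on "(", keep the first piece, and strip the whitespace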