Chapter 02: Data, Dataframes, and Pandas#

Experiments, Outcomes, Datapoints, and Dataframes#

We define a set as a collection of items, sometimes called elements. A set is typically given a capital letter (for example \(A\)) and the elements are included inside curly braces.

We use \(\mathcal{G}\) to define the set of all possible outcomes from an experiment and call this set the sample space. The term “experiment” has a broad meaning. An experiment can mean everything from a randomized controlled trial to an observational study. Experiments are what produce observations that we collect and try to characterize. Here an experiment is the process that generates outcomes.

An outcome is defined as an element of the sample space, some result of the experiment that is recorded. An outcome is a single observation from an experiment, and we define an event as a set, or group, of outcomes. Most often we use \(o_{i}\) to denote an outcome and \(E_{i}\) to denote an event.

The sample space, event, and outcome are all potential results from an experiment. When we conduct an experiment we will generate an outcome from our sample space and call this realized outcome a data point.


Example Flipping a coin:
Consider the experiment of flipping a coin and recording whether the coin lands heads or tails side up.
We can define a sample space \(\mathcal{G} = \{H,T\}\) where \(H\) is an outcome that represents the coin landing heads up and \(T\) represents tails up. The sample space includes all the possible events we wish to record, either the coin lands heads or it lands tails.

Up until this point we have structured our experiment, but we have not generated data. We flip the coin and the coin lands tails side up. Now we have performed an experiment and generated the data point \(T\). We flip again and record a heads. \(H\) is the second data point, and so on.


Now suppose that we conduct an experiment with the same sample space \((\mathcal{G})\) a number \(N\) times and with each experiment we record a data point. A tuple (i.e. ordered list) of data points \(d\) is called a data set \(\mathcal{D} = (d_{1}, d_{2}, d_{3}, \cdots, d_{N})\) where \(d_{i}\) is the data point generated from the \(i^\text{th}\) experiment. We say that we have drawn or that we have sampled a data set \(\mathcal{D}\). Further, data points \((d)\) are often called realized outcomes because they are no longer in a set of potential possibilities but are now determined items.

A data set \(\mathcal{D}\) can be unwieldy depending on the number of data points, the complexity of the sample space, or both. A data frame is one way to organize a data set. A data frame \(\mathcal{F}\) is a table where each data point \(d\) in a dataset \(\mathcal{D}\) is represented as a row in the table. Each data point may contains multiple pieces of information. That is, each data point may itself be a tuple. In this case, then a separate column is created for each position in the tuple.


Example Human predictions of infectious diseases:
Suppose we design an experiment to collect from humans predictions, two weeks ahead from the time of our experiment, of the number of incident cases and incident deaths at the US national level of COVID-19 (Source). We decide to collect from each human whether they are an expert in the modeling of infectious disease, a prediction of incident cases, and a prediction of incident deaths. We draw a data set \(\mathcal{D}\) of 50 human judgment predictions. We can organize this data set into a data frame:

Expert

Prediction of cases

Prediction of deaths

Yes

145

52

No

215

34

Yes

524

48

Yes

265

95

No

354

35

Table: Example data frame \(\mathcal{F}\) built from a data set \(\mathcal{D}\) that contains 5 data points where each data point is a tuple of length three.


Above, the first data point is \((\text{Yes},145,52)\), the second data point is \((\text{No}, 215, 34)\), and so on until the last data point \((\text{No}, 354, 35)\). A data frame can also include informative information for others such as labels for each column.

Pandas#

The pandas module in Python is, by far, the most used set of tools for interacting with data, data points, and data frames. The documentation for Pandas is available here = link.

Pandas allows a structured way to import, organize, access, and compute with data frames. But first, we need to discuss the fundemental object in pandas—the Series.

Series#

A Series is (1) a list of items plus (2) an index, a list of string values that are associated with each item in (1).

The typical may to define a Series is by called pd.Series and inputting a list and index. For example, the first two data points of out above coin flip experiment were tails and then heads. Lets assign “tails” the value 0 and “heads” the value 1 to make this more numerically friendly.

import pandas as pd                                     #<--Import pandas (only needed once)
coin_flips = pd.Series([0,1], index=["flip1", "flip2"]) #<--Create a Series
print(coin_flips)                                       # Print this out so we can see what this object looks like
flip1    0
flip2    1
dtype: int64

We see that a series object was created where all the values are integers. This is a rule for Series, they cannot be “mixed” type such as character and integer or integer and floats (decimals values).

The index for our series is displayed on the left side. We can access items in a series using the index like this.

coin_flips.get("flip1")
np.int64(0)

or like this

coin_flips["flip1"]
np.int64(0)

Like most objects in Python, a series is a type of dictionary. The “keys” of the dictionary are the index values and the “values” of the dictionary are the items in the list. In fact, we can build a series from a dictionary.

coin_flips = pd.Series({"flip1":0,"flip2":1})
coin_flips
flip1    0
flip2    1
dtype: int64

Finally, it should be noted that pandas Series objects can, for the most part, be treated the same as numpy arrays. Series support vectorized operations. That is, adding/multiplying/dividing/subtracting a constant \(c\) applys to all items in the series.

coin_flips+2 #<--vectorized addition
flip1    2
flip2    3
dtype: int64

Adding/multiplying/dividing/subtracting two vectors is computed elementwise

coin_flips + (2* coin_flips-1)
flip1   -1
flip2    2
dtype: int64

and applying any function to the series applies this function to each item in the series.

import numpy as np
np.sqrt(np.exp(coin_flips+1))
flip1    1.648721
flip2    2.718282
dtype: float64

Data Frames#

In pandas, data frames are organized the same as our above definition for a data frame. Rows denote observations and each column corresponds to one piece of information about each outcome/observation.

To build a data frame in python from scratch, you can (1) define a dictionary such that the each key is a string that corresponds to a column name and each value is a list where the first item in the list corresponds to information about the first outcome (or observation), the second item in the list corresponds to information about the second observation, and so on. An example is helpful here.

Lets build the above data frame of human judgment predictions of COVID-19. We need three columns: expert, cases, deaths. The expert column has the information (Yes, No, Yes, Yes, No). The cases column has the information (145, 215, 524, 265, 35) and so on.

import pandas as pd
human_judgment_predictions = {"expert"  :["Yes","No","Yes","Yes","No"]
                              , "cases" :[145  ,215 ,524  ,265  ,354]
                              , "deaths":[52   ,34  ,48   ,95   ,35]}
human_judgment_predictions_data_frame = pd.DataFrame(human_judgment_predictions) #<--This converts our dict to a pandas data frame 
print(human_judgment_predictions_data_frame)
  expert  cases  deaths
0    Yes    145      52
1     No    215      34
2    Yes    524      48
3    Yes    265      95
4     No    354      35

In many instances we will import a csv, excel, or similar type of tabular data into a python pandas data frame. To load in a pandas data frame from a csv file, we can use the pd.read_csv. For an excel file we can use pd.read_excel. There are many “read” options in pandas.


Example Billion-Dollar Weather and Climate Disasters: US National Centers for Environmental Information maintains an inventory of the most costly such disasters in the US.

To load in the data called "events-US-1980-2024.csv" that documents these disasters from 1980 to present, we can write

disasters = pd.read_csv("events-US-1980-2024.csv")
disasters

#--Not important for this lesson. (Lets replace the TBD in these cells with the value NAN)
import numpy as np
disasters = disasters.replace("TBD",np.nan)
disasters["Unadjusted Cost"] = disasters["Unadjusted Cost"].astype(float)

Selecting columns#

To select a column from a dataframe \(D\), we call the dataframe and use brackets to call the column we want to select. Like this.

disasters["Disaster"]
0              Flooding
1      Tropical Cyclone
2               Drought
3                Freeze
4          Severe Storm
             ...       
395    Tropical Cyclone
396        Severe Storm
397    Tropical Cyclone
398    Tropical Cyclone
399    Tropical Cyclone
Name: Disaster, Length: 400, dtype: object

Note that the column name is wrapped in quotes, its a string. We can select multiple columns by providing a list of strings.

Below we will select the two columns “Disaster” and “Name” (the name of the disaster).

disasters[["Disaster","Name"]]
Disaster Name
0 Flooding Southern Severe Storms and Flooding (April 1980)
1 Tropical Cyclone Hurricane Allen (August 1980)
2 Drought Central/Eastern Drought/Heat Wave (Summer-Fall...
3 Freeze Florida Freeze (January 1981)
4 Severe Storm Severe Storms, Flash Floods, Hail, Tornadoes (...
... ... ...
395 Tropical Cyclone Hurricane Beryl (July 2024)
396 Severe Storm Central and Eastern Tornado Outbreak and Sever...
397 Tropical Cyclone Hurricane Debby (August 2024)
398 Tropical Cyclone Hurricane Helene (September 2024)
399 Tropical Cyclone Hurricane Milton (August 2024)

400 rows × 2 columns

Selecting rows#

The first way to select rows is by slicing. We can slice rows by providing a range of rows to select from beginning:end. For example, if we want to select the 3rd, 4th, up to the 9th row, we can write

disasters[3:10]
Name Disaster Begin Date End Date CPI-Adjusted Cost Unadjusted Cost Deaths
3 Florida Freeze (January 1981) Freeze 19810112 19810114 2070.6 572.0 0
4 Severe Storms, Flash Floods, Hail, Tornadoes (... Severe Storm 19810505 19810510 1405.2 401.4 20
5 Midwest/Southeast/Northeast Winter Storm, Cold... Winter Storm 19820108 19820116 2211.1 662.0 85
6 Midwest/Plains/Southeast Tornadoes (April 1982) Severe Storm 19820402 19820404 1599.5 483.2 33
7 Severe Storms (June 1982) Severe Storm 19820531 19820610 1574.2 479.9 30
8 Gulf States Storms and Flooding (December 1982... Flooding 19821201 19830115 4930.7 1536.1 45
9 Western Storms and Flooding (December 1982-Mar... Flooding 19821213 19830331 4813.8 1499.6 50

More often than not, you will use the .loc attribute to select rows. You can select rows based on their index name

disasters.loc[4]
Name                 Severe Storms, Flash Floods, Hail, Tornadoes (...
Disaster                                                  Severe Storm
Begin Date                                                    19810505
End Date                                                      19810510
CPI-Adjusted Cost                                               1405.2
Unadjusted Cost                                                  401.4
Deaths                                                              20
Name: 4, dtype: object

and, even more important, you can select rows via boolean indexing. Boolean indexing supplies the data frame with a list that has the same length as the number of rows of the data frame that contains the values True and False. The rows corresponding to True values are returned.

Creating boolean lists based on our data frame is easy. For example, suppose we cant to identify (with True and False) the disasters that results in more than 20 deaths. We can write the following expression

disasters["Deaths"] > 20
0      False
1      False
2       True
3      False
4      False
       ...  
395     True
396    False
397    False
398     True
399     True
Name: Deaths, Length: 400, dtype: bool

This Boolean list can be used to select all the rows with more than 20 deaths like

disasters.loc[disasters["Deaths"] > 20]
Name Disaster Begin Date End Date CPI-Adjusted Cost Unadjusted Cost Deaths
2 Central/Eastern Drought/Heat Wave (Summer-Fall... Drought 19800601 19801130 40480.8 10020.0 1260
5 Midwest/Southeast/Northeast Winter Storm, Cold... Winter Storm 19820108 19820116 2211.1 662.0 85
6 Midwest/Plains/Southeast Tornadoes (April 1982) Severe Storm 19820402 19820404 1599.5 483.2 33
7 Severe Storms (June 1982) Severe Storm 19820531 19820610 1574.2 479.9 30
8 Gulf States Storms and Flooding (December 1982... Flooding 19821201 19830115 4930.7 1536.1 45
... ... ... ... ... ... ... ...
374 Southern/Midwestern Drought and Heatwave (Spri... Drought 20230401 20230930 14786.7 14082.0 247
378 Central, Southern, Northeastern Winter Storm a... Winter Storm 20240114 20240118 1948.3 1910.0 41
395 Hurricane Beryl (July 2024) Tropical Cyclone 20240708 20240708 7219 7219.0 45
398 Hurricane Helene (September 2024) Tropical Cyclone 20240924 20240929 NaN NaN 225
399 Hurricane Milton (August 2024) Tropical Cyclone 20241009 20241010 NaN NaN 24

115 rows × 7 columns

We can select both rows and columns using the .loc attribute as well. We can use the syntax df.loc[ <row selection>, <column selection>  ] For example, lets select rows with disasters with an unadjusted cost of more than 20000 billion and for columns we’ll select the name and type of disaster.

disasters.loc[disasters["Unadjusted Cost"]>20000 , ["Name","Disaster"]]
Name Disaster
26 U.S. Drought/Heat Wave (Summer 1988) Drought
44 Hurricane Andrew (August 1992) Tropical Cyclone
50 Midwest Flooding (Summer 1993) Flooding
115 Hurricane Ivan (September 2004) Tropical Cyclone
119 Hurricane Katrina (August 2005) Tropical Cyclone
145 Hurricane Ike (September 2008) Tropical Cyclone
190 Hurricane Sandy (October 2012) Tropical Cyclone
192 U.S. Drought/Heat Wave (2012) Drought
253 Hurricane Harvey (August 2017) Tropical Cyclone
254 Hurricane Irma (September 2017) Tropical Cyclone
255 Hurricane Maria (September 2017) Tropical Cyclone
270 Hurricane Florence (September 2018) Tropical Cyclone
271 Hurricane Michael (October 2018) Tropical Cyclone
273 Western Wildfires, California Firestorm (Summe... Wildfire
303 Hurricane Laura (August 2020) Tropical Cyclone
311 Northwest, Central, Eastern Winter Storm and C... Winter Storm
324 Hurricane Ida (August 2021) Tropical Cyclone
343 Hurricane Ian (September 2022) Tropical Cyclone
347 Western/Central Drought and Heat Wave (2022) Drought

Adding new columns#

There are several different ways to add columns to a pandas data frame. The most straightforward way to add a column is to call the dataframe with a bracketed string and assign this the column values.

For example, suppose we want to add a column to the above disasters data frame. This new column is going to be the number of deaths minus the mean number of deaths as a result of the disaster.

disasters.assign( Flood = lambda x: x["Disaster"].str.contains("Flood") )
Name Disaster Begin Date End Date CPI-Adjusted Cost Unadjusted Cost Deaths Flood
0 Southern Severe Storms and Flooding (April 1980) Flooding 19800410 19800417 2742.3 706.8 7 True
1 Hurricane Allen (August 1980) Tropical Cyclone 19800807 19800811 2230.2 590.0 13 False
2 Central/Eastern Drought/Heat Wave (Summer-Fall... Drought 19800601 19801130 40480.8 10020.0 1260 False
3 Florida Freeze (January 1981) Freeze 19810112 19810114 2070.6 572.0 0 False
4 Severe Storms, Flash Floods, Hail, Tornadoes (... Severe Storm 19810505 19810510 1405.2 401.4 20 False
... ... ... ... ... ... ... ... ...
395 Hurricane Beryl (July 2024) Tropical Cyclone 20240708 20240708 7219 7219.0 45 False
396 Central and Eastern Tornado Outbreak and Sever... Severe Storm 20240713 20240716 2435 2435.0 2 False
397 Hurricane Debby (August 2024) Tropical Cyclone 20240805 20240809 2476 2476.0 10 False
398 Hurricane Helene (September 2024) Tropical Cyclone 20240924 20240929 NaN NaN 225 False
399 Hurricane Milton (August 2024) Tropical Cyclone 20241009 20241010 NaN NaN 24 False

400 rows × 8 columns

Its important to note that the assign function returns the entire dataframe. This means that if we want to add the column “Flood” to our original dataframe we need to write.

disasters = disasters.assign( Flood = lambda x: x["Disaster"].str.contains("Flood") )
disasters
Name Disaster Begin Date End Date CPI-Adjusted Cost Unadjusted Cost Deaths Flood
0 Southern Severe Storms and Flooding (April 1980) Flooding 19800410 19800417 2742.3 706.8 7 True
1 Hurricane Allen (August 1980) Tropical Cyclone 19800807 19800811 2230.2 590.0 13 False
2 Central/Eastern Drought/Heat Wave (Summer-Fall... Drought 19800601 19801130 40480.8 10020.0 1260 False
3 Florida Freeze (January 1981) Freeze 19810112 19810114 2070.6 572.0 0 False
4 Severe Storms, Flash Floods, Hail, Tornadoes (... Severe Storm 19810505 19810510 1405.2 401.4 20 False
... ... ... ... ... ... ... ... ...
395 Hurricane Beryl (July 2024) Tropical Cyclone 20240708 20240708 7219 7219.0 45 False
396 Central and Eastern Tornado Outbreak and Sever... Severe Storm 20240713 20240716 2435 2435.0 2 False
397 Hurricane Debby (August 2024) Tropical Cyclone 20240805 20240809 2476 2476.0 10 False
398 Hurricane Helene (September 2024) Tropical Cyclone 20240924 20240929 NaN NaN 225 False
399 Hurricane Milton (August 2024) Tropical Cyclone 20241009 20241010 NaN NaN 24 False

400 rows × 8 columns

In the above we used a special attribute, str in pandas. We’ll learn more about how to deal with strings in next week’s lesson.

Renaming columns#

To rename a column in our pandas dataframe, we can use the following snytax.

d = d.rename(columns = {"old name":"new name"})

where we include inside the rename function a dictionary with “keys” that equal the old column names that we want to rename and “values” the new names associated with each old column.

For example, if we wanted to change the column name “Begin Date” to “Begin” we could write

disasters = disasters.rename(columns = {"Begin Date":"Begin"})

List out all column names#

We can list out all columns names using the column attribute.

disasters.columns
Index(['Name', 'Disaster', 'Begin', 'End Date', 'CPI-Adjusted Cost',
       'Unadjusted Cost', 'Deaths', 'Flood'],
      dtype='object')

Listing out all the columns gives the researcher a quick way to find out what pieces of information were collected for each observation in the dataset.

Applying a function over rows#

For many tasks, you may want to apply a function to each observation. For example, for our disasters dataframe, we may want to remove the parenthetical from each of the Disaster names.

In Pandas, to iterate over rows we have two options.

Option one is to use the iterrows() method.

Using iterrows()#

for index, row in disasters.iterrows():
    #<---------code to operate on each row>
    break #<-- not needed, only to illustrate what is returned from iterrows()

print("Index------")
print(index)

print("Row--------")
print(row)
Index------
0
Row--------
Name                 Southern Severe Storms and Flooding (April 1980)
Disaster                                                     Flooding
Begin                                                        19800410
End Date                                                     19800417
CPI-Adjusted Cost                                              2742.3
Unadjusted Cost                                                 706.8
Deaths                                                              7
Flood                                                            True
Name: 0, dtype: object

We see that when we call iterrows in a for loop, this method returns for each row in the dataframe, the index name and the row information. The row is a Series object. For example, we can access the Name of the disaster as

row.Name
'Southern Severe Storms and Flooding (April 1980)'

To remove the parentetical from each name, we can use split to split the string on the symbol “(“. This will break our string into two pieces. We will need the first one (ie the name).

row.Name.split("(")
['Southern Severe Storms and Flooding ', 'April 1980)']
name,parenthetical = row.Name.split("(")
name = name.strip() #<--this command removes any white space before or after a string 

Now that we arrived at our answer for one row, lets compute on every row and store it in a list.

names = [] #empty list 
for index, row in disasters.iterrows():
    name,parenthetical = row.Name.split("(")
    name = name.strip() 
    names.append(name)

Now that we have a list of all the names, lets add this to our dataframe.

disasters["name"] = names
disasters
Name Disaster Begin End Date CPI-Adjusted Cost Unadjusted Cost Deaths Flood name
0 Southern Severe Storms and Flooding (April 1980) Flooding 19800410 19800417 2742.3 706.8 7 True Southern Severe Storms and Flooding
1 Hurricane Allen (August 1980) Tropical Cyclone 19800807 19800811 2230.2 590.0 13 False Hurricane Allen
2 Central/Eastern Drought/Heat Wave (Summer-Fall... Drought 19800601 19801130 40480.8 10020.0 1260 False Central/Eastern Drought/Heat Wave
3 Florida Freeze (January 1981) Freeze 19810112 19810114 2070.6 572.0 0 False Florida Freeze
4 Severe Storms, Flash Floods, Hail, Tornadoes (... Severe Storm 19810505 19810510 1405.2 401.4 20 False Severe Storms, Flash Floods, Hail, Tornadoes
... ... ... ... ... ... ... ... ... ...
395 Hurricane Beryl (July 2024) Tropical Cyclone 20240708 20240708 7219 7219.0 45 False Hurricane Beryl
396 Central and Eastern Tornado Outbreak and Sever... Severe Storm 20240713 20240716 2435 2435.0 2 False Central and Eastern Tornado Outbreak and Sever...
397 Hurricane Debby (August 2024) Tropical Cyclone 20240805 20240809 2476 2476.0 10 False Hurricane Debby
398 Hurricane Helene (September 2024) Tropical Cyclone 20240924 20240929 NaN NaN 225 False Hurricane Helene
399 Hurricane Milton (August 2024) Tropical Cyclone 20241009 20241010 NaN NaN 24 False Hurricane Milton

400 rows × 9 columns

Using apply#

Another option is to define a function (func) that takes as input a row and then outputs a python object (anything really). Then, use the apply method to input each row in our dataframe to func and output the desired result.

For example, we could have decided to write a function that inputs a row and returns a string representing the name of the disaters without parenthetical.

def from_fullname_2_name(row):
    name,parenthetical = row.Name.split("(")
    name = name.strip() 
    return name 
disasters.apply(from_fullname_2_name,1) #<-- note the option 1. This tells apply to use as input rows. The value 0 would use as input columns.
0                    Southern Severe Storms and Flooding
1                                        Hurricane Allen
2                      Central/Eastern Drought/Heat Wave
3                                         Florida Freeze
4           Severe Storms, Flash Floods, Hail, Tornadoes
                             ...                        
395                                      Hurricane Beryl
396    Central and Eastern Tornado Outbreak and Sever...
397                                      Hurricane Debby
398                                     Hurricane Helene
399                                     Hurricane Milton
Length: 400, dtype: object