Chapter 02: Data, Dataframes, and Pandas#
Experiments, Outcomes, Datapoints, and Dataframes#
We define a set as a collection of items, sometimes called elements. A set is typically given a capital letter (for example \(A\)) and the elements are included inside curly braces.
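For example, \(A = \{1, 3, 5\}\) is a set whose elements are the numbers 1, 3, and 5.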
We use \(\mathcal{G}\) to define the set of all possible outcomes from an experiment and call this set the sample space. The term “experiment” has a broad meaning. An experiment can mean everything from a randomized controlled trial to an observational study. Experiments are what produce observations that we collect and try to characterize. Here an experiment is the process that generates outcomes.
An outcome is defined as an element of the sample space, some result of the experiment that is recorded. An outcome is a single observation from an experiment, and we define an event as a set, or group, of outcomes. Most often we use \(o_{i}\) to denote an outcome and \(E_{i}\) to denote an event.
The sample space, events, and outcomes all describe potential results from an experiment. When we conduct an experiment we will generate an outcome from our sample space and call this realized outcome a data point.
Example Flipping a coin:
Consider the experiment of flipping a coin and recording whether the coin lands heads or tails side up.
We can define a sample space \(\mathcal{G} = \{H,T\}\) where \(H\) is an outcome that represents the coin landing heads up and \(T\) represents tails up.
The sample space includes all the possible outcomes we wish to record: either the coin lands heads or it lands tails.
Up until this point we have structured our experiment, but we have not generated data. We flip the coin and the coin lands tails side up. Now we have performed an experiment and generated the data point \(T\). We flip again and record a heads. \(H\) is the second data point, and so on.
Now suppose that we conduct an experiment with the same sample space \((\mathcal{G})\) a number \(N\) times and with each experiment we record a data point. A tuple (i.e. ordered list) of data points \(d\) is called a data set \(\mathcal{D} = (d_{1}, d_{2}, d_{3}, \cdots, d_{N})\) where \(d_{i}\) is the data point generated from the \(i^\text{th}\) experiment. We say that we have drawn or that we have sampled a data set \(\mathcal{D}\). Further, data points \((d)\) are often called realized outcomes because they are no longer in a set of potential possibilities but are now determined items.
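For example, suppose we flip the coin from the example above \(N = 4\) times and record tails, heads, heads, tails. The data set is then \(\mathcal{D} = (T, H, H, T)\), where \(d_{1} = T\), \(d_{2} = H\), and so on.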
A data set \(\mathcal{D}\) can be unwieldy depending on the number of data points, the complexity of the sample space, or both. A data frame is one way to organize a data set. A data frame \(\mathcal{F}\) is a table where each data point \(d\) in a data set \(\mathcal{D}\) is represented as a row in the table. Each data point may contain multiple pieces of information. That is, each data point may itself be a tuple. In this case, a separate column is created for each position in the tuple.
Example Human predictions of infectious diseases:
Suppose we design an experiment to collect from humans predictions of the number of incident COVID-19 cases and incident deaths at the US national level, two weeks ahead of the time of our experiment (Source).
We decide to collect from each human whether they are an expert in the modeling of infectious disease, a prediction of incident cases, and a prediction of incident deaths.
We draw a data set \(\mathcal{D}\) of 50 human judgment predictions.
We can organize this data set into a data frame:
Expert | Prediction of cases | Prediction of deaths
---|---|---
Yes | 145 | 52
No | 215 | 34
Yes | 524 | 48
Yes | 265 | 95
No | 354 | 35
Table: Example data frame \(\mathcal{F}\) built from a data set \(\mathcal{D}\) that contains 5 data points where each data point is a tuple of length three.
Above, the first data point is \((\text{Yes},145,52)\), the second data point is \((\text{No}, 215, 34)\), and so on until the last data point \((\text{No}, 354, 35)\). A data frame can also include helpful information for others, such as labels for each column.
Pandas#
The pandas module in Python is, by far, the most widely used set of tools for interacting with data, data points, and data frames. The documentation for pandas is available online.
Pandas provides a structured way to import, organize, access, and compute with data frames. But first, we need to discuss the fundamental object in pandas: the Series.
Series#
A Series is (1) a list of items plus (2) an index, a list of labels (often strings) that are associated with each item in (1).
The typical way to define a Series is by calling pd.Series and passing in a list and an index.
For example, the first two data points of our coin flip experiment above were tails and then heads.
Let's assign "tails" the value 0 and "heads" the value 1 to make this more numerically friendly.
import pandas as pd #<--Import pandas (only needed once)
coin_flips = pd.Series([0,1], index=["flip1", "flip2"]) #<--Create a Series
print(coin_flips) # Print this out so we can see what this object looks like
flip1 0
flip2 1
dtype: int64
We see that a Series object was created where all the values are integers. This is a rule for Series: every Series has a single dtype, so we cannot freely mix types such as strings and integers or integers and floats (decimal values); pandas will coerce mixed values to a common type, often the generic object dtype.
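For example, here is a minimal sketch (the variable name mixed is our own) of what happens when we try to mix a string and an integer:

mixed = pd.Series(["tails", 1], index=["flip1", "flip2"]) #<--one string value and one integer value
print(mixed.dtype) #<--prints "object"; pandas coerced the values to the generic object dtype

Returning to the coin_flips series above: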
The index for our series is displayed on the left side. We can access items in a series using the index like this.
coin_flips.get("flip1")
np.int64(0)
or like this
coin_flips["flip1"]
np.int64(0)
A Series behaves much like a Python dictionary. The "keys" of the dictionary are the index values and the "values" of the dictionary are the items in the list. In fact, we can build a Series from a dictionary.
coin_flips = pd.Series({"flip1":0,"flip2":1})
coin_flips
flip1 0
flip2 1
dtype: int64
Finally, it should be noted that pandas Series objects can, for the most part, be treated the same as numpy arrays. Series support vectorized operations. That is, adding/multiplying/dividing/subtracting a constant \(c\) applies to all items in the series.
coin_flips+2 #<--vectorized addition
flip1 2
flip2 3
dtype: int64
Adding/multiplying/dividing/subtracting two Series is computed elementwise (items are matched by their index)
coin_flips + (2* coin_flips-1)
flip1 -1
flip2 2
dtype: int64
and applying a numpy function to the series applies that function to each item in the series.
import numpy as np
np.sqrt(np.exp(coin_flips+1))
flip1 1.648721
flip2 2.718282
dtype: float64
Data Frames#
In pandas, data frames are organized the same as our above definition for a data frame. Rows denote observations and each column corresponds to one piece of information about each outcome/observation.
To build a data frame in Python from scratch, you can (1) define a dictionary such that each key is a string that corresponds to a column name and each value is a list where the first item in the list corresponds to information about the first outcome (or observation), the second item in the list corresponds to information about the second observation, and so on, and then (2) pass this dictionary to pd.DataFrame. An example is helpful here.
Let's build the above data frame of human judgment predictions of COVID-19. We need three columns: expert, cases, deaths. The expert column has the information (Yes, No, Yes, Yes, No), the cases column has the information (145, 215, 524, 265, 354), and so on.
import pandas as pd
human_judgment_predictions = {"expert" :["Yes","No","Yes","Yes","No"]
, "cases" :[145 ,215 ,524 ,265 ,354]
, "deaths":[52 ,34 ,48 ,95 ,35]}
human_judgment_predictions_data_frame = pd.DataFrame(human_judgment_predictions) #<--This converts our dict to a pandas data frame
print(human_judgment_predictions_data_frame)
expert cases deaths
0 Yes 145 52
1 No 215 34
2 Yes 524 48
3 Yes 265 95
4 No 354 35
In many instances we will import a csv, excel, or similar type of tabular data into a python pandas data frame.
To load in a pandas data frame from a csv file, we can use pd.read_csv. For an excel file, we can use pd.read_excel. There are many “read” options in pandas.
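As a sketch (the filenames here are hypothetical), reading other formats looks very similar:

predictions = pd.read_excel("predictions.xlsx") #<--read a hypothetical excel workbook
records = pd.read_json("records.json") #<--read a hypothetical json file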
Example Billion-Dollar Weather and Climate Disasters: The US National Centers for Environmental Information maintains an inventory of the most costly such disasters in the US.
To load in the data called "events-US-1980-2024.csv", which documents these disasters from 1980 to present, we can write
disasters = pd.read_csv("events-US-1980-2024.csv")
disasters
#--Not important for this lesson. (Let's replace the TBD values in these cells with NaN.)
import numpy as np
disasters = disasters.replace("TBD",np.nan)
disasters["Unadjusted Cost"] = disasters["Unadjusted Cost"].astype(float)
Selecting columns#
To select a column from a dataframe, we call the dataframe and use brackets containing the name of the column we want to select, like this.
disasters["Disaster"]
0 Flooding
1 Tropical Cyclone
2 Drought
3 Freeze
4 Severe Storm
...
395 Tropical Cyclone
396 Severe Storm
397 Tropical Cyclone
398 Tropical Cyclone
399 Tropical Cyclone
Name: Disaster, Length: 400, dtype: object
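A selected column is itself a Series, so everything from the Series section applies to it. For example (a sketch; output not shown):

deaths = disasters["Deaths"] #<--a single column is a Series
print(type(deaths)) #<--prints <class 'pandas.core.series.Series'>
print(deaths.mean()) #<--vectorized operations such as the mean work on a column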
Note that the column name is wrapped in quotes; it's a string. We can select multiple columns by providing a list of strings.
Below we will select the two columns “Disaster” and “Name” (the name of the disaster).
disasters[["Disaster","Name"]]
 | Disaster | Name |
---|---|---|
0 | Flooding | Southern Severe Storms and Flooding (April 1980) |
1 | Tropical Cyclone | Hurricane Allen (August 1980) |
2 | Drought | Central/Eastern Drought/Heat Wave (Summer-Fall... |
3 | Freeze | Florida Freeze (January 1981) |
4 | Severe Storm | Severe Storms, Flash Floods, Hail, Tornadoes (... |
... | ... | ... |
395 | Tropical Cyclone | Hurricane Beryl (July 2024) |
396 | Severe Storm | Central and Eastern Tornado Outbreak and Sever... |
397 | Tropical Cyclone | Hurricane Debby (August 2024) |
398 | Tropical Cyclone | Hurricane Helene (September 2024) |
399 | Tropical Cyclone | Hurricane Milton (August 2024) |
400 rows × 2 columns
Selecting rows#
The first way to select rows is by slicing. We can slice rows by providing a range of rows to select, beginning:end. For example, if we want to select the rows at positions 3, 4, and so on up to 9 (the end of the slice, 10, is not included), we can write
disasters[3:10]
 | Name | Disaster | Begin Date | End Date | CPI-Adjusted Cost | Unadjusted Cost | Deaths |
---|---|---|---|---|---|---|---|
3 | Florida Freeze (January 1981) | Freeze | 19810112 | 19810114 | 2070.6 | 572.0 | 0 |
4 | Severe Storms, Flash Floods, Hail, Tornadoes (... | Severe Storm | 19810505 | 19810510 | 1405.2 | 401.4 | 20 |
5 | Midwest/Southeast/Northeast Winter Storm, Cold... | Winter Storm | 19820108 | 19820116 | 2211.1 | 662.0 | 85 |
6 | Midwest/Plains/Southeast Tornadoes (April 1982) | Severe Storm | 19820402 | 19820404 | 1599.5 | 483.2 | 33 |
7 | Severe Storms (June 1982) | Severe Storm | 19820531 | 19820610 | 1574.2 | 479.9 | 30 |
8 | Gulf States Storms and Flooding (December 1982... | Flooding | 19821201 | 19830115 | 4930.7 | 1536.1 | 45 |
9 | Western Storms and Flooding (December 1982-Mar... | Flooding | 19821213 | 19830331 | 4813.8 | 1499.6 | 50 |
More often than not, you will use the .loc attribute to select rows.
You can select rows based on their index name
disasters.loc[4]
Name Severe Storms, Flash Floods, Hail, Tornadoes (...
Disaster Severe Storm
Begin Date 19810505
End Date 19810510
CPI-Adjusted Cost 1405.2
Unadjusted Cost 401.4
Deaths 20
Name: 4, dtype: object
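We can also give .loc a list of index names to select several rows at once. A sketch (output not shown):

disasters.loc[[0, 2, 4]] #<--select the rows with index names 0, 2, and 4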
Even more important, you can select rows via boolean indexing. Boolean indexing supplies the data frame with a list (or Series) of True and False values that has the same length as the number of rows in the data frame. The rows corresponding to True values are returned.
Creating boolean lists based on our data frame is easy. For example, suppose we want to identify (with True and False) the disasters that resulted in more than 20 deaths. We can write the following expression
disasters["Deaths"] > 20
0 False
1 False
2 True
3 False
4 False
...
395 True
396 False
397 False
398 True
399 True
Name: Deaths, Length: 400, dtype: bool
This Boolean list can be used to select all the rows with more than 20 deaths, like this
disasters.loc[disasters["Deaths"] > 20]
 | Name | Disaster | Begin Date | End Date | CPI-Adjusted Cost | Unadjusted Cost | Deaths |
---|---|---|---|---|---|---|---|
2 | Central/Eastern Drought/Heat Wave (Summer-Fall... | Drought | 19800601 | 19801130 | 40480.8 | 10020.0 | 1260 |
5 | Midwest/Southeast/Northeast Winter Storm, Cold... | Winter Storm | 19820108 | 19820116 | 2211.1 | 662.0 | 85 |
6 | Midwest/Plains/Southeast Tornadoes (April 1982) | Severe Storm | 19820402 | 19820404 | 1599.5 | 483.2 | 33 |
7 | Severe Storms (June 1982) | Severe Storm | 19820531 | 19820610 | 1574.2 | 479.9 | 30 |
8 | Gulf States Storms and Flooding (December 1982... | Flooding | 19821201 | 19830115 | 4930.7 | 1536.1 | 45 |
... | ... | ... | ... | ... | ... | ... | ... |
374 | Southern/Midwestern Drought and Heatwave (Spri... | Drought | 20230401 | 20230930 | 14786.7 | 14082.0 | 247 |
378 | Central, Southern, Northeastern Winter Storm a... | Winter Storm | 20240114 | 20240118 | 1948.3 | 1910.0 | 41 |
395 | Hurricane Beryl (July 2024) | Tropical Cyclone | 20240708 | 20240708 | 7219 | 7219.0 | 45 |
398 | Hurricane Helene (September 2024) | Tropical Cyclone | 20240924 | 20240929 | NaN | NaN | 225 |
399 | Hurricane Milton (August 2024) | Tropical Cyclone | 20241009 | 20241010 | NaN | NaN | 24 |
115 rows × 7 columns
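Boolean conditions can be combined with & (and) and | (or), with each condition wrapped in parentheses. For example, a sketch that selects flooding disasters with more than 20 deaths (output not shown):

disasters.loc[(disasters["Deaths"] > 20) & (disasters["Disaster"] == "Flooding")]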
We can select both rows and columns using the .loc attribute as well. The syntax is df.loc[ <row selection>, <column selection> ].
For example, let's select rows for disasters with an unadjusted cost of more than 20,000 (the costs in this file are recorded in millions of dollars, so this is more than $20 billion), and for columns we'll select the name and type of disaster.
disasters.loc[disasters["Unadjusted Cost"]>20000 , ["Name","Disaster"]]
 | Name | Disaster |
---|---|---|
26 | U.S. Drought/Heat Wave (Summer 1988) | Drought |
44 | Hurricane Andrew (August 1992) | Tropical Cyclone |
50 | Midwest Flooding (Summer 1993) | Flooding |
115 | Hurricane Ivan (September 2004) | Tropical Cyclone |
119 | Hurricane Katrina (August 2005) | Tropical Cyclone |
145 | Hurricane Ike (September 2008) | Tropical Cyclone |
190 | Hurricane Sandy (October 2012) | Tropical Cyclone |
192 | U.S. Drought/Heat Wave (2012) | Drought |
253 | Hurricane Harvey (August 2017) | Tropical Cyclone |
254 | Hurricane Irma (September 2017) | Tropical Cyclone |
255 | Hurricane Maria (September 2017) | Tropical Cyclone |
270 | Hurricane Florence (September 2018) | Tropical Cyclone |
271 | Hurricane Michael (October 2018) | Tropical Cyclone |
273 | Western Wildfires, California Firestorm (Summe... | Wildfire |
303 | Hurricane Laura (August 2020) | Tropical Cyclone |
311 | Northwest, Central, Eastern Winter Storm and C... | Winter Storm |
324 | Hurricane Ida (August 2021) | Tropical Cyclone |
343 | Hurricane Ian (September 2022) | Tropical Cyclone |
347 | Western/Central Drought and Heat Wave (2022) | Drought |
Adding new columns#
There are several different ways to add columns to a pandas data frame. The most straightforward way is to call the dataframe with a bracketed string containing the new column name and assign the column values to it.
For example, suppose we want to add a column to the above disasters data frame that holds the number of deaths minus the mean number of deaths across all disasters.
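A minimal sketch of this bracket-assignment approach (the column name "Deaths vs Mean" is our own choice, and the assignment is left commented out so the running example is unchanged):

deaths_vs_mean = disasters["Deaths"] - disasters["Deaths"].mean() #<--vectorized: subtract the column mean from every value
# disasters["Deaths vs Mean"] = deaths_vs_mean #<--bracket assignment would add this as a new column

A second way to add a column is the assign method. Below we use assign to add a Boolean column, Flood, that records whether the word "Flood" appears in the type of each disaster.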
disasters.assign( Flood = lambda x: x["Disaster"].str.contains("Flood") )
 | Name | Disaster | Begin Date | End Date | CPI-Adjusted Cost | Unadjusted Cost | Deaths | Flood |
---|---|---|---|---|---|---|---|---|
0 | Southern Severe Storms and Flooding (April 1980) | Flooding | 19800410 | 19800417 | 2742.3 | 706.8 | 7 | True |
1 | Hurricane Allen (August 1980) | Tropical Cyclone | 19800807 | 19800811 | 2230.2 | 590.0 | 13 | False |
2 | Central/Eastern Drought/Heat Wave (Summer-Fall... | Drought | 19800601 | 19801130 | 40480.8 | 10020.0 | 1260 | False |
3 | Florida Freeze (January 1981) | Freeze | 19810112 | 19810114 | 2070.6 | 572.0 | 0 | False |
4 | Severe Storms, Flash Floods, Hail, Tornadoes (... | Severe Storm | 19810505 | 19810510 | 1405.2 | 401.4 | 20 | False |
... | ... | ... | ... | ... | ... | ... | ... | ... |
395 | Hurricane Beryl (July 2024) | Tropical Cyclone | 20240708 | 20240708 | 7219 | 7219.0 | 45 | False |
396 | Central and Eastern Tornado Outbreak and Sever... | Severe Storm | 20240713 | 20240716 | 2435 | 2435.0 | 2 | False |
397 | Hurricane Debby (August 2024) | Tropical Cyclone | 20240805 | 20240809 | 2476 | 2476.0 | 10 | False |
398 | Hurricane Helene (September 2024) | Tropical Cyclone | 20240924 | 20240929 | NaN | NaN | 225 | False |
399 | Hurricane Milton (August 2024) | Tropical Cyclone | 20241009 | 20241010 | NaN | NaN | 24 | False |
400 rows × 8 columns
It's important to note that the assign function returns a new dataframe rather than modifying the original in place.
This means that if we want to add the column "Flood" to our original dataframe, we need to write
disasters = disasters.assign( Flood = lambda x: x["Disaster"].str.contains("Flood") )
disasters
 | Name | Disaster | Begin Date | End Date | CPI-Adjusted Cost | Unadjusted Cost | Deaths | Flood |
---|---|---|---|---|---|---|---|---|
0 | Southern Severe Storms and Flooding (April 1980) | Flooding | 19800410 | 19800417 | 2742.3 | 706.8 | 7 | True |
1 | Hurricane Allen (August 1980) | Tropical Cyclone | 19800807 | 19800811 | 2230.2 | 590.0 | 13 | False |
2 | Central/Eastern Drought/Heat Wave (Summer-Fall... | Drought | 19800601 | 19801130 | 40480.8 | 10020.0 | 1260 | False |
3 | Florida Freeze (January 1981) | Freeze | 19810112 | 19810114 | 2070.6 | 572.0 | 0 | False |
4 | Severe Storms, Flash Floods, Hail, Tornadoes (... | Severe Storm | 19810505 | 19810510 | 1405.2 | 401.4 | 20 | False |
... | ... | ... | ... | ... | ... | ... | ... | ... |
395 | Hurricane Beryl (July 2024) | Tropical Cyclone | 20240708 | 20240708 | 7219 | 7219.0 | 45 | False |
396 | Central and Eastern Tornado Outbreak and Sever... | Severe Storm | 20240713 | 20240716 | 2435 | 2435.0 | 2 | False |
397 | Hurricane Debby (August 2024) | Tropical Cyclone | 20240805 | 20240809 | 2476 | 2476.0 | 10 | False |
398 | Hurricane Helene (September 2024) | Tropical Cyclone | 20240924 | 20240929 | NaN | NaN | 225 | False |
399 | Hurricane Milton (August 2024) | Tropical Cyclone | 20241009 | 20241010 | NaN | NaN | 24 | False |
400 rows × 8 columns
In the above we used a special pandas attribute, str.
We’ll learn more about how to deal with strings in next week’s lesson.
Renaming columns#
To rename a column in our pandas dataframe, we can use the following syntax:
d = d.rename(columns = {"old name":"new name"})
where we pass the rename function a dictionary whose "keys" are the old column names we want to rename and whose "values" are the new names associated with each old name.
For example, if we wanted to change the column name “Begin Date” to “Begin” we could write
disasters = disasters.rename(columns = {"Begin Date":"Begin"})
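The dictionary can contain several old/new pairs, so multiple columns can be renamed in a single call. A sketch (the new names here are our own choices, and the result is not assigned back to disasters):

disasters.rename(columns = {"End Date":"End", "CPI-Adjusted Cost":"Adjusted Cost"}) #<--returns a renamed copy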
List out all column names#
We can list out all of the column names using the columns attribute.
disasters.columns
Index(['Name', 'Disaster', 'Begin', 'End Date', 'CPI-Adjusted Cost',
'Unadjusted Cost', 'Deaths', 'Flood'],
dtype='object')
Listing out all the columns gives the researcher a quick way to find out what pieces of information were collected for each observation in the dataset.
Applying a function over rows#
For many tasks, you may want to apply a function to each observation. For example, for our disasters dataframe, we may want to remove the parenthetical from each of the disaster names (the Name column).
In Pandas, to iterate over rows we have two options.
Option one is to use the iterrows() method.
Using iterrows()#
for index, row in disasters.iterrows():
    #<---------code to operate on each row>
    break #<-- not needed, only to illustrate what is returned from iterrows()

print("Index------")
print(index)
print("Row--------")
print(row)
Index------
0
Row--------
Name Southern Severe Storms and Flooding (April 1980)
Disaster Flooding
Begin 19800410
End Date 19800417
CPI-Adjusted Cost 2742.3
Unadjusted Cost 706.8
Deaths 7
Flood True
Name: 0, dtype: object
We see that when we call iterrows in a for loop, this method returns, for each row in the dataframe, the index name and the row information. The row is a Series object.
For example, we can access the Name of the disaster as
row.Name
'Southern Severe Storms and Flooding (April 1980)'
To remove the parenthetical from each name, we can use split to split the string on the symbol "(". This will break our string into two pieces; we will need the first one (i.e. the name).
row.Name.split("(")
['Southern Severe Storms and Flooding ', 'April 1980)']
name,parenthetical = row.Name.split("(")
name = name.strip() #<--this command removes any white space before or after a string
Now that we have arrived at our answer for one row, let's compute this for every row and store the results in a list.
names = [] #empty list
for index, row in disasters.iterrows():
    name,parenthetical = row.Name.split("(")
    name = name.strip()
    names.append(name)
Now that we have a list of all the names, let's add it to our dataframe.
disasters["name"] = names
disasters
 | Name | Disaster | Begin | End Date | CPI-Adjusted Cost | Unadjusted Cost | Deaths | Flood | name |
---|---|---|---|---|---|---|---|---|---|
0 | Southern Severe Storms and Flooding (April 1980) | Flooding | 19800410 | 19800417 | 2742.3 | 706.8 | 7 | True | Southern Severe Storms and Flooding |
1 | Hurricane Allen (August 1980) | Tropical Cyclone | 19800807 | 19800811 | 2230.2 | 590.0 | 13 | False | Hurricane Allen |
2 | Central/Eastern Drought/Heat Wave (Summer-Fall... | Drought | 19800601 | 19801130 | 40480.8 | 10020.0 | 1260 | False | Central/Eastern Drought/Heat Wave |
3 | Florida Freeze (January 1981) | Freeze | 19810112 | 19810114 | 2070.6 | 572.0 | 0 | False | Florida Freeze |
4 | Severe Storms, Flash Floods, Hail, Tornadoes (... | Severe Storm | 19810505 | 19810510 | 1405.2 | 401.4 | 20 | False | Severe Storms, Flash Floods, Hail, Tornadoes |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
395 | Hurricane Beryl (July 2024) | Tropical Cyclone | 20240708 | 20240708 | 7219 | 7219.0 | 45 | False | Hurricane Beryl |
396 | Central and Eastern Tornado Outbreak and Sever... | Severe Storm | 20240713 | 20240716 | 2435 | 2435.0 | 2 | False | Central and Eastern Tornado Outbreak and Sever... |
397 | Hurricane Debby (August 2024) | Tropical Cyclone | 20240805 | 20240809 | 2476 | 2476.0 | 10 | False | Hurricane Debby |
398 | Hurricane Helene (September 2024) | Tropical Cyclone | 20240924 | 20240929 | NaN | NaN | 225 | False | Hurricane Helene |
399 | Hurricane Milton (August 2024) | Tropical Cyclone | 20241009 | 20241010 | NaN | NaN | 24 | False | Hurricane Milton |
400 rows × 9 columns
Using apply#
Another option is to define a function (func) that takes as input a row and outputs a python object (anything, really). Then, use the apply method to pass each row of our dataframe to func and collect the desired result.
For example, we could have decided to write a function that inputs a row and returns a string representing the name of the disaster without the parenthetical.
def from_fullname_2_name(row):
    name,parenthetical = row.Name.split("(")
    name = name.strip()
    return name
disasters.apply(from_fullname_2_name, 1) #<-- note the second argument, 1 (the axis). This tells apply to pass each row to the function as input. The value 0 would pass each column instead.
0 Southern Severe Storms and Flooding
1 Hurricane Allen
2 Central/Eastern Drought/Heat Wave
3 Florida Freeze
4 Severe Storms, Flash Floods, Hail, Tornadoes
...
395 Hurricane Beryl
396 Central and Eastern Tornado Outbreak and Sever...
397 Hurricane Debby
398 Hurricane Helene
399 Hurricane Milton
Length: 400, dtype: object
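As an aside, the same result can be produced without looping at all by using the vectorized str methods we saw earlier (a sketch, not the approach used above):

disasters["Name"].str.split("(").str[0].str.strip() #<--split each name on "(", keep the first piece, and strip the whitespace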