5 Grouping

In this section we will go over the split-apply-combine strategy and the groupby() function.

This lesson is based on the R lesson on summary statistics using group-by and summarize [1] co-developed at the NCEAS Learning Hub.

Learning objectives

By the end of this lesson students will be able to:

Understand and apply the Split-Apply-Combine strategy to analyze grouped data.
Use groupby() to split a pandas.DataFrame by one or more columns.
Calculate summary statistics for groups in a pandas.DataFrame.
Use method chaining for efficient data analysis.

About the data

For this section we will use the Palmer Penguins dataset [2] developed by Drs. Allison Horst, Alison Hill and Kristen Gorman. This dataset contains size measurements for three penguin species in the Palmer Archipelago, Antarctica during 2007, 2008, and 2009.

The Palmer Archipelago penguins. Artwork by Dr. Allison Horst.

The dataset has 344 rows and 8 columns. Let’s start by loading the data:

import numpy as np
import pandas as pd

# Load Palmer penguins data
URL = 'https://raw.githubusercontent.com/allisonhorst/palmerpenguins/main/inst/extdata/penguins.csv'
penguins = pd.read_csv(URL)

penguins.head()

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
0	Adelie	Torgersen	39.1	18.7	181.0	3750.0	male	2007
1	Adelie	Torgersen	39.5	17.4	186.0	3800.0	female	2007
2	Adelie	Torgersen	40.3	18.0	195.0	3250.0	female	2007
3	Adelie	Torgersen	NaN	NaN	NaN	NaN	NaN	2007
4	Adelie	Torgersen	36.7	19.3	193.0	3450.0	female	2007

Summary statistics

It is easy to get summary statistics for each column in a pandas.DataFrame by using methods such as

sum(): sum values in each column,
count(): count non-NA values in each column,
min() and max(): get the minimum and maximum value in each column,
mean() and median(): get the mean and median value in each column,
std() and var(): get the standard deviation and variance in each column.

Example

# Get the number of non-NA values in each column 
penguins.count()

species              344
island               344
bill_length_mm       342
bill_depth_mm        342
flipper_length_mm    342
body_mass_g          342
sex                  333
year                 344
dtype: int64

# Get minimum value in each column with numerical values
penguins.select_dtypes('number').min()

bill_length_mm         32.1
bill_depth_mm          13.1
flipper_length_mm     172.0
body_mass_g          2700.0
year                 2007.0
dtype: float64

Grouping

Our penguins data is naturally split into different groups: there are three different species, two sexes, and three islands. Often, we want to calculate a certain statistic for each group. For example, suppose we want to calculate the average flipper length per species. How would we do this “by hand”?

We start with our data and notice there are multiple species in the species column.
We split our original table to group all observations from the same species together.
We calculate the average flipper length for each of the groups we formed.
Then we combine the values for average flipper length per species into a single table.

This is known as the Split-Apply-Combine strategy. This strategy follows the three steps we explained above:

Split: Split the data into logical groups (e.g. species, sex, island, etc.)
Apply: Calculate some summary statistic on each group (e.g. average flipper length by species, number of individuals per island, body mass by sex, etc.)
Combine: Combine the statistic calculated on each group back together.

Split-apply-combine to calculate mean flipper length

For a pandas.DataFrame or pandas.Series, we can use the groupby() method to split (i.e. group) the data into different categories.

The general syntax for groupby() is

df.groupby(columns_to_group_by).summary_method()

Most often, we will have that columns_to_group_by will be a single column name (a string) or a list of column names. The unique values of the column (or columns) will be used as the groups of the data frame.

Example

If we don’t use groupby() and directly apply the mean() method to our flipper length column, we obtain the average of all the values in the column:

penguins['flipper_length_mm'].mean()

200.91520467836258

To get the mean flipper length by species we first group our dataset by the species column’s values. However, if we just use the groupby() method without specifying what we wish to calculate on each group, not much happens up front:

penguins.groupby('species')['flipper_length_mm']

<pandas.core.groupby.generic.SeriesGroupBy object at 0x149eda550>

We get a GroupBy object, which is like an intermediate step. It doesn’t perform the actual calculations until we specify an operation:

# Average flipper length per species
penguins.groupby('species')['flipper_length_mm'].mean()

species
Adelie       189.953642
Chinstrap    195.823529
Gentoo       217.186992
Name: flipper_length_mm, dtype: float64

Let’s recap what went on in that line (remember the . can be read as “and then…”):

start with the penguins data frame, and then…
use groupby() to group the data frame by species values, and then…
select the 'flipper_length_mm' column, and then…
calculate the mean() of this column with respect to the groups.

Notice that the name of the series is the same as the column on which we calculated the summary statistc. We can easily update this using the rename() method:

# Average flipper length per species
avg_flipper = (penguins.groupby("species")
                        .flipper_length_mm
                        .mean()
                        .rename('mean_flipper_length')
                        .sort_values(ascending=False)
                        )
avg_flipper

species
Gentoo       217.186992
Chinstrap    195.823529
Adelie       189.953642
Name: mean_flipper_length, dtype: float64

We can also group by combinations of columns.

Example

We want to know what was the number of penguins surveyed in each island on different years. We can use the count() method to count the number of non-NA values in each column like this:

penguins.count()

species              344
island               344
bill_length_mm       342
bill_depth_mm        342
flipper_length_mm    342
body_mass_g          342
sex                  333
year                 344
dtype: int64

When we group by island and year we get the count of non-NA values for each column across each combination of island and year:

penguins.groupby(['island','year']).count()

		species	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex
island	year
Biscoe	2007	44	44	44	44	44	43
	2008	64	64	64	64	64	63
	2009	60	59	59	59	59	57
Dream	2007	46	46	46	46	46	45
	2008	34	34	34	34	34	34
	2009	44	44	44	44	44	44
Torgersen	2007	20	19	19	19	19	15
	2008	16	16	16	16	16	16
	2009	16	16	16	16	16	16

Let’s say we want to plot the surveyed population per year and island. We could then use method chaining to do this:

(penguins.groupby(['island','year'])
         .count()
         .species
         .sort_values()
         .plot(kind='barh',
                title='Penguins surveyed at the Palmer Archipelago',
                ylabel=('Island, Year'))
         )

Check-in

Use the max() method for pandas.DataFrames to calculate the maximum value of a penguin’s body mass by year and species.

Use (1) to display the highest body masses per year and species as a bar plot in descending order.

References

[1]

H. Do-Linh, C. Galaz García, M. B. Jones, and C. Vargas Poulsen, Open Science Synthesis training Week 1. NCEAS Learning Hub & Delta Stewardship Council. 2023. Available: https://learning.nceas.ucsb.edu/2023-06-delta/

[2]

A. M. Horst, A. P. Hill, and K. B. Gorman, Palmerpenguins: Palmer archipelago (antarctica) penguin data. 2020. Available: https://allisonhorst.github.io/palmerpenguins/