11 Reprojecting

In this section we will learn how to join dataframes and will apply this to creating a choropleth map with geopandas.

About the data

The first dataset we will use is a list of Arctic communities and their location [1] created by the Alaska Native Tribal Health Consortium. This data comes in a GeoJSON file with the following attributes:

name: name of Arctic community,
population: population of Arctic community, as of 2022
country: country that the Arctic community falls within (see dataset metadata for the codes)
geoname-id: numeric codes that uniquely identify all administrative/legal and statistical geographic areas for which the Census Bureau tabulates data

The second dataset is Natural Earth’s medium-scale cultural boundaries data for countries (1:50m). We can obtain this dataset by downloading the shapefile. Natural Earth is a public domain dataset with free, ready-to-use data for creating maps.

The third dataset we will use is a CSV file with the country codes and names of the Arctic countries in the Arctic communities dataset. This dataset was created for educational purposes for this lesson based on the metadata of the Arctic communities dataset and the country names in Natural Earth’s dataset. It can be accessed here.

Import data

We will first import the countries shapefile and adapt it for wrangling purposes:

import os
import pandas as pd
import matplotlib.pyplot as plt
import geopandas as gpd

# Import countries polygons
fp = os.path.join('data', 'ne_50m_admin_0_countries', 'ne_50m_admin_0_countries.shp')
countries = gpd.read_file(fp)

# Simplify column names
countries.columns = countries.columns.str.lower()

# Select columns for analysis
countries = countries[['admin', 'type', 'geometry']]

countries.head()

	admin	type	geometry
0	Zimbabwe	Sovereign country	POLYGON ((31.28789 -22.40205, 31.19727 -22.344...
1	Zambia	Sovereign country	POLYGON ((30.39609 -15.64307, 30.25068 -15.643...
2	Yemen	Sovereign country	MULTIPOLYGON (((53.08564 16.64839, 52.58145 16...
3	Vietnam	Sovereign country	MULTIPOLYGON (((104.06396 10.39082, 104.08301 ...
4	Venezuela	Sovereign country	MULTIPOLYGON (((-60.82119 9.13838, -60.94141 9...

# Quick view
countries.plot()

Next, we import the Arctic communities data. Similar to how we previously used pandas.read_csv(), we can read in the Arctic communities GeoJSON data directly from the data repository using geopandas.read_file():

# Import Arctic communities data
URL = 'https://cn.dataone.org/cn/v2/resolve/urn%3Auuid%3Aed7718ae-fb0d-43dd-9270-fbfe80bfc7a4'
communities = gpd.read_file(URL)

communities.head()

	name	population	country	geoname-id	geometry
0	Udomlya	32373	RU	452949	POINT (34.99250 57.87944)
1	Valmiera	26963	LV	453754	POINT (25.42751 57.54108)
2	Ventspils	42644	LV	454310	POINT (21.57288 57.38988)
3	Vec-Liepāja	85260	LV	454432	POINT (21.01667 56.53333)
4	Tukums	18348	LV	454768	POINT (23.15528 56.96694)

The CRS of the communities is EPSG:4326. Remember all GeoJSON files are given in this CRS and all points are expressed in longitude and latitude units of decimal degrees.

print(countries.crs)

EPSG:4326

Since the CRSs of our geospatila data match, it is easy to take a quick look at our communities data by plotting it on top of the countries dataframe:

# Verify CRSs match
assert countries.crs == communities.crs

fig, ax = plt.subplots()
countries.plot(ax=ax)
communities.plot(ax=ax, color='red')
plt.show()

Finally, we import the country names and codes CSV:

country_names = pd.read_csv(os.path.join('data','country_names.csv'))
country_names

	country	admin
0	RU	Russia
1	LV	Latvia
2	EE	Estonia
3	LT	Lithuania
4	SE	Sweden
5	BY	Belarus
6	FI	Finland
7	NO	Norway
8	DK	Denmark
9	FO	Faroe Islands
10	IS	Iceland
11	GB	United Kingdom
12	AX	Aland Islands
13	GL	Greenland
14	US	United States of America
15	CA	Canada

Arctic communities by country

Next, we want to calculate the number of Arctic communities by country.

# Create data frame with number of communities per country
n_comms = (communities.groupby('country')
                      .size()
                      .reset_index(name='n_communities'))

Let’s break this down:

We start with our communities dataframe and use groupby('country') to group by country code.
Then we use size() as an aggregator function to calculate the size of each group.
The result of this operation is a pandas.Series indexed by the country values.
By resetting the index we transform the pandas.Series into a pandas.DataFrame, the index is now a column named country and the values of the series are named n_communities.

# Number of Arctic communities per country
n_comms

	country	n_communities
0	AX	1
1	BY	8
2	CA	7
3	DK	72
4	EE	14
5	FI	98
6	FO	1
7	GB	96
8	GL	1
9	IS	5
10	LT	26
11	LV	25
12	NO	48
13	RU	774
14	SE	133
15	US	115

`if`-`else` statements

Our goal is to merge the n_comms and the countries data frames. To merge two data frames they need to have at least one column in common. Currently our datasets do not have any columns in common:

set(countries.columns).intersection(n_comms.columns)

set()

The output set() represents the empty set. This might not be as informative, so let’s write a different information statement:

common_columns = set(countries.columns).intersection(n_comms.columns)

# Check if there are any common columns
if len(common_columns) != 0:
    print(f"Common columns: {common_columns}")
else:
    print("No common columns")

No common columns

Remember that an if-else statement is a control structure that allows code to make decisions: it checks a condition, and if that condition is true, it executes one block of code (the if block); if the condition is false, it executes a different block (the else block). This enables programs to respond differently depending on specific criteria or inputs.

Check-in

Wrap up the previous code into a function called check_common_columns that prints a message depending of whether two data frames have common columns or not. Don’t forget to include a docstring!

Merging data frames

We can use the country_names data frame to add the names countries into the n_comms data which, in turn, will allow us to merge that data frame with the country_names data. To merge dataframes we can use the pandas.merge() function. The basic syntax for it is:

output_df = pd.merge(left_df,
                     right_df, 
                     how = type_of_join, 
                     on = column_to_join)

where

output_df is the dataframe resulting from the merge,
left_df is the dataframe we have “on the left side”,
right_df is the dataframe we have “on the right side”,
how specifies the type of join between the left and right dataframes, (check the options here), the default is to do an inner join,
on specifies the column to join on, this column must be present in both our dataframes.

When merging a geopandas.GeoDataFrame with a pandas.DataFrame, the geopandas.GeoDataFrame must be “on the left” to retain the geospatial information. Otherwise, the output will be a pandas.DataFrame.

So, we merge the n_comms and country_names data frames using a left join:

Image source: Data Modeling Essentials, NCEAS Learning Hub [2]

n_comms = pd.merge(n_comms,
                   country_names,
                   how='left',
                   on='country')
n_comms

	country	n_communities	admin
0	AX	1	Aland Islands
1	BY	8	Belarus
2	CA	7	Canada
3	DK	72	Denmark
4	EE	14	Estonia
5	FI	98	Finland
6	FO	1	Faroe Islands
7	GB	96	United Kingdom
8	GL	1	Greenland
9	IS	5	Iceland
10	LT	26	Lithuania
11	LV	25	Latvia
12	NO	48	Norway
13	RU	774	Russia
14	SE	133	Sweden
15	US	115	United States of America

We can reuse our function to check that n_comms and countries now have a common column on which we can merge them:

check_common_columns(n_comms, countries)

Common columns: {'admin'}

Check-in

Create a new variable named arctic_countries which is the result of an inner join between our countries and n_comms dataframes. The inner joun will merge the subset of rows that have matches in both the left table and the right table.

Image source: Data Modeling Essentials, NCEAS Learning Hub [2]

Reviewing results

Notice that the row for Aland Islands is not present in the merged dataframe:

arctic_countries

	type	geometry	country	n_communities
admin
United States of America	Country	MULTIPOLYGON (((-132.74687 56.52568, -132.7576...	US	115
United Kingdom	Country	MULTIPOLYGON (((-2.66768 51.62300, -2.74214 51...	GB	96
Sweden	Sovereign country	MULTIPOLYGON (((19.07646 57.83594, 18.99375 57...	SE	133
Russia	Sovereign country	MULTIPOLYGON (((145.88154 43.45952, 145.89561 ...	RU	774
Norway	Sovereign country	MULTIPOLYGON (((20.62217 69.03687, 20.49199 69...	NO	48
Lithuania	Sovereign country	MULTIPOLYGON (((20.95781 55.27891, 20.89980 55...	LT	26
Latvia	Sovereign country	POLYGON ((26.59355 55.66753, 26.54287 55.67241...	LV	25
Iceland	Sovereign country	POLYGON ((-15.54312 66.22852, -15.42847 66.224...	IS	5
Finland	Country	MULTIPOLYGON (((24.15547 65.80527, 24.04902 65...	FI	98
Estonia	Sovereign country	MULTIPOLYGON (((27.35195 57.52812, 27.32656 57...	EE	14
Greenland	Country	MULTIPOLYGON (((-29.95288 83.56484, -28.99199 ...	GL	1
Faroe Islands	Dependency	MULTIPOLYGON (((-6.62319 61.80596, -6.64277 61...	FO	1
Denmark	Country	MULTIPOLYGON (((12.56875 55.78506, 12.57119 55...	DK	72
Canada	Sovereign country	MULTIPOLYGON (((-132.65552 54.12749, -132.5640...	CA	7
Belarus	Sovereign country	POLYGON ((31.76338 52.10107, 31.57373 52.10811...	BY	8

The values attribute of a data frame returns all the values in the data frame as an array. We can verify the value ‘Aland Islands’ was nowhere in our original countries dataframe like this:

# Check Aland Islands is nowhere in data frame
'Aland Islands' not in countries.values

True

The Aland Islands is an autonomous region of Finland and there is one Arctic community registered in this region. We will directly add one to Finland to not lose this piece of data:

arctic_countries.at['Finland', 'n_communities'] += 1

Choropleth map

A choropleth map is an efficient way to visualize aggregate data per region.

Making a choropleth map from our polygons GeoDataFrame is easy; we just need to specify the column parameter in plot() and make it equal to the column with the values we want to plot in each country:

arctic_countries.plot(column='n_communities',
                      legend=True)

Reprojecting

Remember that CRSs reflect cultural views and even biases. Any map projection involves choices about which areas to emphasize, minimize, or distort, and those choices can influence how viewers perceive different regions. In our map, using the EPSG:4326 CRS is, among other things, mapping the Arctic regions as far apart, while they are actually near each other.

Reprojecting means changing the coordinate reference system of your geospatial data. In our case, we will reproject the Alaska geo-dataframe to the CRS EPSG:3413. This CRS is a projected CRS, better suited for working with data from the Arctic region:

Changing CRSs in GeoPandas is very simple using the to_crs() method for geopandas.GeoDataFrames. The general syntax is:

updated_geodf = geodf.to_crs(new_crs)

where:

updated_geodf is the output of the method, a new geodataframe (to_crs() does not work in place),
geodf is the geopandas.GeoDataFrame we want to transform,
new_crs the CRS we want to convert to, this is an object of type CRS or string representing the CRS (ex: 'epsg:3413').

In our case:

# Reproject to CRS optimized for Arctic region
arctic_countries = arctic_countries.to_crs('epsg:3413')

We can now use the reprojected data to update our map:

Code

fig, ax = plt.subplots(figsize=(8, 6))

# Remove the axis for a cleaner map
ax.axis('off')

# Create choropleth map of communities
# Plot with refined color and edge style
arctic_countries.plot(
    ax=ax,
    column='n_communities',
    cmap='PuBuGn',
    legend=True,
    edgecolor="0.6",
    linewidth=0.5,
    legend_kwds={
        "shrink": 0.7,
        "label": "Number of Arctic Communities",
        "orientation": "horizontal",
        "pad": 0.05
    }
)

# Add title and subtitle for better context
ax.set_title('Distribution of Arctic Communities', fontsize=18, weight='bold', pad=15)

plt.show()

Although the new projection clearly improves the presentation of the data, there are still issues with this plot! Mainly, the entire United States territory is in it, when we should only have Alaska. In our next lesson we will review startegies to clip and subset vector data and return to this plot in our discussion section.

References

[1]

M. Brook, “Approximate Arctic Communities and Populations, (latitude >= 55, 2022).” Arctic Data Center, 2023. doi: 10.18739/A28S4JQ80. Available: https://arcticdata.io/catalog/view/doi:10.18739/A28S4JQ80. [Accessed: Nov. 03, 2024]

[2]

H. Do-Linh, C. Galaz García, M. B. Jones, and C. Vargas Poulsen, Open Science Synthesis training Week 1. NCEAS Learning Hub & Delta Stewardship Council. 2023. Available: https://learning.nceas.ucsb.edu/2023-06-delta/

11 Reprojecting

About the data

Import data

Arctic communities by country

if-else statements

Merging data frames

Reviewing results

Choropleth map

Reprojecting

References

`if`-`else` statements