Water conflicts in the Colorado River Basin

Week 2 - Discussion section

This discussion section will guide you through exploring data about water-related conflicts in the Colorado River Basin using data from the U.S. Geological Survey (USGS). In this discussion section, you will:

Practice version control using git via the terminal
Use methods to work with pandas.Series of strings using the .str accessor
Practice method chaining

Setup

In the workbench-1 server, start a new JupyterLab session or access an active one.
In the terminal, use cd to navigate into the eds-220-sections directory. Use pwd to verify eds-220-sections is your current working directory.
Create a new Python notebook inside your eds-220-sections directory and rename it to section-2-co-basin-water-conflicts.ipynb.
Use the terminal to stage, commit, and push this file to the remote repository. Remember:
1. git status : check git status
2. git add FILE-NAME : stage updated file
3. git status : check git status again to confirm
4. git commit -m "Commit message" : commit with message
5. git pull : check local repo is up to date (best practice)
6. git push : push changes to upstream repository

CHECK IN WITH YOUR TEAM

MAKE SURE YOU’VE ALL SUCCESSFULLY SET UP YOUR NOTEBOOKS BEFORE CONTINUING

General directions

Add comments in each one of your code cells.
Include markdown cells in between your code cells to add titles and information.
Indications about when to commit and push changes are included, but you are encouraged to commit and push more often.

About the data

For these exercises we will use data about Water Conflict and Crisis Events in the Colorado River Basin [1]. This dataset is stored at ScienceBase, a digital repository from the U.S. Geological Survey (USGS) created to share scientific data products and USGS resources.

The dataset is a CSV file containing conflict or crisis events around water resource management in the Colorado River Basin. The Colorado River Basin, inhabited by several Native American tribes for centuries, is a crucial water source in the southwestern United States and northern Mexico, supporting over 40 million people, extensive agricultural lands, and diverse ecosystems. Its management is vital due to the region’s arid climate and the competing demands for water, leading to significant challenges related to water allocation and conservation.

Colorado River Basin. U.S. Bureau of Reclamation.

1. Archive exploration

Look through the dataset’s description in the ScienceBase repository. Find the following information:
1. Where was the data collected from?
2. During what time frame were the observations in the dataset collected?
3. What was the author’s perceived value of this dataset?
In a markdown cell, use your answers to the previous questions to add a brief description of the dataset. Briefly discuss anything else that seems relevant to you. Include a citation, date of access, and a link to the archive.
Take a look at the data’s metadata by clicking on the “View” icon of the Coded Events Colorado River Basin Water Conflict Table Metadata.xml file.

check git status -> stage changes -> check git status -> commit with message -> pull -> push changes

2. Data loading

Create a new directory data/ inside your eds-220-sections directory.
Download the Colorado River Basin Water Conflict Table.csv file from the ScienceBase repository and upload it into the data/ folder.
Update the .gitignore file of your eds-220-sections so it ignores the data/ folder. Push the changes to this file. Verify that git is ignoring the data file. Note: If you update the .gitignore file via GitHub, you need to run git pull when you go back to the server.
Load the data into your section-2-co-basin-water-conflicts.ipynb notebook. Name your data frame variable df.
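As a rough sketch of how pandas.read_csv() behaves, the example below reads a tiny in-memory stand-in instead of the real file. The two rows and the name df_example are made up for illustration; only the Place and State column names come from the metadata. In your notebook you would pass the path to the CSV inside data/ instead of the StringIO object.

```python
import io
import pandas as pd

# Hypothetical stand-in for the downloaded CSV (values are invented)
csv_text = io.StringIO(
    "Place,State\n"
    "Lake Mead,NV; AZ\n"
    "Denver,CO\n"
)

# read_csv accepts a file path or a file-like object such as StringIO
df_example = pd.read_csv(csv_text)

# Quick sanity check: number of rows and columns that were read
print(df_example.shape)
```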

3. Preliminary data exploration

Set pandas to display all columns in the data frame.
Using pandas methods, obtain preliminary information and explore this data frame in at least four different ways.
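One possible way to approach these two steps is sketched below on a small made-up data frame (toy is a hypothetical name and its values are invented). The methods shown are only a few of the options pandas offers for a first look at a data frame.

```python
import pandas as pd

# Remove the limit on how many columns pandas displays
pd.set_option('display.max_columns', None)

# Toy data frame with invented values, used only for illustration
toy = pd.DataFrame({'Place': ['Lake Mead', 'Denver', None],
                    'State': ['NV; AZ', 'CO', None]})

print(toy.head())        # first rows
print(toy.shape)         # (number of rows, number of columns)
print(toy.dtypes)        # data type of each column
print(toy.isna().sum())  # count of missing values per column
```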

CHECK IN WITH YOUR TEAM 🙌

YOU CAN SLACK THEM TO LET THEM KNOW YOU’RE READY FOR TOMORROW OR BRING UP ANY QUESTIONS

MAKE SURE YOU’VE ALL SUCCESSFULLY LOADED THE DATA AND DONE A PRELIMINARY EXPLORATION BEFORE CONTINUING

check git status -> stage changes -> check git status -> commit with message -> pull -> push changes

4. Location column descriptions

In these exercises we will work with columns in the data frame pertaining to the location of an event. Before continuing, read the following column descriptions from the .xml metadata file:

Column	Description
Place	Where the event actually occurred, but also where the event’s direct implications are felt most directly. When the researchers reviewed the articles, they were looking for mentions of specific places impacted by the events. Empty cell indicates a place was not coded for this event. NA indicates a place is not referenced in the event text.
State	State Name coded from Place field. Empty cell indicates a state was not coded for this event or that the article was not coded.

5. String accessor for `pandas.Series`

In the following exercises we will work with pandas.Series whose values are strings. This is a common scenario, so pandas has special string methods for this kind of series. These methods are accessed via the str accessor. Accessors provide additional functionality for working with specific kinds of data (in this case, strings).

The code below gives a brief demonstration of using the str accessor to call the split() method on a pandas.Series. Carefully read the code and check in with your team to see if you have questions about it. We’ll use it in a moment.

import numpy as np
import pandas as pd 

# Example series
s = pd.Series(['California; Nevada', 'Arizona', np.nan, 'Nevada; Utah'])
s

0    California; Nevada
1               Arizona
2                   NaN
3          Nevada; Utah
dtype: object

# str accessor (doesn't do anything by itself)
s.str

<pandas.core.strings.accessor.StringMethods at 0x10ad43d90>

# Use str accessor with additional methods to perform string operations
# .split splits strings by ';' and expands output into separate columns
s.str.split(';', expand=True)

	0	1
0	California	Nevada
1	Arizona	None
2	NaN	NaN
3	Nevada	Utah

# Use stack() method to flatten the data frame into a series
# default is to drop NAs and None from result
s.str.split(';', expand=True).stack()

0  0    California
   1        Nevada
1  0       Arizona
3  0        Nevada
   1          Utah
dtype: object
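split() is only one of the methods available through the str accessor. As a brief, non-exhaustive sketch, the example below applies str.upper() and str.contains() to the same example series. Note that missing values stay missing unless you say otherwise, here with the na argument of str.contains().

```python
import numpy as np
import pandas as pd

# Same example series as above
s = pd.Series(['California; Nevada', 'Arizona', np.nan, 'Nevada; Utah'])

# Convert every string to upper case; the missing value stays missing
upper = s.str.upper()
print(upper)

# Check which entries mention 'Nevada'; na=False treats missing values as no match
has_nevada = s.str.contains('Nevada', na=False)
print(has_nevada)
```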

6. Examine state codes

Our goal today is to find which states are reported in the dataset as having water conflicts.

What are the unique values in the State column? What could be a challenge to writing code to find which states are listed (without repetition)? Remember to write longer answers in markdown cells, not as comments.

7. Brainstorm

Individually, write step-by-step instructions on how you would wrangle the data frame df to obtain a list (without repetition) of the state codes in which a water conflict has been reported. It’s ok if you don’t know how to code each step - it’s more important to have an idea of what you would like to do.
Discuss your step-by-step instructions with your team.

The next exercises will guide you through finding the unique state codes in the dataset. There are many ways of extracting this information. The one presented here might not be the same way you thought about doing it - that’s ok! This one was designed to practice using the .str accessor in a pandas.Series.

8. Exploratory wrangling

Perform the following wrangling:
1. select the State column from the df data frame
2. split the strings in the column by the delimiter ; into different columns
3. stack the results of the resulting data frame into a single pandas.Series
4. find the unique string values in the resulting series

Your final answer should use method chaining without creating new variables.
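If method chaining is new to you, the sketch below shows the general pattern on an unrelated, made-up data frame (the rivers data frame and its river column are hypothetical). Each method is applied to the output of the previous one, and wrapping the chain in parentheses lets you write one step per line.

```python
import numpy as np
import pandas as pd

# Made-up data, unrelated to the water conflicts dataset
rivers = pd.DataFrame({'river': ['Colorado River', 'Green River',
                                 'colorado river', np.nan]})

# One chain: select a column -> lower case -> remove a suffix -> unique values
result = (
    rivers['river']
    .str.lower()
    .str.replace(' river', '', regex=False)
    .unique()
)
print(result)
```

No intermediate variables are created along the way; only the final output of the chain is stored.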

CHECK IN WITH YOUR TEAM: IS EVERY STEP IN THE CHAINING CLEAR?

check git status -> stage changes -> check git status -> commit with message -> pull -> push changes

9. Find unique state codes

Discuss with your team: Why do some state codes seem to be repeated? What would we need to do to get the correct strings?
Update your code to obtain a list of codes (without repetition) of the states mentioned in the news articles about water conflicts in the Colorado River Basin. Hint: str.strip().

Bonus: How many articles mention each state?
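One method that may help with counting is value_counts(), which returns how many times each distinct value appears in a series. The sketch below uses a made-up series of codes (codes is a hypothetical name), not the dataset itself.

```python
import pandas as pd

# Made-up series of codes, used only to illustrate value_counts()
codes = pd.Series(['AZ', 'NV', 'AZ', 'CO', 'AZ'])

# Count occurrences of each distinct value, most frequent first
counts = codes.value_counts()
print(counts)
```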

check git status -> stage changes -> check git status -> commit with message -> pull -> push changes

References

[1] D. V. Holloman, M. K. Hines, and D. K. Zoanni, “Coded Water Conflict and Crisis Events in the Colorado River Basin, Derived from LexisNexis search 2005-2021.” U.S. Geological Survey, 2023. doi: 10.5066/P9X6WR7J. Available: https://www.sciencebase.gov/catalog/item/63acac09d34e92aad3ca1480. [Accessed: Sep. 27, 2024]
