import numpy as np
import pandas as pd
# Example series
= pd.Series(['California; Nevada', 'Arizona', np.nan, 'Nevada; Utah'])
s s
0 California; Nevada
1 Arizona
2 NaN
3 Nevada; Utah
dtype: object
This wesbite is under construction.
Week 2 - Discussion section
This discussion section will guide you through exploring data about water-related conflicts at the Colorado River Basin using data from the U.S. Geological Survey (USGS). In this discussion section, you will:
pandas.Series
of strings using the .str
accessorIn the workbench-1 server, start a new JupyterLab session or access an active one.
In the terminal, use cd
to navigate into the eds-220-sections
directory. Use pwd
to verify eds-220-sections
is your current working directory.
Create a new Python notebook inside your eds-220-sections
directory and rename it to section-2-co-basin-water-conflicts.ipynb
.
Use the terminal to stage, commit, and push this file to the remote repository. Remember:
git status
: check git statusgit add FILE-NAME
: stage updated filegit status
: check git status again to confirmgit commit -m "Commit message"
: commit with messagegit pull
: check local repo is up to date (best practice)git push
: push changes to upstream repositoryCHECK IN WITH YOUR TEAM
MAKE SURE YOU’VE ALL SUCCESSFULLY SET UP YOUR NOTEBOOKS BEFORE CONTINUING
For these exercises we will use data about Water Conflict and Crisis Events in the Colorado River Basin [1]. This dataset is stored at ScienceBase, a digital repository from the U.S. Geological Survey (USGS) created to share scientific data products and USGS resources.
The dataset is a CSV file containing conflict or crisis around water resource management in the Colorado River Basin. The Colorado River Basin, inhabited by several Native American tribes for centuries, is a crucial water source in the southwestern United States and northern Mexico, supporting over 40 million people, extensive agricultural lands, and diverse ecosystems. Its management is vital due to the region’s arid climate and the competing demands for water, leading to significant challenges related to water allocation and conservation.
Look through the dataset’s description in the ScienceBase repository. Find the following information:
In a markdown cell, use your answers to the previous questions to add a brief description of the dataset. Briefly discuss anything else that seems relevant to you. Include a citation, date of access, and a link to the archive.
Take a look at the data’s metadata by clicking on the “View” icon of the Coded Events Colorado River Basin Water Conflict Table Metadata.xml
file.
check git status -> stage changes -> check git status -> commit with message -> pull -> push changes
Create a new directory data/
inside your eds-220-sections
directory.
Download the Colorado River Basin Water Conflict Table.csv
file from the Science Base repository and upload it into the data/
folder.
Update the .gitignore
file of your eds-220-sections
so it ignores the data/
folder. Push the changes to this file. Verify that git is ignoring the data file. Note: If you update the .gitignore
file via GitHub, you need to run git pull
when you go back to the server.
Load the data into your section-2-co-basin-water-conflicts.ipynb
notebook. Name your data frame variable df
.
Set pandas
to display all columns in the data frame.
Using pandas
methods, obtain preliminary information and explore this data frame in at least four different ways.
CHECK IN WITH YOUR TEAM 🙌
YOU CAN SLACK THEM TO LET THEM KNOW YOU’RE READY FOR TOMORROW OR BRING UP ANY QUESTIONS
MAKE SURE YOU’VE ALL SUCCESSFULLY LOADED THE DATA AND DONE A PRELIMINARY EXPLORATION BEFORE CONTINUING
check git status -> stage changes -> check git status -> commit with message -> pull -> push changes
In these exercises we will work with columns in the data frame pertaining to the location of an event. Before continuing, read the following column descriptions form the .xml metadata file:
Column | Description |
---|---|
Place | Where the event actually occurred, but also where the event’s direct implications are felt most directly. When the researchers reviewed the articles, they were looking for mentions of specific places impacted by the events. Empty cell indicates a place was not coded for this event. NA indicates a place is not referenced in the event text. |
State | State Name coded from Place field. Empty cell indicates a state was not coded for this event or that the article was not coded. |
pandas.Series
In the following exercises we will work with pandas.Series
whose values are strings. This is a common scenario, so pandas
has special string methods for this kind of series. These methods are accessed via the str
accessor. Accessors provide additional functionality for working with specific kinds of data (in this case, strings).
str
accessor to use the split()
method for pandas.Series
. Carefully read the code and check in with your team to see if you have questions about it. We’ll use it in a moment.import numpy as np
import pandas as pd
# Example series
s = pd.Series(['California; Nevada', 'Arizona', np.nan, 'Nevada; Utah'])
s
0 California; Nevada
1 Arizona
2 NaN
3 Nevada; Utah
dtype: object
<pandas.core.strings.accessor.StringMethods at 0x10ad43d90>
Our goal today is to find which states are reported in the dataset as having a water conflicts.
States
column? What could be a challenge to writing code to find which states are listed (without repetition)? Remember to write longer answers in mardown cells, not as comments.Individually, write step-by-step instructions on how you would wrangle the data frame df
to obtain a list (without repetition) of the state codes in which a water conflict has been reported. It’s ok if you don’t know how to code each step - it’s more important to have an idea of what you would like to do.
Discuss your step-by-step instructions with your team.
The next exercises will guide you through finding the unique state codes in the dataset. There are many ways of extracting this information. The one presented here might not be the same way you thought about doing it - that’s ok! This one was designed to practice using the .str
accessor in a pandas.Series
.
df
data frame;
into different columnspandas.Series
Your final answer should use method chaining without creating new variables.
CHECK IN WITH YOUR TEAM: IS EVERY STEP IN THE CHAINING CLEAR?
check git status -> stage changes -> check git status -> commit with message -> pull -> push changes
Discuss with your team: Why do some state codes seem to be repeated? What would we need to do to get the correct strings?
Update your code to obtain a list of codes (without repetition) of the states mentioned in the news articles about water conflicts in the Colorado River Basin. Hint: str.strip().
check git status -> stage changes -> check git status -> commit with message -> pull -> push changes