Prey species in the California drylands

Week 1 - Discussion section

This discussion section will guide you through preliminary data exploration for a real-world dataset of animal observations in the California drylands. In this section, you will:

Collaborate with your new team!
Practice version control using git via the terminal
Obtain information about a dataset from an online data repository
Use the pandas.read_csv() function for loading files directly from a URL
Use pandas.DataFrame methods to do preliminary analysis

Setup

In the workbench-1 server, start a new JupyterLab session or access an active one.
In the terminal, use cd to navigate into the eds-220-sections directory. Use pwd to verify eds-220-sections is your current working directory.
Create a new Python notebook inside your eds-220-sections directory and rename it to section-1-data-selection-drylands.ipynb.
Use the terminal to stage, commit, and push this file to the remote repository. Remember:
1. git status : check git status
2. git add FILE-NAME : stage updated file
3. git status : check git status again to confirm
4. git commit -m "Commit message" : commit with message
5. git pull : check local repo is up to date (best practice)
6. git push : push changes to upstream repository

CHECK IN WITH YOUR TEAM

MAKE SURE YOU’VE ALL SUCCESSFULLY SET UP YOUR NOTEBOOKS BEFORE CONTINUING

General directions

Add comments in each one of your code cells.
In each exercise, include markdown cells between your code cells to add titles and information.
Indications about when to commit and push changes are included, but you are encouraged to commit and push more often.
You won’t need to upload any data.

About the data

For these exercises we will use data about prey items for endangered terrestrial vertebrate species within central California drylands[1] [2].

This dataset is stored in the Knowledge Network for Biocomplexity (KNB) data repository. This is an international repository intended to facilitate ecological and environmental research. It has thousands of open datasets and is hosted by the National Center for Ecological Analysis and Synthesis (NCEAS).

Data collection plot at Mojave Desert near Tecopa. Photo courtesy of Dr. Rachel King.

1. Archive exploration

When possible, data exploration should start at the data repository. Take some time to look through the dataset’s description in the KNB data repository. Discuss the following questions with your team:

What is this data about?
Is this data collected in-situ by the authors or is it a synthesis of multiple datasets?
During what time frame were the observations in the dataset collected?
Does this dataset come with an associated metadata file?
Does the dataset contain sensitive data?

In your notebook: use a markdown cell to add a brief description of the dataset, including a citation, date of access, and a link to the archive.

check git status -> stage changes -> check git status -> commit with message -> pull -> push changes

2. Metadata exploration

You may have noticed there are two metadata files: Compiled_occurrence_records_for_prey_items_of.xml and metadata_arth_occurrences.csv. The .xml file is written in EML, which stands for Ecological Metadata Language. This is a machine-readable file that has metadata about the whole dataset. In this section we will only use the metadata in the CSV file.

Back in your notebook, import the pandas package using its standard abbreviation in a code cell. Then follow these steps to read in the metadata CSV using the pandas.read_csv() function:

Navigate to the data package site and copy the URL to access the metadata_arth_occurrences CSV file. To copy the URL:

Hover over the Download button -> right click -> “Copy Link”.

Read in the data from the URL using the pd.read_csv() function like this:
```
# Access metadata from repository
pd.read_csv('the URL goes here')
```
Take a minute to look at the descriptions for the columns.

Note: Not all datasets have column descriptions in a CSV file. Often they come with a .doc or .txt file with information.
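The reading step above can be tried without the real URL. The following is a minimal sketch in which a small in-memory CSV stands in for the metadata file; the column names and descriptions are hypothetical and are not taken from metadata_arth_occurrences.csv. In the exercise, you would pass the copied URL string to pd.read_csv() instead.

```python
import io
import pandas as pd

# Hypothetical stand-in for the metadata file (made-up descriptions).
# In the exercise, pass the copied URL string to pd.read_csv() instead.
sample_csv = io.StringIO(
    "column,description\n"
    "site,Hypothetical name of the sampling site\n"
    "count,Hypothetical number of individuals observed\n"
)

# pd.read_csv() accepts a local path, a URL string, or a file-like object
metadata = pd.read_csv(sample_csv)

# Print every row without truncation so the descriptions can be read in full
print(metadata.to_string(index=False))
```

Storing the result in a variable (here the illustrative name metadata) makes it easy to come back to the column descriptions later in the notebook.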

3. Data loading

Follow steps (a) and (b) from the previous exercise to read in the drylands prey data file arth_occurrences_with_env.csv using pd.read_csv(). Store the dataframe in a variable called prey like this:

```
# Load data
prey = pd.read_csv('the URL goes here')
```

What is the type of the prey variable? Use a Python function to get this information.
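As a hint for checking types, here is a small sketch using Python's built-in type() function on a toy data frame. The name toy and its values are hypothetical and unrelated to the drylands data.

```python
import pandas as pd

# Toy data frame with hypothetical values (not the drylands data)
toy = pd.DataFrame({"species": ["ant", "beetle"], "count": [3, 1]})

# type() returns the class of the object a variable refers to
print(type(toy))

# Selecting a single column gives a different kind of pandas object
print(type(toy["count"]))
```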

check git status -> stage changes -> check git status -> commit with message -> pull -> push changes

CHECK IN WITH YOUR TEAM

MAKE SURE YOU’VE ALL SUCCESSFULLY ACCESSED THE DATA BEFORE CONTINUING

4. Look at your data

Run prey in a cell. What do you notice in the columns section?
To see all the column names in the same display we need to set a pandas option. Run the following command and then look at the prey data again:

```
pd.set_option("display.max_columns", None)
```

Add a comment explaining what pd.set_option("display.max_columns", None) does.
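To see the effect of a display option, you can read its value before and after changing it. This sketch uses display.max_columns, the option name listed in the pandas options documentation; the variable names before and after are only illustrative.

```python
import pandas as pd

# Read the current limit on how many columns pandas displays
before = pd.get_option("display.max_columns")
print(before)

# Change the option, then read it back to confirm the new value
pd.set_option("display.max_columns", None)
after = pd.get_option("display.max_columns")
print(after)

# Restore the default if you want the original behavior back
pd.reset_option("display.max_columns")
```

Display options only change how a data frame is shown, not the data it holds.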

check git status -> stage changes -> check git status -> commit with message -> pull -> push changes

5. `pd.DataFrame` preliminary exploration

Run each of the following methods for prey in a different cell and write a brief description of what they do as a comment:

head()
tail()
info()
nunique()

For example:

```
# head()
# returns the first five rows of the data frame
prey.head()
```

If you’re not sure about what the method does, try looking it up in the pandas.DataFrame documentation.

Check the documentation for head(). If this method has any optional parameters, change the default value to get a different output.
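As an illustration of optional parameters, the sketch below passes a row count to head() and tail() on a toy data frame. The data frame and its single column are hypothetical.

```python
import pandas as pd

# Toy data frame with 10 rows of hypothetical observation numbers
toy = pd.DataFrame({"observation": range(1, 11)})

# The optional parameter n sets how many rows are returned
first_three = toy.head(3)
last_two = toy.tail(n=2)

print(first_three)
print(last_two)
```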

Print each of the following attributes of prey in a different cell and write a brief explanation of what they are as a comment:

shape
columns
dtypes

If you’re not sure what information an attribute is showing, look it up in the pandas.DataFrame documentation.
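Note that attributes are accessed without parentheses, unlike methods. A minimal sketch on a toy data frame (hypothetical names and values):

```python
import pandas as pd

# Toy data frame with hypothetical values (not the drylands data)
toy = pd.DataFrame({"species": ["ant", "beetle", "moth"], "count": [3, 1, 7]})

# Attributes describe the data frame and take no parentheses
print(toy.shape)    # (3, 2): three rows, two columns
print(toy.columns)  # the column labels
print(toy.dtypes)   # the data type of each column
```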

check git status -> stage changes -> check git status -> commit with message -> pull -> push changes

6. Update column names

Change the column names institutionCode and datasetKey to institution_code and dataset_key, respectively. Make sure you’re actually updating the dataframe. HINT: look for the documentation on the rename method for pandas.DataFrame.
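The note about actually updating the dataframe matters because rename() returns a new data frame by default. The sketch below shows this on a toy data frame; the column names sampleId and siteName are hypothetical and are not the ones in the prey data.

```python
import pandas as pd

# Toy data frame with hypothetical column names (not the prey data)
toy = pd.DataFrame({"sampleId": [1, 2], "siteName": ["A", "B"]})

# rename() returns a new data frame; toy itself is unchanged here
renamed = toy.rename(columns={"sampleId": "sample_id"})
print(list(toy.columns))      # ['sampleId', 'siteName']
print(list(renamed.columns))  # ['sample_id', 'siteName']

# Assign the result back to the variable to update it
toy = toy.rename(columns={"sampleId": "sample_id", "siteName": "site_name"})
print(list(toy.columns))      # ['sample_id', 'site_name']
```

Checking the columns attribute after renaming is a quick way to confirm the change took effect.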

check git status -> stage changes -> check git status -> commit with message -> pull -> push changes

References

[1]

R. King, J. Braun, M. Westphal, and C. Lortie, “Compiled occurrence records for prey items of listed species found in California drylands with associated environmental data.” KNB Data Repository, 2023. doi: 10.5063/F1VM49RH. Available: https://knb.ecoinformatics.org/view/doi:10.5063/F1VM49RH. [Accessed: Aug. 26, 2024]

[2]

C. J. Lortie, J. Braun, R. King, and M. Westphal, “The importance of open data describing prey item species lists for endangered species,” Ecological Solutions and Evidence, vol. 4, no. 2, p. e12251, Apr. 2023, doi: 10.1002/2688-8319.12251. Available: https://besjournals.onlinelibrary.wiley.com/doi/10.1002/2688-8319.12251. [Accessed: Aug. 26, 2024]
