Snowshoe hares at Bonanza Creek Experimental Forest

Week 3 - Discussion section

This discussion section will guide you through exploring data about snowshoe hares in the (Lepus americanus) in the Bonanza Creek Experimental Forest located in Alaska, USA. In this discussion section, you will:

Practice markdown syntax for creating tables and inserting images
Practice detecting and cleaning messy data
Use groupby() to calculate summary statistics by groups
Select, clean, and comment your code to create a condensed data analysis workflow

Setup

Access the workbench-1 server.
Navigate to theeds-220-sections directory in the file navigation panel and the terminal.
Create a new Python notebook inside your eds-220-sections directory and rename it to section-3-snowshoe-hares.ipynb.
Use the terminal to push this file to you remote repository.

General directions

Add comments as appropriate along your code.
Include markdown cells in between your code cells to add titles and information to each exercise
You won’t need to upload any data.
Indications about when to commit and push changes are included. Commit every time you finish a major step! Remember to write your commits in the imperative mood.

About the data

For these exercises we will use data about Snowshoe hares (Lepus americanus) in the Bonanza Creek Experimental Forest [1].

This dataset is stored in the Environmental Data Initiative (EDI) data repository. This is a huge data repository committed to make data Findable, Accessible, Interoperable, and Reusable (FAIR). It is the main repository for all the data associated to the Long Term Ecological Research Network (LTER).

1. Archive exploration

Take some time to look through the dataset’s description in EDI and click around. Discuss the following questions with your team:
1. What is this data about?
2. During what time frame were the observations in the dataset collected?
3. Does the dataset contain sensitive data?
4. Is there a publication associated with this dataset?
In your notebook: use a markdown cell to add a brief description of the dataset, including a citation, date of access, and a link to the archive.
Back in the EDI repository, click on View Full Metadata to access more information if you haven’t done so already. Go to the “Detailed Metadata” section and click on “Data Entities”. Take some time to look at the descriptions for the dataset’s columns.

2. Adding an image

Back in your notebook, follow these steps to add an image of a hare using a URL:

Go to this link.
Get the URL of the hare image. To do this:

hover over the image –> right click –> “Copy Image Address”.

At the end of the markdown cell with the dataset description, use markdown sytanx to add the image from its URL:

![image description](URL-goes-here)

Do you need to add an attribution in the image description? Check the license at the bottom of wikimedia page.

commit, pull, and push changes

3. Data loading and preliminary exploration

Back in your notebook, import the 55_Hare_Data_2012.txt file from its URL using the pandas.read_csv() function. Store it in a variable named hares.

Using pandas methods, obtain preliminary information and explore this data frame. Consider answering some of these questions:

What are the dimensions of the dataframe and what are the data types of the columns? Do the data types match what you would expect from each column?
Are there any columns that have a significant number of NA values?
What are the minimum and maximum values for the weight and hind feet measurements?
What are the unique values for some of the categorical columns?
An explroatory question about the data frame you come up with!

CHECK IN WITH YOUR TEAM

MAKE SURE YOU’VE ALL SUCCESSFULLY ACCESSED THE DATA BEFORE CONTINUING

commit, pull, and push changes

4. Detecting messy values

In the metadata section of the EDI repository, find which are the allowed values for the hares’ sex. Create a small table in a markdown cell showing the values and their definitions.

Get the number of times each unique sex non-NA value appears.

Check the documentation of value_counts(). What is the purpose of the dropna parameter and what is its default value? Repeat step (a), this time adding the dropna=False parameter to value_counts().
Discuss with your team the output of the unique value counts. In particular:
1. Do the values in the sex column correspond to the values declared in the metadata?
2. What could have been potential causes for multiple codes?
3. Are there seemingly repated values? If so, what could be the cause?
Write code to confirm your suspicions about c-iii.

commit, pull, and push changes

5. Brainstorm

Individually, write step-by-step instructions on how you would wrangle the hares data frame to clean the values in the sex column to have only two classes female and male. Which codes would you assign to each new class? Remember: It’s ok if you don’t know how to code each step - it’s more important to have an idea of what you would like to do.
Discuss your step-by-step instructions with your team.

The next exercise will guide you through cleaning the sex codes. There are many ways of doing this. The one presented here might not be the same way you thought about doing it - that’s ok! This one was designed to practice using the numpy.select() function.

6. Clean values

Create a new column called sex_simple using the numpy.select() function so that

‘F’, ‘f’, and ‘f_’ in the sex column get assigned to ‘female’,
‘M’, ‘m’, and ‘m_’ get assigned to ‘male’, and
anything else gets assigned np.nan

Check the counts of unique values (including NAs) in the new sex_simple column.

commit, pull, and push changes

7. Calculate mean weight

Use groupby() to calculate the mean weight by sex using the new column.

Write a full sentence explaining the results you obtained. Don’t forget to include units.

commit, pull, and push changes

8. Collect your code and explain your results

In a new code cell, collect all the relevant code to create a streamlined workflow to obtain the final result from exercise 7 starting from importing the data. Your code cell should:

Only print the final results for mean weight by sex_simple.
Not include output from intermediate variables or checks.
Not include methods or functions that do not directly contribute to the analysis (even if they don’t print anything ex: df.head()).
If appropriate, combine methods using code chaining instead of creating intermediate variables.
Comment your code following our class comments guidelines.
Use appropriate line breaks and indentation to make code readable.

commit, pull, and push changes

References

[1]

K. Kielland, F. S. Chapin, R. W. Ruess, and Bonanza Creek LTER, “Snowshoe hare physical data in Bonanza Creek Experimental Forest: 1999-Present.” Environmental Data Initiative, 2017. doi: 10.6073/PASTA/03DCE4856D79B91557D8E6CE2CBCDC14. Available: https://portal.edirepository.org/nis/mapbrowse?packageid=knb-lter-bnz.55.22. [Accessed: Oct. 12, 2024]