Snowshoe hares at Bonanza Creek Experimental Forest
Week 3 - Discussion section
This discussion section will guide you through exploring data about snowshoe hares (Lepus americanus) in the Bonanza Creek Experimental Forest located in Alaska, USA. In this discussion section, you will:
- Practice markdown syntax for creating tables and inserting images
- Practice detecting and cleaning messy data
- Use `groupby()` to calculate summary statistics by groups
- Select, clean, and comment your code to create a condensed data analysis workflow
Setup
General directions
About the data
For these exercises we will use data about snowshoe hares (Lepus americanus) in the Bonanza Creek Experimental Forest [1].
This dataset is stored in the Environmental Data Initiative (EDI) data repository. This is a large data repository committed to making data Findable, Accessible, Interoperable, and Reusable (FAIR). It is the main repository for all the data associated with the Long Term Ecological Research Network (LTER).
1. Archive exploration
Take some time to look through the dataset’s description in EDI and click around. Discuss the following questions with your team:
- What is this data about?
- During what time frame were the observations in the dataset collected?
- Does the dataset contain sensitive data?
- Is there a publication associated with this dataset?
In your notebook: use a markdown cell to add a brief description of the dataset, including a citation, date of access, and a link to the archive.
Back in the EDI repository, click on View Full Metadata to access more information if you haven’t done so already. Go to the “Detailed Metadata” section and click on “Data Entities”. Take some time to look at the descriptions for the dataset’s columns.
2. Adding an image
Back in your notebook, follow these steps to add an image of a hare using a URL:
Go to this link.
Get the URL of the hare image. To do this:
- hover over the image –> right click –> “Copy Image Address”.
- At the end of the markdown cell with the dataset description, use markdown syntax to add the image from its URL:
![image description](URL-goes-here)
- Do you need to add an attribution in the image description? Check the license at the bottom of the Wikimedia page.
commit, pull, and push changes
3. Data loading and preliminary exploration
- Back in your notebook, import the `55_Hare_Data_2012.txt` file from its URL using the `pandas.read_csv()` function. Store it in a variable named `hares`.
- Using `pandas` methods, obtain preliminary information and explore this data frame. Consider answering some of these questions:
- What are the dimensions of the dataframe and what are the data types of the columns? Do the data types match what you would expect from each column?
- Are there any columns that have a significant number of NA values?
- What are the minimum and maximum values for the weight and hind feet measurements?
- What are the unique values for some of the categorical columns?
- An exploratory question about the data frame that you come up with!
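The exploration questions above can be sketched as follows. This is only an illustration under stated assumptions: the data file's URL is not reproduced here, so a tiny made-up CSV stands in for it, and the column names `weight` and `hindft` are assumed. Check the real column names in the metadata before reusing any of this.

```python
import io

import pandas as pd

# Made-up stand-in for the real file so the sketch runs offline.
# In the exercise, pass the data file's URL to pd.read_csv() instead.
toy_csv = io.StringIO(
    "date,sex,weight,hindft\n"
    "2012-06-01,f,1400,130\n"
    "2012-06-01,M,,128\n"
    "2012-06-02,m,1250,\n"
    "2012-06-03,,1510,135\n"
)
hares = pd.read_csv(toy_csv)

# Dimensions and data types
print(hares.shape)
print(hares.dtypes)

# Number of NA values per column
print(hares.isna().sum())

# Minimum and maximum of the (assumed) measurement columns
print(hares[["weight", "hindft"]].agg(["min", "max"]))

# Unique values of a categorical column, NA included
print(hares["sex"].unique())
```

With the real file, only the argument to `pd.read_csv()` changes. The remaining calls work on any data frame that has columns with these names.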
CHECK IN WITH YOUR TEAM
MAKE SURE YOU’VE ALL SUCCESSFULLY ACCESSED THE DATA BEFORE CONTINUING
commit, pull, and push changes
4. Detecting messy values
- In the metadata section of the EDI repository, find which are the allowed values for the hares’ sex. Create a small table in a markdown cell showing the values and their definitions.
- Get the number of times each unique sex non-NA value appears.
- Check the documentation of `value_counts()`. What is the purpose of the `dropna` parameter and what is its default value? Repeat the previous step, this time adding the `dropna=False` parameter to `value_counts()`.
- Discuss with your team the output of the unique value counts. In particular:
  - Do the values in the `sex` column correspond to the values declared in the metadata?
  - What could have been potential causes for multiple codes?
  - Are there seemingly repeated values? If so, what could be the cause?
- Write code to confirm your suspicions about the seemingly repeated values.
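As a rough sketch of how such checks might look, the example below uses a made-up series of codes, not the real `sex` column. One value is deliberately given a trailing space to show one possible cause of values that look repeated. Whether this is what happens in the hares data is for you to confirm.

```python
import numpy as np
import pandas as pd

# Made-up codes, NOT the real data: "f " carries a trailing space.
sex = pd.Series(["f", "m", "f ", "M", np.nan, "m", "f"])

# Counts without and with NA values
counts_default = sex.value_counts()
counts_with_na = sex.value_counts(dropna=False)
print(counts_default)
print(counts_with_na)

# repr() makes otherwise invisible characters (like spaces) visible
print([repr(code) for code in sex.dropna().unique()])

# Keep only the entries that change when surrounding whitespace is stripped
non_na = sex.dropna()
padded = non_na[non_na != non_na.str.strip()]
print(padded)
```

Comparing each value with its stripped version is only one way to look for hidden characters. Checking string lengths with `.str.len()` would be another.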
commit, pull, and push changes
5. Brainstorm
Individually, write step-by-step instructions on how you would wrangle the `hares` data frame to clean the values in the `sex` column to have only two classes, `female` and `male`. Which codes would you assign to each new class? Remember: It’s ok if you don’t know how to code each step - it’s more important to have an idea of what you would like to do.
Discuss your step-by-step instructions with your team.
The next exercise will guide you through cleaning the sex codes. There are many ways of doing this. The one presented here might not be the same way you thought about doing it - that’s ok! This one was designed to practice using the `numpy.select()` function.
6. Clean values
- Create a new column called `sex_simple` using the `numpy.select()` function so that
  - ‘F’, ‘f’, and ‘f_’ in the `sex` column get assigned to ‘female’,
  - ‘M’, ‘m’, and ‘m_’ get assigned to ‘male’, and
  - anything else gets assigned `np.nan`.
- Check the counts of unique values (including NAs) in the new `sex_simple` column.
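A minimal sketch of this mapping on a made-up data frame (the `toy` frame and its values are illustrative only) could look like the following. One detail is an assumption of this sketch rather than part of the exercise: the choices are wrapped as object-dtype arrays, because, depending on the NumPy version, mixing plain string choices with a float `np.nan` default may raise a dtype error or turn the missing value into the text 'nan'.

```python
import numpy as np
import pandas as pd

# Made-up codes for illustration; "pf" and NaN should end up as missing.
toy = pd.DataFrame({"sex": ["F", "f", "f_", "M", "m", "m_", "pf", np.nan]})

# One boolean condition per new class, checked in order
conditions = [
    toy["sex"].isin(["F", "f", "f_"]),
    toy["sex"].isin(["M", "m", "m_"]),
]

# Object-dtype choices so strings and np.nan can share one result array
choices = [
    np.array("female", dtype=object),
    np.array("male", dtype=object),
]

# Rows matching no condition receive the default value
toy["sex_simple"] = np.select(conditions, choices, default=np.nan)

print(toy["sex_simple"].value_counts(dropna=False))
```

Rows whose original code is missing fail both `isin()` checks, so they also fall through to the default.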
commit, pull, and push changes
7. Calculate mean weight
- Use `groupby()` to calculate the mean weight by sex using the new column.
- Write a full sentence explaining the results you obtained. Don’t forget to include units.
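For reference, a grouped mean on a made-up data frame might be sketched as below. The values are invented and the column name `weight` is assumed; look up the units of the real weight column in the metadata.

```python
import numpy as np
import pandas as pd

# Invented values for illustration only
toy = pd.DataFrame({
    "sex_simple": ["female", "female", "male", "male", np.nan],
    "weight": [1400.0, 1500.0, 1200.0, np.nan, 1600.0],
})

# Mean weight per class: rows with a missing group key are left out
# by default, and missing weights are skipped when averaging
mean_weight = toy.groupby("sex_simple")["weight"].mean()
print(mean_weight)
```

Here the row with a missing `sex_simple` does not form its own group, so the result has one entry per class.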
commit, pull, and push changes
8. Collect your code and explain your results
In a new code cell, collect all the relevant code to create a streamlined workflow to obtain the final result from exercise 7 starting from importing the data. Your code cell should:
- Only print the final results for mean weight by `sex_simple`.
- Not include output from intermediate variables or checks.
- Not include methods or functions that do not directly contribute to the analysis (even if they don’t print anything, e.g. `df.head()`).
- If appropriate, combine methods using code chaining instead of creating intermediate variables.
- Comment your code following our class comments guidelines.
- Use appropriate line breaks and indentation to make code readable.
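One possible shape for such a condensed cell is sketched below, as an illustration of chaining and line breaks rather than a reference solution. It reads a made-up CSV in place of the real URL, and the helper name `simplify_sex` is hypothetical.

```python
import io

import numpy as np
import pandas as pd

# Made-up stand-in for the data file; use the real URL in your notebook.
toy_csv = io.StringIO("sex,weight\nF,1400\nf,1500\nm,1200\nM,1300\npf,1000\n")


def simplify_sex(df):
    """Map raw sex codes to 'female', 'male', or NaN (hypothetical helper)."""
    conditions = [
        df["sex"].isin(["F", "f", "f_"]),
        df["sex"].isin(["M", "m", "m_"]),
    ]
    # Object-dtype choices so strings and np.nan can share one result array
    choices = [
        np.array("female", dtype=object),
        np.array("male", dtype=object),
    ]
    return np.select(conditions, choices, default=np.nan)


# Load, simplify the sex codes, and average weight by class in one chain
mean_weight = (
    pd.read_csv(toy_csv)
    .assign(sex_simple=simplify_sex)
    .groupby("sex_simple")["weight"]
    .mean()
)

# Only the final result is printed
print(mean_weight)
```

Wrapping the chain in parentheses allows one method per line, which keeps each step easy to read and comment.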
commit, pull, and push changes