Assignment 1

This assignment covers topics in the notes from the Python review to the plotting with pandas lesson. Task 1 will contribute 20% to the total grade of the assignment and tasks 2 and 3 will contribute 40% each.

Submission instructions

This assignment is due by 11:59 pm on Saturday, September 12. All tasks for this assignment should be submitted via Gradescope. Make sure you double-check your submission to ensure it satisfies all the items in this checklist:

Answer for task 1 must be a PDF file.
Answers for tasks 2 and 3 must be submitted as .ipynb files (Jupyter Notebooks) to Gradescope, not a PDF, html or other format.
Ensure your notebooks include a link to your assignment’s GitHub repository in the designated section.
The notebooks you submit must have your solutions to the exercises, They should not be the blank template notebooks.
The notebooks you submit must include your code and all required rendered plots, graphs, and printed output. Run all cells before submitting your .ipynb file and make sure all the outputs are visible.
Double-check that each notebook or PDF is uploaded to the correct task on Gradescope.

Resubmissions after the due date due to not satisfying one of the checks above will be strictly held to the course’s 50%-regrade resubmission policy (see syllabus).

If you have any questions about assignment logistics, please reach out to the TA or instructor by 5 pm Friday, September 11.

Rename homework notebooks before uploading them to Gradescope

For your upcoming assignment submission, you’ll be downloading your notebooks and then uploading them to Gradescope. Before you upload your finished notebooks to Gradescope, please rename your notebooks so they are called

hwk1-task2-corals-YOURLASTNAME.ipynb and
hwk1-task3-earthquakes-YOURLASTNAME.ipynb.

It’s important to do this so we can keep track of resubmissions.

Thanks!

Updates to Gradescope’s autograder

Here’s updates about how auto-grading will work in this first assignment:

If you want to know your autograder score at any point, you may upload your notebook to the Homework 1 Task 2 - AUTOGRADER CHECK ONLY or Homework 1 Task 3 - AUTOGRADER CHECK ONLY assignments on gradescope.
- Once you submit your assignment, you will be able to see your total score for the auto-grading, not the score for individual questions.
- If you don’t have a 20/20 score in your auto-grade questions, it means there is some mistake with your code and you should go back and review it. If you can’t figure out where the issue is, discuss it with other people (first option always!), come see Annie or Carmen during OH, or use Slack.
The AUTOGRADER CHECK ONLY assignments on gradescope are strictly for you to see how you did on the assignment. We will not be using these grades at all
You must still submit your final assignment to the Homework 1- Task 2 - Corals and Homework 1 - Task 3 - Earthquakes assignment
Make sure you’re keeping up with your classmate’s questions and answers on Slack.
When submitting your final notebook, please make sure to follow the instructions above regarding how to name the notebook

Thanks for your patience as we work through these initial Autograder kinks!

Task 1: Datasheets for Datasets reading

So much goes into creating a dataset, and data is more than numbers and words in a file. Without a proper understanding of the whole context where data was created, biases, omissions, and inacuracies can go undetected. The Datasheets for Datasets [1] framework advocates for transparency about the purpose and contents of datasets.

Check out this short interview with lead author Dr. Timnit Gebru, the executive director of the Distributed Artificial Intelligence Research Institute (DAIR), on the motivation to write this article:

Read the paper and write a one-paragraph (between 100 and 150 words) reflection about it. Review the rubric for this assignment here. Answer at least one of the following questions for your reflection:

Can you think of a dataset you have worked with or encountered in your studies that would have benefited from a datasheet? Explain why or why not, using specific details about the dataset’s context, collection methods, or biases.
What do you think are the limitations of the datasheets framework? Are there any challenges or risks associated with this approach, and how might they be addressed in practical settings?
How does the topic of transparency in datasets relate to your understanding of ethical data science practices? Provide an example where increased transparency could have changed the outcome of a dataset you have used or read about.
Based on your previous professional experience, if you were tasked with creating a dataset for a project, what challenges or decisions would you face when creating its datasheet? Reflect on one or two aspects of data collection or transparency that you feel are particularly important.

Ready to submit your answer? Make sure your submission follows the checklist at the top of the assginment!

Setup for tasks 2 and 3

Fork this repository: https://github.com/MEDS-eds-220/eds220-hwk1
In the workbench-1 server, start a new JupyterLab session or access an active one.
Using the terminal, clone your eds220-hwk1 repository into your eds-220 directory.
In the terminal, use cd to navigate into the eds-220-hwk1 directory. Use pwd to verify eds-220-hwk1 is your current working directory.

Task 2: Exploring coral diversity data

For this task we are going to use data about Western Indian Ocean Coral Diversity [2] stored in the the Knowledge Network for Biocomplexity (KNB) data repository. The author for this dataset is Dr. Tim McClanahan, senior conservation zoologist at Wildlife Conservation Society.

Dr. Tim McClanahan underwater surveying coral reefs in coastal Tanzania. Photo credit: ©Michael Markovina. From the online article How Mount Kilimanjaro and We Can Save Corals

Follow the instructions in the notebook hwk1-task2-corals.ipynb to complete this task. Review the rubric for this assignment here. In this task you will practice:

preliminary data exploration
accessing data using a URL from a data archive
selecting data from a data frame
basic git workflow
commenting your code

Ready to submit your answers? Make sure your submission follows the checklist at the top of the assginment!

Task 3: `pandas` fundamentals with earthquake data

This task is adapted from the Pandas Fundamentals with Earthquake Data assignment from the e-book Earth and Environmental Data Science [3].

You will use simplified data from the USGS Earthquakes Database.

Follow the instructions in the notebook hwk1-task3-earthquakes.ipynb to complete this task.Review the rubric for this assignment here. Here you will practice:

accessing data from your directory
selecting data from a data frame
creating exploratory graphs
basic git workflow
commenting your code

Ready to submit your answers? Make sure your submission follows the checklist at the top of the assginment!

References

[1]

T. Gebru et al., “Datasheets for datasets,” Commun. ACM, vol. 64, no. 12, pp. 86–92, Nov. 2021, doi: 10.1145/3458723. Available: https://doi.org/10.1145/3458723

[2]

T. McClanahan, “Western Indian Ocean Coral Diversity.” KNB Data Repository, 2023. doi: 10.5063/F1K35S3H. Available: https://knb.ecoinformatics.org/view/doi:10.5063/F1K35S3H. [Accessed: Sep. 16, 2024]

[3]

R. Abernathey, K. Key, T. Crone, and J. Busecke, An Introduction to Earth and Environmental Data Science#. Available: https://earth-env-data-science.github.io/intro.html. [Accessed: Sep. 16, 2024]

Assignment 1

Submission instructions

Task 1: Datasheets for Datasets reading

Setup for tasks 2 and 3

Task 2: Exploring coral diversity data

Task 3: pandas fundamentals with earthquake data

References

Task 3: `pandas` fundamentals with earthquake data