Assignment 1

This assignment covers topics in the notes from the Python review to the plotting with pandas lesson. Task 1 will contribute 20% to the total grade of the assignment and tasks 2 and 3 will contribute 40% each.

Submission instructions

This assignment is due by 11:59 pm on Saturday, September 12. All tasks for this assignment should be submitted via Gradescope. Make sure you double-check your submission to ensure it satisfies all the items in this checklist:

Resubmissions after the due date due to not satisfying one of the checks above will be strictly held to the course’s 50%-regrade resubmission policy (see syllabus).

If you have any questions about assignment logistics, please reach out to the TA or instructor by 5 pm Friday, September 11.

Rename homework notebooks before uploading them to Gradescope

To submit this assignment, you will download your notebooks and then upload them to Gradescope. Before uploading your finished notebooks, please rename them so they are called

  • hwk1-task2-corals-YOURLASTNAME.ipynb and
  • hwk1-task3-earthquakes-YOURLASTNAME.ipynb.

It’s important to do this so we can keep track of resubmissions.
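If you would rather rename the files with code than by hand, the sketch below shows one possible way. It assumes the downloaded notebooks still have their original names (hwk1-task2-corals.ipynb and hwk1-task3-earthquakes.ipynb) and sit in a single folder. The function names `submission_name` and `rename_notebooks` are made up for this illustration and are not part of any course tooling.

```python
from pathlib import Path

# Original notebook names, as given in the task descriptions
NOTEBOOKS = ["hwk1-task2-corals.ipynb", "hwk1-task3-earthquakes.ipynb"]


def submission_name(notebook: str, last_name: str) -> str:
    """Return the notebook name with -LASTNAME added before the .ipynb extension."""
    path = Path(notebook)
    return f"{path.stem}-{last_name}{path.suffix}"


def rename_notebooks(folder: str, last_name: str) -> list[Path]:
    """Rename the homework notebooks found in `folder` and return the new paths."""
    renamed = []
    for name in NOTEBOOKS:
        source = Path(folder) / name
        if source.exists():  # skip notebooks that are not in this folder
            target = source.with_name(submission_name(name, last_name))
            source.rename(target)
            renamed.append(target)
    return renamed
```

For example, `rename_notebooks("Downloads", "DOE")` would rename whichever of the two notebooks it finds in a folder called Downloads; both the folder and the last name are placeholders.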

Thanks!

Updates to Gradescope’s autograder

Here are some updates on how auto-grading will work in this first assignment:

  • If you want to know your autograder score at any point, you may upload your notebook to the Homework 1 Task 2 - AUTOGRADER CHECK ONLY or Homework 1 Task 3 - AUTOGRADER CHECK ONLY assignments on Gradescope.
    • Once you submit your assignment, you will be able to see your total score for the auto-grading, not the score for individual questions.
    • If you don’t have a 20/20 score in your auto-graded questions, there is a mistake somewhere in your code and you should go back and review it. If you can’t figure out where the issue is, discuss it with other people (first option always!), come see Annie or Carmen during office hours, or use Slack.
  • The AUTOGRADER CHECK ONLY assignments on Gradescope are strictly for you to see how you did on the assignment. We will not be using these grades at all.
  • You must still submit your final assignment to the Homework 1- Task 2 - Corals and Homework 1 - Task 3 - Earthquakes assignments.
  • Make sure you’re keeping up with your classmates’ questions and answers on Slack.
  • When submitting your final notebook, please make sure to follow the instructions above regarding how to name the notebook.

Thanks for your patience as we work through these initial Autograder kinks!

Task 1: Datasheets for Datasets reading

So much goes into creating a dataset, and data is more than numbers and words in a file. Without a proper understanding of the full context in which data was created, biases, omissions, and inaccuracies can go undetected. The Datasheets for Datasets [1] framework advocates for transparency about the purpose and contents of datasets.

Check out this short interview with lead author Dr. Timnit Gebru, the executive director of the Distributed Artificial Intelligence Research Institute (DAIR), on the motivation to write this article:

Read the paper and write a one-paragraph (between 100 and 150 words) reflection about it. Review the rubric for this assignment here. Answer at least one of the following questions for your reflection:

  1. Can you think of a dataset you have worked with or encountered in your studies that would have benefited from a datasheet? Explain why or why not, using specific details about the dataset’s context, collection methods, or biases.

  2. What do you think are the limitations of the datasheets framework? Are there any challenges or risks associated with this approach, and how might they be addressed in practical settings?

  3. How does the topic of transparency in datasets relate to your understanding of ethical data science practices? Provide an example where increased transparency could have changed the outcome of a dataset you have used or read about.

  4. Based on your previous professional experience, if you were tasked with creating a dataset for a project, what challenges or decisions would you face when creating its datasheet? Reflect on one or two aspects of data collection or transparency that you feel are particularly important.

Ready to submit your answer? Make sure your submission follows the checklist at the top of the assignment!

Setup for tasks 2 and 3

  1. Fork this repository: https://github.com/MEDS-eds-220/eds220-hwk1

  2. In the workbench-1 server, start a new JupyterLab session or access an active one.

  3. Using the terminal, clone your eds220-hwk1 repository into your eds-220 directory.

  4. In the terminal, use cd to navigate into the eds220-hwk1 directory. Use pwd to verify eds220-hwk1 is your current working directory.

Task 2: Exploring coral diversity data

For this task we are going to use data about Western Indian Ocean Coral Diversity [2] stored in the Knowledge Network for Biocomplexity (KNB) data repository. The author of this dataset is Dr. Tim McClanahan, senior conservation zoologist at the Wildlife Conservation Society.

Dr. Tim McClanahan underwater surveying coral reefs in coastal Tanzania. Photo credit: ©Michael Markovina. From the online article How Mount Kilimanjaro and We Can Save Corals.

Follow the instructions in the notebook hwk1-task2-corals.ipynb to complete this task. Review the rubric for this assignment here. In this task you will practice:

  • preliminary data exploration
  • accessing data using a URL from a data archive
  • selecting data from a data frame
  • basic git workflow
  • commenting your code
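As a rough illustration of the first three skills, here is a minimal sketch. To stay self-contained it reads a tiny made-up CSV from a string; `pd.read_csv` also accepts a URL string in the same position, which is one way data from an archive can be loaded. The column names and values are invented for this example and are not those of the coral dataset.

```python
import io

import pandas as pd

# Made-up stand-in for a CSV file; pd.read_csv also accepts a URL string here
csv_text = """site,year,genus_count
A,2019,12
B,2019,30
C,2020,25
"""
df = pd.read_csv(io.StringIO(csv_text))

# Preliminary exploration: dimensions, column types, first rows
print(df.shape)
print(df.dtypes)
print(df.head())

# Select a single column (returns a pandas Series)
counts = df["genus_count"]

# Select rows that satisfy a condition, keeping only some columns
rich_sites = df.loc[df["genus_count"] > 20, ["site", "genus_count"]]
print(rich_sites)
```

Using `.loc` with a boolean condition and a list of column names keeps the row filter and the column choice in one step, which tends to be easier to read than chaining two selections.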

Ready to submit your answers? Make sure your submission follows the checklist at the top of the assignment!

Task 3: pandas fundamentals with earthquake data

This task is adapted from the Pandas Fundamentals with Earthquake Data assignment from the e-book Earth and Environmental Data Science [3].

You will use simplified data from the USGS Earthquakes Database.

Follow the instructions in the notebook hwk1-task3-earthquakes.ipynb to complete this task. Review the rubric for this assignment here. Here you will practice:

  • accessing data from your directory
  • selecting data from a data frame
  • creating exploratory graphs
  • basic git workflow
  • commenting your code
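The sketch below gives a rough idea of what reading a file from a directory, selecting rows, and preparing an exploratory graph can look like. It writes its own tiny made-up file so that it runs anywhere; the file name, column names, and values are assumptions for illustration and do not describe the assignment's data file. The helper `plot_magnitudes` is likewise hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Write a tiny made-up file so the sketch can run anywhere;
# in the assignment the data file already exists in your repository
data_dir = Path(tempfile.mkdtemp())
file_path = data_dir / "example_quakes.csv"
file_path.write_text("id,mag,place\nq1,4.5,Chile\nq2,6.1,Japan\nq3,5.2,Chile\n")

# Access data from a directory, using the id column as the index
quakes = pd.read_csv(file_path, index_col="id")

# Select the two rows with the largest magnitudes
strongest = quakes.nlargest(2, "mag")
print(strongest)


def plot_magnitudes(df: pd.DataFrame):
    """Draw a histogram of the mag column and return the matplotlib Axes."""
    ax = df["mag"].plot(kind="hist", title="Example magnitudes")
    ax.set_xlabel("Magnitude")
    return ax
```

In a notebook, calling `plot_magnitudes(quakes)` in a cell would display the histogram, provided matplotlib is installed.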

Ready to submit your answers? Make sure your submission follows the checklist at the top of the assignment!

References

[1] T. Gebru et al., “Datasheets for datasets,” Commun. ACM, vol. 64, no. 12, pp. 86–92, Nov. 2021, doi: 10.1145/3458723. Available: https://doi.org/10.1145/3458723
[2] T. McClanahan, “Western Indian Ocean Coral Diversity.” KNB Data Repository, 2023. doi: 10.5063/F1K35S3H. Available: https://knb.ecoinformatics.org/view/doi:10.5063/F1K35S3H. [Accessed: Sep. 16, 2024]
[3] R. Abernathey, K. Key, T. Crone, and J. Busecke, An Introduction to Earth and Environmental Data Science. Available: https://earth-env-data-science.github.io/intro.html. [Accessed: Sep. 16, 2024]