SET group syllabus

Published

July 28, 2024

Welcome to the Spatiotemporal Exposures and Toxicology group

Welcome to the Spatiotemporal Exposures and Toxicology group (a.k.a. SET group). This document is intended for current and prospective members of the SET group at the National Institute of Environmental Health Sciences (NIEHS). As an investigator, it is my job to ensure you have the resources and support you need to succeed in your research. This syllabus is a living document that outlines what is expected of you and the resources available to you as a member of the group. Please suggest improvements and additions through GitHub pull requests.

Sincerely,

Kyle P Messier, PhD

General On-Boarding Steps

  1. Get NIH Badge (background check, fingerprints, visa if applicable)
  2. Get NIH Email
  3. Get NIH Computer
  4. Set up HPC account with NIEHS OSC
  5. Set up GitHub account with NIEHS Organization
  6. Develop Individual Development Plan (eIDP)

Projects

  1. Fellows in the group are expected to have a primary project with the goal of a first-author publication
  2. Fellows are encouraged to collaborate with each other on secondary projects
  3. The group has an ongoing group project that everyone is expected to contribute to

Projects should be within the SET group’s research themes of (1) geospatial analysis, (2) mixtures and exposomic predictions, and (3) source-to-outcome-continuum or GeoTox modeling. However, I encourage everyone to provide their own unique perspective and expertise to these projects. Ultimately, a project will be successful if it is of great interest to you and you are passionate about it.

Project Management

We utilize GitHub Projects to manage our projects. Currently, each project has a project board with the following columns:

  1. New Issues: General ideas and tasks that need to be fleshed out
  2. Backlog: Fleshed-out ideas and tasks that need to be done. Tasks at the top of the backlog have the highest priority.
  3. In Progress: Tasks that are currently being worked on. A best practice is to minimize the number of tasks in this column.
  4. Review: Tasks that are ready for review by a collaborator
  5. Done: Completed tasks

GitHub Project tasks can be directly linked to issues on the corresponding GitHub repository. This allows for a seamless integration of project management and issue tracking.

Software Installations

  1. R, RStudio
  2. Python, VS Code
  3. Git
  4. Quarto
  5. QGIS
  6. Adobe Illustrator (optional)
  7. MySQL (optional)
  8. PostgreSQL (optional)

eIDP (Individual Development Plans)

In the first few weeks to months as a fellow, we will develop an eIDP. This will be a living document that outlines your career goals and the steps you will take to achieve them. We will review this document quarterly and update it as needed.

Software Best Practices

We focus on developing and promoting software and computational best practices, such as test-driven development (TDD) and open-source code, for the environmental health sciences. To this end, we have protocols in place to ensure that our code is well-documented, tested, and reproducible. Below are some of the key practices we follow:

Unit and Integration Testing

We will utilize various testing approaches to ensure functionality and quality of code.

Git + GitHub

Version control of software is essential for reproducibility and collaboration. We use Git and the NIEHS Enterprise GitHub for version control and collaboration.

CI/CD Workflows

Within GitHub, we will utilize continuous integration and continuous deployment workflows to ensure that our code is always functional and up-to-date. Several branch protection rules are set up and enforced for our GitHub repositories:

  1. Require pull request and 1 review before merging to main
  2. Test pass: Linting: Code shall adhere to the style/linting rules defined in the repository.
  3. Test pass: Test Coverage: A given push should not decrease overall test coverage of the repository.
  4. Test pass: Build Checks: The code should build without errors or warnings.

The CI/CD workflows in GitHub are set up to run on every push to the main branch and on every pull request. The workflows are defined in YAML files in the .github/workflows directory of the repository.

Processes to test or check

  1. data type
  2. data name
  3. data size
  4. relative paths
  5. the output of one module matches the expected input of the next module
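
The checks listed above can be sketched in Python. This is a minimal illustration using only the standard library, not the group's actual test suite. The table layout and the names `make_exposure_table`, `EXPECTED_COLUMNS`, `check_table`, and `is_relative` are hypothetical, chosen only for this example.

```python
from pathlib import PurePosixPath


def make_exposure_table():
    """Hypothetical upstream module: returns a table as a dict of columns."""
    return {
        "site_id": ["A1", "B2", "C3"],
        "pm25": [8.1, 12.4, 9.7],
    }


# Hypothetical contract that the downstream module relies on.
EXPECTED_COLUMNS = {"site_id": str, "pm25": float}


def check_table(table, expected_columns, expected_rows):
    """Return a list of problems; an empty list means the table passes."""
    problems = []
    # Data name: every expected column is present.
    for name in expected_columns:
        if name not in table:
            problems.append(f"missing column: {name}")
    # Data type: every value has the expected type.
    for name, expected_type in expected_columns.items():
        for value in table.get(name, []):
            if not isinstance(value, expected_type):
                problems.append(f"{name}: {value!r} is not {expected_type.__name__}")
    # Data size: every column has the expected number of rows.
    for name, values in table.items():
        if len(values) != expected_rows:
            problems.append(f"{name}: expected {expected_rows} rows, got {len(values)}")
    return problems


def is_relative(path):
    """Relative paths keep a project portable (POSIX-style paths assumed)."""
    return not PurePosixPath(path).is_absolute()


# The output of the upstream module is checked against what the next module expects.
assert check_table(make_exposure_table(), EXPECTED_COLUMNS, expected_rows=3) == []
assert is_relative("data/input/pm25.csv")
```

Returning a list of problems rather than raising on the first failure makes it easier to report everything that is wrong with a dataset in a single test run.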

Test-Driven Development

Starting from the end product, we work backwards while articulating the tests needed at each stage.

Key Points of Unit and Integration Testing

File Type

  1. NetCDF
  2. Numeric, double precision
  3. NA values
  4. Variable names exist
  5. Naming convention

Stats

  1. Non-negative variance (\(\sigma^2\))
  2. Mean is reasonable (\(\mu\))
  3. SI units

Domain

  1. In the geographic domain (e.g. US + buffer)
  2. In the time range (e.g. 2018-2022)

Geographic

  1. Projections
  2. Coordinate names (e.g. lat/lon)
  3. Time in acceptable format
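
Several of the key points above can be expressed as automated checks. The sketch below is illustrative only: it uses plain Python records in place of a real NetCDF file, and the bounds `LAT_RANGE`, `LON_RANGE`, and `TIME_RANGE`, the sample `RECORDS`, and the function `check_records` are assumptions made for the example rather than values used by the group.

```python
import math
import statistics
from datetime import date

# Hypothetical records standing in for values read from a NetCDF file.
RECORDS = [
    {"lat": 35.9, "lon": -78.9, "time": "2019-06-01", "pm25": 8.1},
    {"lat": 40.7, "lon": -74.0, "time": "2020-01-15", "pm25": 12.4},
    {"lat": 34.1, "lon": -118.2, "time": "2021-11-30", "pm25": 9.7},
]

# Assumed bounds: a rough box around the contiguous US plus a buffer.
LAT_RANGE = (20.0, 55.0)
LON_RANGE = (-130.0, -60.0)
TIME_RANGE = (date(2018, 1, 1), date(2022, 12, 31))


def check_records(records, variable, mean_range):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    values = []
    for i, rec in enumerate(records):
        # Variable names exist, including the coordinate names.
        missing = [k for k in ("lat", "lon", "time", variable) if k not in rec]
        if missing:
            problems.append(f"record {i}: missing {missing}")
            continue
        value = rec[variable]
        # File type: numeric (double precision float) and not missing (NaN).
        if not isinstance(value, float) or math.isnan(value):
            problems.append(f"record {i}: {variable} is not a valid float")
            continue
        values.append(value)
        # Domain: inside the geographic box.
        if not (LAT_RANGE[0] <= rec["lat"] <= LAT_RANGE[1]
                and LON_RANGE[0] <= rec["lon"] <= LON_RANGE[1]):
            problems.append(f"record {i}: outside geographic domain")
        # Time format: ISO 8601 dates parse; anything else raises ValueError.
        try:
            when = date.fromisoformat(rec["time"])
        except ValueError:
            problems.append(f"record {i}: time is not an ISO 8601 date")
            continue
        # Domain: inside the time range.
        if not TIME_RANGE[0] <= when <= TIME_RANGE[1]:
            problems.append(f"record {i}: outside time range")
    # Stats: the variance check matters most when variance is read from a
    # file; a variance computed here cannot be negative.
    if len(values) >= 2:
        if statistics.variance(values) < 0:
            problems.append("negative variance")
        mean = statistics.fmean(values)
        if not mean_range[0] <= mean <= mean_range[1]:
            problems.append(f"mean {mean:.2f} outside {mean_range}")
    return problems


assert check_records(RECORDS, "pm25", mean_range=(0.0, 50.0)) == []
```

What counts as a "reasonable" mean depends on the variable, so the acceptable range is passed in as an argument instead of being hard-coded.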

Test-Driven Development (TDD): Key Steps

  1. Write a Test: Before you start writing any code, you write a test case for the functionality you want to implement. This test should fail initially because you haven’t written the code to make it pass yet. The test defines the expected behavior of your code.

  2. Run the Test: Run the test to ensure it fails. This step confirms that your test is correctly assessing the functionality you want to implement.

  3. Write the Minimum Code: Write the minimum amount of code required to make the test pass. Don’t worry about writing perfect or complete code at this stage; the goal is just to make the test pass.

  4. Run the Test Again: After writing the code, run the test again. If it passes, it means your code now meets the specified requirements.

  5. Refactor (if necessary): If your code is working and the test passes, you can refactor your code to improve its quality, readability, or performance. The key here is that you should have test coverage to ensure you don’t introduce new bugs while refactoring.

  6. Repeat: Continue this cycle of writing a test, making it fail, writing the code to make it pass, and refactoring as needed. Each cycle should be very short and focused on a small piece of functionality.

  7. Complete the Feature: Keep repeating the process until your code meets all the requirements for the feature you’re working on.

TDD helps ensure that your code is reliable and that it remains functional as you make changes and updates. It also encourages a clear understanding of the requirements and promotes better code design.
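
One pass through the cycle described above might look like the following sketch. The helper `clip_negative` and its test are hypothetical and exist only to illustrate the order of the steps. In practice the test would usually live in its own file and be run with a test runner such as pytest.

```python
# Step 1: write the test first. It names a function that does not exist yet,
# so calling the test at this point would fail with a NameError (step 2).
def test_clip_negative():
    assert clip_negative([1.5, -0.2, 0.0]) == [1.5, 0.0, 0.0]
    assert clip_negative([]) == []


# Step 3: write the minimum code that makes the test pass.
def clip_negative(values):
    """Replace negative concentrations with zero (hypothetical helper)."""
    return [max(v, 0.0) for v in values]


# Step 4: run the test again; it now passes without raising.
test_clip_negative()
```

With the test in place, the body of `clip_negative` can later be refactored, for example to work on arrays, and the same test will reveal any change in behavior.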

_targets and/or snakemake pipelines

We will utilize the targets and/or snakemake packages in R and Python respectively to create reproducible workflows for our data analysis. These packages allow us to define the dependencies between the steps in our analysis and ensure that our analysis is reproducible. Additionally, they keep track of pipeline objects and skip steps that have already been run, saving time and resources.
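
The idea of tracking dependencies and skipping steps that are already up to date can be illustrated with a toy example. The class below is a simplified sketch written for this document; it is not the targets or snakemake API, and `TinyPipeline` and its methods are hypothetical. Unlike the real tools, it does not detect changes to the code of a step, only changes to the values a step receives.

```python
import hashlib
import json


class TinyPipeline:
    """Toy pipeline: builds steps on demand and skips steps whose inputs are unchanged."""

    def __init__(self):
        self.inputs = {}  # name -> external input value
        self.steps = {}   # name -> (function, dependency names)
        self.cache = {}   # name -> (fingerprint of arguments, stored result)
        self.ran = []     # names of steps that were actually executed

    def set_input(self, name, value):
        self.inputs[name] = value

    def add_step(self, name, func, deps):
        self.steps[name] = (func, tuple(deps))

    def build(self, name):
        if name in self.inputs:
            return self.inputs[name]
        func, deps = self.steps[name]
        # Build every dependency first, so the steps run in dependency order.
        args = [self.build(dep) for dep in deps]
        # Fingerprint the arguments (they must be JSON-serializable here).
        fingerprint = hashlib.sha256(json.dumps(args).encode()).hexdigest()
        cached = self.cache.get(name)
        if cached is not None and cached[0] == fingerprint:
            return cached[1]  # up to date: skip the step
        result = func(*args)
        self.cache[name] = (fingerprint, result)
        self.ran.append(name)
        return result


pipeline = TinyPipeline()
pipeline.set_input("raw", [3.0, -1.0, 5.0])
pipeline.add_step("clean", lambda raw: [v for v in raw if v >= 0], deps=["raw"])
pipeline.add_step("mean", lambda clean: sum(clean) / len(clean), deps=["clean"])

first = pipeline.build("mean")   # executes "clean", then "mean"
pipeline.ran.clear()
second = pipeline.build("mean")  # nothing is executed; results come from the cache
```

If the raw input changes in a way that leaves the cleaned data identical, only the `clean` step reruns and `mean` is still skipped, which is the same kind of saving that the real tools provide on much larger analyses.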

Some Benefits of _targets and/or snakemake pipelines

  1. Reproducibility: By defining the dependencies between the steps in our analysis, we ensure that our analysis is reproducible. This is essential for scientific research and data analysis.

  2. High-Level Abstraction: _targets and snakemake allow us to define our analysis at a high level of abstraction, making it easier to understand and maintain.

  3. Testing: Creating pipelines and writing unit/integration tests go hand in hand. As we write the pipeline, the tests to write become obvious.

Writing and Presentations

  1. Overleaf. I have a template for manuscripts and presentations. This is a good option when the writing includes many equations.
  2. Quarto. I have a template for manuscripts and presentations. Quarto RevealJS and Beamer produce attractive presentations that can be easily rendered to HTML or PDF and shared on our websites.

We will use these tools to write manuscripts and presentations.

Conferences

We attend one to three major research society conferences per year. These conferences are a great opportunity to present our research, network with other researchers, and learn about the latest developments in our field.

The conferences can vary year-to-year depending on the research of each person, but some of the conferences we typically attend include:

  1. American Geophysical Union (AGU)
  2. Society of Toxicology (SOT)
  3. International Society for Environmental Epidemiology (ISEE)
  4. International Society for Exposure Science (ISES)

Group Meetings

We have one group meeting per week:

Thursdays from 10:00 to 11:30 am

The general format of the meetings is as follows:

  1. Check-in: Each member of the group will give a brief update on their progress and any issues they are facing

  2. Rubber Ducking: We’ll spend 10 to 20 minutes looking at someone’s code, so that everyone can help and also learn new techniques, tricks, packages, etc. This could include code in progress, code that is not working, efficiency improvements, or unit/integration testing review.

  3. Logistics: We’ll discuss any upcoming deadlines, meetings, etc.

Attendance and Participation

Attendance and engagement in your own research and the research of others are critical to the success of the group. I expect you to attend all group meetings and participate in group activities. If you are unable to attend a meeting, please let me know in advance.

In general, we work in the office, but we are flexible about telework days.

Time Management

I expect you to manage your time effectively and meet deadlines. If you are having trouble meeting a deadline, please let me know as soon as possible so we can work together to find a solution. There is no set number of hours per week that you are expected to work, but I expect you to be productive and engaged in your work.

Time Off

We follow the NIH guidelines for time off. However, in general, my philosophy is to take time off when you need it. I trust that you will get your work done and that you will communicate with me if you need time off. If you are getting work done, participating and engaging with group activities, and meeting deadlines, I am happy to be flexible with time off.

Required Trainings

  1. NIH Ethics Training: All members of the group are required to complete the NIH Ethics Training. This training covers topics such as conflict of interest, research misconduct, and ethical conduct in research. The training must be completed annually.

  2. NIH Information Security Training: All members of the group are required to complete the NIH Information Security Training. This training covers topics such as data security, privacy, and information handling. The training must be completed annually.

Resources

  1. NIH Office of Intramural Training and Education (OITE) https://www.training.nih.gov/
  2. Trainee and Fellow Policies. Includes health insurance, family leave, outside activities, etc. https://www.training.nih.gov/policies/trainee-fellow-policies/
  3. NIH Library https://www.nihlibrary.nih.gov/
  4. NIH Data Science Interest Group https://datascience.nih.gov/
  5. NIH Data Science Training https://datascience.nih.gov/data-science-training
  6. Becoming a Resilient Scientist https://www.training.nih.gov/becoming_a_resilient_scientist
  7. NIH Library Training https://www.nihlibrary.nih.gov/training