Workflows and Pipelines

Modified: August 1, 2024

Introduction

Workflows and pipelines are essential tools for automating and streamlining data analysis and computational tasks. They help researchers and developers manage complex processes, track dependencies, and ensure reproducibility. By defining a series of steps and dependencies, workflows and pipelines can be executed sequentially or in parallel, enabling efficient data processing and analysis.

For a list of possible tools and frameworks for building workflows and pipelines, see the Awesome Pipelines list.

Below are some tools and frameworks used by NIEHS scientific developers.

targets R Package

The targets package is a Make-like pipeline tool for statistics and data science in R. The package skips costly runtime for tasks that are already up to date, orchestrates the necessary computation with implicit parallel computing, and abstracts files as R objects. If all the current output matches the current upstream code and data, then the whole pipeline is up to date, and the results are more trustworthy than otherwise.1

For documentation on the targets R package, see The {targets} R package user manual.
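As a sketch of how a targets pipeline is declared, a `_targets.R` file returns a list of targets, each of which re-runs only when its upstream code or data changes. The file name `data.csv` and the helper functions below are hypothetical:

```r
# _targets.R: minimal hypothetical pipeline (file and helper names are illustrative)
library(targets)

# In a real project these helpers would usually live in R/functions.R
read_data <- function(file) read.csv(file)
summarize_data <- function(data) summary(data)

list(
  tar_target(raw_file, "data.csv", format = "file"),  # track the input file itself
  tar_target(raw_data, read_data(raw_file)),          # re-runs only if raw_file changes
  tar_target(data_summary, summarize_data(raw_data))  # depends on raw_data
)
```

Running `tar_make()` then executes only the targets that are outdated, skipping any whose upstream code and data are unchanged.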

snakemake Workflow Management System

The Snakemake workflow management system is a tool to create reproducible and scalable data analyses. Workflows are described via a human-readable, Python-based language. They can be seamlessly scaled to server, cluster, grid, and cloud environments, without the need to modify the workflow definition. Finally, Snakemake workflows can entail a description of required software, which will be automatically deployed to any execution environment.2

For documentation on the Snakemake workflow management system, see the Snakemake Documentation.
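As a sketch of Snakemake's rule-based style, a `Snakefile` declares rules whose inputs and outputs define the dependency graph; Snakemake works backward from the target requested by `rule all`. The file paths and shell command below are illustrative:

```snakemake
# Snakefile: hypothetical two-rule workflow (paths are illustrative)
rule all:
    input:
        "results/summary.txt"

rule summarize:
    input:
        "data/raw.csv"
    output:
        "results/summary.txt"
    shell:
        "wc -l {input} > {output}"
```

Running `snakemake --cores 1` builds `results/summary.txt` only if it is missing or older than its input, mirroring the up-to-date tracking described above.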

Nextflow Data-driven Computational Pipelines

Nextflow enables scalable and reproducible scientific workflows using software containers. It allows the adaptation of pipelines written in the most common scripting languages. Its fluent DSL simplifies the implementation and the deployment of complex parallel and reactive workflows on clouds and clusters.3

For documentation on Nextflow data-driven computational pipelines, see the Nextflow Documentation.
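As a sketch of Nextflow's dataflow style, a DSL2 script wires channels into processes; each process runs in parallel over the items its input channel emits. The process name, glob pattern, and command below are illustrative:

```nextflow
// main.nf: hypothetical DSL2 workflow (names and paths are illustrative)
nextflow.enable.dsl = 2

process COUNT_LINES {
    input:
    path infile

    output:
    path "counts.txt"

    script:
    """
    wc -l ${infile} > counts.txt
    """
}

workflow {
    files = Channel.fromPath('data/*.csv')  // emits one item per matching file
    COUNT_LINES(files)                      // runs once per file, in parallel
}
```

Running `nextflow run main.nf` executes the workflow; adding a container directive or `-with-docker` lets the same script run unchanged on clusters and clouds, as described above.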
