Workflows and Pipelines
Introduction
Workflows and pipelines are essential tools for automating and streamlining data analysis and computational tasks. They help researchers and developers manage complex processes, track dependencies, and ensure reproducibility. By defining a series of steps and dependencies, workflows and pipelines can be executed sequentially or in parallel, enabling efficient data processing and analysis.
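As a rough illustration of how declared dependencies determine what can run sequentially and what can run in parallel, the sketch below uses Python's standard-library `graphlib` module. The step names (`download`, `clean`, `model`, `plot`, `report`) and the `execution_batches` helper are hypothetical, invented for this example, and are not taken from any of the tools described on this page.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each key depends on every step in its set.
dependencies = {
    "clean": {"download"},
    "model": {"clean"},
    "plot": {"clean"},
    "report": {"model", "plot"},
}


def execution_batches(deps):
    """Group steps into ordered batches.

    Steps within one batch do not depend on each other, so a scheduler
    could run them in parallel; batches themselves run in sequence.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises CycleError if the dependencies are circular
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps whose dependencies are done
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = execution_batches(dependencies)
for number, batch in enumerate(batches, start=1):
    print(number, batch)
```

Here `model` and `plot` land in the same batch because both depend only on `clean`, while `report` must wait for both. Real workflow tools add scheduling, logging, and failure handling on top of this ordering idea.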
For a list of possible tools and frameworks for building workflows and pipelines, see the
Below are some tools and frameworks used by NIEHS scientific developers.
targets
R Package
The targets package is a Make-like pipeline tool for statistics and data science in R. The package skips costly runtime for tasks that are already up to date, orchestrates the necessary computation with implicit parallel computing, and abstracts files as R objects. If all the current output matches the current upstream code and data, then the whole pipeline is up to date, and the results are more trustworthy than otherwise.1
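The idea of skipping tasks that are already up to date can be sketched in a few lines. The example below is written in Python purely for illustration; it is not the targets API and makes no claim about how targets is implemented. The `Cache` class, the `clean` and `total` steps, and the fingerprinting scheme (hashing a function's bytecode and constants together with its JSON-serialized inputs) are all assumptions made for this sketch.

```python
import hashlib
import json


class Cache:
    """In-memory store that reruns a step only when its code or inputs change."""

    def __init__(self):
        self.store = {}  # step name -> (fingerprint, value)
        self.ran = []    # names of steps that actually executed

    def run(self, name, func, *inputs):
        # Assumption: inputs are JSON-serializable and the function's
        # bytecode plus constants is a good-enough proxy for "the code".
        code = func.__code__
        payload = (
            code.co_code
            + repr(code.co_consts).encode()
            + json.dumps(inputs).encode()
        )
        key = hashlib.sha256(payload).hexdigest()
        cached = self.store.get(name)
        if cached is not None and cached[0] == key:
            return cached[1]  # up to date: skip the computation
        value = func(*inputs)
        self.store[name] = (key, value)
        self.ran.append(name)
        return value


def clean(raw):
    kept = []
    for x in raw:
        if x is not None:
            kept.append(x)
    return kept


def total(values):
    return sum(values)


def pipeline(cache, raw):
    cleaned = cache.run("clean", clean, raw)
    return cache.run("total", total, cleaned)


cache = Cache()
first = pipeline(cache, [1, None, 2])   # both steps run
ran_first = list(cache.ran)
cache.ran.clear()

second = pipeline(cache, [1, None, 2])  # nothing changed: both steps skipped
ran_second = list(cache.ran)
cache.ran.clear()

third = pipeline(cache, [None, 1, 2])   # raw data changed, cleaned data did not
ran_third = list(cache.ran)
```

In the third run only `clean` executes: its input changed, but its output is identical to before, so the downstream `total` step is still considered up to date. This sketch keeps everything in memory and does not handle files, persistence between sessions, or parallel execution.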
For documentation on the targets R package, see The {targets} R package user manual.
snakemake
Workflow Management System
The Snakemake workflow management system is a tool to create reproducible and scalable data analyses. Workflows are described via a human readable, Python based language. They can be seamlessly scaled to server, cluster, grid and cloud environments, without the need to modify the workflow definition. Finally, Snakemake workflows can entail a description of required software, which will be automatically deployed to any execution environment.2
For documentation on the snakemake workflow management system, see the Snakemake Documentation.
Nextflow
Data-driven Computational Pipelines
Nextflow enables scalable and reproducible scientific workflows using software containers. It allows the adaptation of pipelines written in the most common scripting languages. Its fluent DSL simplifies the implementation and the deployment of complex parallel and reactive workflows on clouds and clusters.3
For documentation on Nextflow data-driven computational pipelines, see the Nextflow Documentation.