If you're building a normal data pipeline with Python, Airflow is the de facto choice: it was adopted by the Apache Software Foundation after its inception at Airbnb, and it has thousands of contributors.
If you're building a bioinformatics data pipeline in Python, the choice is less clear. Despite a massive gap in overall popularity, the much smaller Snakemake is the more common choice. Even though Airflow has over 10x more stars than Snakemake on GitHub, around 40% of bioinformatics pipelines are written in Snakemake, with Airflow hovering between 1% and 5%. If you're not sure why bioinformatics workflows are different and why this discrepancy exists, I highly recommend Ben Siranosian's blog post on the topic.
At a glance, Airflow's maturity, ecosystem of integrations, and a best-practices implementation win over many developers, but others shy away from Airflow due to its steep learning curve and complexity. Snakemake attracts those looking for a bioinformatics-specific approach, but it lacks the broad scope and depth of the Airflow ecosystem.
Snakemake pipelines are built around input and output files, which makes building a bioinformatics pipeline easier. In comparison, Airflow treats files as side effects and expects an explicitly structured pipeline – using more complex architecture and syntax.
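For example, a minimal Snakemake rule just declares its input and output files, and Snakemake infers execution order from the file dependencies. (The file names and the alignment command here are illustrative, not from a real pipeline.)

```
rule align:
    input:
        "data/{sample}.fastq"
    output:
        "aligned/{sample}.bam"
    shell:
        "bwa mem ref.fa {input} | samtools sort -o {output}"
```

Asking Snakemake for `aligned/sample1.bam` is enough for it to work out that this rule must run first; an equivalent Airflow DAG requires the task, its dependencies, and any file handling to be spelled out explicitly.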
Anecdotally, it seems like whether you see yourself as a software engineer or bioinformatician predicts which of the two workflow managers you choose.
Do you see yourself as a software engineer? Are you uninterested in exploring bioinformatics?
Between Airflow and Snakemake, Airflow is probably the better choice. It's used extensively outside of bioinformatics, and it adheres to standard software engineering practices more consistently than Snakemake.
Do you see yourself as a bioinformatician? Do you want something built for your day-to-day?
Snakemake is probably the better choice. It's much easier to use for bioinformatics, especially for a small team without infrastructure engineering experience.
Do you see yourself as both a software engineer and a bioinformatician? Or are you still unsure about what makes the most sense for you?
Common benefits and limitations
Both Airflow and Snakemake help with writing reliable pipelines, especially when the alternative is a collection of cobbled-together Python and shell scripts.
Both also have limitations. Some of these come from the inherent complexity of building a pipeline, and some from similar design choices made by the creators of both tools.
Benefits of using Airflow or Snakemake
- Your pipeline is nearly guaranteed to be more reliable than using pure Python.
- Support for reproducible environments through containerization (e.g. using Docker images).
- Both are open-source with active communities.
- Both are built with Python, which brings Python's benefits: a familiar language, readable code, and a huge package ecosystem.
Limitations of using Airflow and Snakemake
- It's difficult to use either for dynamically generated workflows, especially comparing Airflow to Prefect or Snakemake to Nextflow.
- Both have built-in support for offloading tasks to the cloud, but robustness is dependent on your scale and implementation.
- Built-in observability and developer tooling leave room for improvement.
- They both use Python, which brings Python's limitations, like slower raw performance and painful dependency management.
Airflow: dealbreakers, slowdowns, unique benefits, and quirks
Airflow is the most popular and established workflow manager, with thousands of contributors and Apache backing. It's popular across many software engineering domains, and it isn't specific to bioinformatics.
- Airflow has a steep learning curve and complex infrastructure setup.
- Its infrastructural complexity makes sharing pipelines between different environments difficult.
- There's a lack of support for simple cluster execution (e.g. with SLURM).
- Writing standard bioinformatics pipelines often requires tooling beyond Airflow itself.
- Task execution has a high latency overhead; running hundreds of seconds-long tasks doesn't work well with Airflow.
- Airflow lacks first-class Conda support.
Unique benefits of Airflow
- There's strong community support, detailed documentation, and many unofficial tutorials.
- A deep set of features and integrations, including enterprise features like audit logs and authentication, which Snakemake doesn't come close to matching.
- The longer history of companies using Airflow in production has ironed out most operational issues.
Quirks of Airflow
- Setup and maintenance cost is high due to its complexity and breadth.
- Configuring parallelism is comparatively opaque and easy to get wrong.
- Airflow uses separate executors to deploy tasks, and the three most common executors at scale – Celery, Dask, and Kubernetes – can add significant complexity.
- There are tight Google Cloud and Astronomer integrations, but there aren't many other managed cloud options.
Snakemake: dealbreakers, slowdowns, unique benefits, and quirks
Snakemake is the spiritual child of Python and GNU make (hence the name "snakemake"). Pipelines are structured similarly to GNU makefiles, but with additional features built with bioinformatics pipelines in mind.
- Snakemake is a smaller, quirkier project than Airflow, with fewer contributors.
- Pipelines are defined by the files each step produces, which is clunky for steps that don't produce files.
- Built-in execution support is more polished for cluster execution than cloud execution. There's no native support for running pipelines in a managed service in the cloud (which we patch with FlowDeploy).
An aside: cloud and cluster support with Snakemake
A few years ago, cloud and cluster support was much better in Nextflow, and many chose Nextflow over Snakemake solely for that reason. The ecosystem around Snakemake has caught up since then, and it's no longer a clear dealbreaker.
- Documentation lacks short and straightforward examples.
- Some rough or "beta quality" implementations, especially when compared to Airflow.
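To illustrate the files-first clunkiness mentioned above: a step with no natural output, like uploading results, is usually forced to create an empty sentinel file so Snakemake can track it. (The paths and upload command here are hypothetical.)

```
rule upload_results:
    input:
        "aligned/sample1.bam"
    output:
        touch("uploaded/sample1.done")  # empty marker file, exists only for dependency tracking
    shell:
        "aws s3 cp {input} s3://my-bucket/results/"
```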
Unique benefits of Snakemake
- Its smaller learning curve and simplicity make it easier to get started and share pipelines with others.
- Snakemake has a bioinformatics-first approach, with features that make common bioinformatics tasks easier:
- First-class support for popular bioinformatics tools that engineers deem "hacky", such as Jupyter notebooks, shell scripts, and R scripts.
- Files-first approach that naturally fits with NGS-driven bioinformatics pipelines.
- Integrations with common bioinformatics tools like aligners, assemblers, classifiers, etc.
- Pre-built Snakemake Workflows for common bioinformatics use-cases.
- Built-in task caching makes retrying failed pipelines significantly faster.
- Snakemake's pipeline definition has a familiar structure for anyone who knows GNU make.
Quirks of Snakemake
- It's maintained by an academic group in Germany. That's a very different structure from the Airbnb-born and Apache-backed Airflow.
- Snakemake doesn't really follow expected software engineering practices, and lends support to patterns that are considered hacky – such as running a Jupyter notebook inside a Conda environment as a pipeline step. That helps bioinformaticians build pipelines faster, but it's an anti-pattern in software engineering and can make maintaining pipelines a pain.
- There's an over-reliance on Conda and Singularity for dependency management. This makes sense in an academic context, but isn't the norm in industry.
- Limited UI options.
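The task caching mentioned above follows the same freshness rule as GNU make: a step reruns only if an output is missing or older than its newest input. Here's a minimal sketch of that check in plain Python; this is a simplified model, not Snakemake's actual implementation (which also considers things like code and parameter changes).

```python
import os

def needs_run(inputs, outputs):
    """Return True if any output is missing or older than the newest input."""
    if not all(os.path.exists(o) for o in outputs):
        return True  # a missing output always forces a rerun
    newest_input = max(os.path.getmtime(i) for i in inputs)
    oldest_output = min(os.path.getmtime(o) for o in outputs)
    return newest_input > oldest_output
```

When a pipeline fails halfway through, steps whose outputs already exist and are up to date are skipped on retry, which is why rerunning a failed Snakemake pipeline is fast.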
Making bioinformatics workflow managers like Snakemake more capable
FlowDeploy makes bioinformatics workflow managers like Snakemake cloud native. It's agnostic to your choice of workflow language, but adds custom-tailored compute, data management, and observability features.
Disclosure: this section promotes our paid product, but I genuinely believe it changes the Airflow vs. Snakemake calculus.
Eliminates some dealbreakers
- Adds cloud deployments at scale for Snakemake. (This is the only fully managed way to run Snakemake in the cloud, as of this article, without losing key benefits of Snakemake.)
- Support, with an SLA.
But it has limitations
- It only works with cloud computing. That means it doesn't work for an on-prem SLURM cluster (yet).
- It's a paid tool, unless you manage your own infrastructure under the hood and are in academia.
The debate between Airflow and Snakemake largely boils down to an established workflow manager that follows software engineering best practices (Airflow) versus a domain-specific workflow manager for bioinformatics (Snakemake). Both are used in production for running bioinformatics pipelines, so the choice often comes down to personal preference.
The best way to decide what to use is to try them out!