How to use Snakemake expand
Snakemake expand, a deep dive
Snakemake's expand function is key to managing file names and paths efficiently in Snakemake, especially in workflows
that handle NGS data. It combines a pattern with lists of wildcard values to create a list of strings, which is useful
for defining rule inputs and outputs.
In this deep dive, we'll look closer at Snakemake's expand functionality and usage.
How expand works
- You define a pattern using braces {} where wildcards will be inserted. For example, "{dataset}/a.{extension}" is a pattern where dataset and extension are wildcards. In Snakemake, this looks like expand("{dataset}/a.{extension}", dataset=DATASETS, extension=EXTENSIONS).
- Snakemake substitutes the wildcard values from the lists you provide. If DATASETS = ['ds1', 'ds2'] and EXTENSIONS = ['txt', 'csv'], it replaces {dataset} and {extension} with each item from these lists.
- expand returns a list of strings generated by substituting wildcards in the pattern. With our example, you'll get the list ["ds1/a.txt", "ds1/a.csv", "ds2/a.txt", "ds2/a.csv"].
By default, expand uses the Cartesian product of the wildcard lists, combining every item from DATASETS with every
item from EXTENSIONS in our example. You can replace the default with another combinatorial function, such as zip.
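The substitution steps above can be approximated in plain Python. The sketch below is a simplified stand-in, not Snakemake's actual implementation: expand_sketch is a hypothetical helper that assumes a single string pattern and wildcard values passed as lists.

```python
from itertools import product

def expand_sketch(pattern, combinator=product, **wildcards):
    """Hypothetical, simplified imitation of expand for one string pattern."""
    names = list(wildcards)                      # wildcard names, in keyword order
    value_lists = [wildcards[n] for n in names]  # one list of values per wildcard
    # The combinator yields one tuple of values per output path.
    return [
        pattern.format(**dict(zip(names, combo)))
        for combo in combinator(*value_lists)
    ]

DATASETS = ["ds1", "ds2"]
EXTENSIONS = ["txt", "csv"]

paths = expand_sketch("{dataset}/a.{extension}", dataset=DATASETS, extension=EXTENSIONS)
print(paths)
```

Passing combinator=zip to this sketch pairs the values position by position instead of forming every combination.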
Examples
Basic usage of expand
This uses expand to generate file paths with only one wildcard, dataset:
DATASETS = ["dataset1", "dataset2"]
expand("{dataset}/a.txt", dataset=DATASETS)
Results in:
["dataset1/a.txt", "dataset2/a.txt"]
expand with multiple wildcards
Adding another wildcard, extension, produces the Cartesian product of the two wildcard lists:
DATASETS = ["dataset1", "dataset2"]
EXTENSIONS = ["txt", "csv"]
expand("{dataset}/a.{extension}", dataset=DATASETS, extension=EXTENSIONS)
Results in:
["dataset1/a.txt", "dataset1/a.csv", "dataset2/a.txt", "dataset2/a.csv"]
Using lists as the first argument to expand
You can provide a list of patterns as the first argument:
DATASETS = ["dataset1", "dataset2"]
EXTENSIONS = ["txt", "csv"]
expand(["{dataset}/a.{extension}", "{dataset}/b.{extension}"], dataset=DATASETS, extension=EXTENSIONS)
Results in:
["dataset1/a.txt", "dataset1/b.txt", "dataset2/a.txt", "dataset2/b.txt",
"dataset1/a.csv", "dataset1/b.csv", "dataset2/a.csv", "dataset2/b.csv"]
Using custom functions with expand
Change the combination behavior by passing a different combinatoric function, such as zip, as the second argument:
DATASETS = ["dataset1", "dataset2"]
EXTENSIONS = ["txt", "csv"]
expand(["{dataset}/a.{extension}", "{dataset}/b.{extension}"], zip, dataset=DATASETS, extension=EXTENSIONS)
Results in:
["dataset1/a.txt", "dataset1/b.txt", "dataset2/a.csv", "dataset2/b.csv"]
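To see why zip yields fewer paths than the default, it can help to compare the two combinators in plain Python, independent of Snakemake. This is only an illustration of the pairing logic; it does not call expand itself.

```python
from itertools import product

DATASETS = ["dataset1", "dataset2"]
EXTENSIONS = ["txt", "csv"]

# Cartesian product: every dataset is paired with every extension.
all_pairs = list(product(DATASETS, EXTENSIONS))

# zip: the i-th dataset is paired only with the i-th extension.
matched_pairs = list(zip(DATASETS, EXTENSIONS))

print(all_pairs)
print(matched_pairs)
```

Note that Python's built-in zip stops at the shortest input, so with zip-style pairing the lists should normally be the same length.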
Masking wildcards
Using double braces {{}}, you can keep part of the pattern as a wildcard for later resolution. This is useful when
you want to generate file names whose placeholders are filled in at a later stage of the workflow.
To keep a wildcard expression as a literal, wrap it in another set of curly braces:
expand("{{dataset}}/a.{ext}", ext=EXTENSIONS)
Results in:
["{dataset}/a.txt", "{dataset}/a.csv"]
This list can then be used as the patterns for another wildcard expansion.
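Python's own str.format treats doubled braces in the same way, so the two-stage idea can be sketched without Snakemake. This is an analogy under that assumption, not Snakemake's implementation, and the value "dataset1" is just an illustrative placeholder.

```python
EXTENSIONS = ["txt", "csv"]

# Stage 1: fill in {ext}; the doubled braces survive as a literal {dataset}.
templates = ["{{dataset}}/a.{ext}".format(ext=ext) for ext in EXTENSIONS]

# Stage 2: the remaining {dataset} placeholder is filled in later.
resolved = [t.format(dataset="dataset1") for t in templates]

print(templates)
print(resolved)
```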
Tips
Debug heavily with print statements
If your expand statements get complicated, print the output before using it in a rule. This can help you catch any
unexpected combinations or errors in the list generation.
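Beyond a bare print, a few automated checks can make problems easier to spot. The following is a minimal sketch: check_paths is a hypothetical helper, and the hand-written paths list stands in for whatever your expand call returns.

```python
from collections import Counter

def check_paths(paths, expected_count):
    """Hypothetical helper: print a generated list and report common problems."""
    for p in paths:
        print(p)
    duplicates = [p for p, n in Counter(paths).items() if n > 1]
    # Leftover braces may be intentional if you masked a wildcard on purpose.
    unresolved = [p for p in paths if "{" in p or "}" in p]
    problems = []
    if len(paths) != expected_count:
        problems.append(f"expected {expected_count} paths, got {len(paths)}")
    if duplicates:
        problems.append(f"duplicates: {duplicates}")
    if unresolved:
        problems.append(f"unresolved wildcards: {unresolved}")
    return problems

# Stand-in for the list an expand call would return (hand-written here).
paths = ["dataset1/a.txt", "dataset1/a.csv", "dataset2/a.txt", "dataset2/a.csv"]
problems = check_paths(paths, expected_count=4)
print(problems or "no problems found")
```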
You can combine wildcards with constants
If you have a constant part of a file name, include it directly in the string:
SAMPLES = ['a', 'b']
expand("project_one_sample_{sample}.fastq", sample=SAMPLES)