How to use Snakemake expand

Snakemake's expand function is key for managing file names and paths efficiently, especially in workflows that handle NGS data. It combines a pattern with lists of wildcard values to create lists of file names, which is useful when defining rule inputs and outputs.

In this deep dive, we'll look closer at Snakemake's expand functionality and usage.

How expand works

  1. You define a pattern using braces {} where wildcards will be inserted. For example, "{dataset}/a.{extension}" is a pattern where dataset and extension are wildcards.

In Snakemake, this looks like:

expand("{dataset}/a.{extension}", dataset=DATASETS, extension=EXTENSIONS)

  2. Snakemake substitutes the wildcard values from the lists you provide. If DATASETS = ['ds1', 'ds2'] and EXTENSIONS = ['txt', 'csv'], it replaces {dataset} and {extension} with each item from these lists.

  3. expand returns a list of strings generated by substituting wildcards in the pattern. With our example, you'll get the list ["ds1/a.txt", "ds1/a.csv", "ds2/a.txt", "ds2/a.csv"].


By default, expand uses the Cartesian product of the wildcard lists, pairing every item in DATASETS with every item in EXTENSIONS in our example. You can replace the default with another combinatoric function, such as zip, by passing it as the second argument.
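To make the substitution steps concrete, here is a minimal pure-Python sketch of the same idea. The function expand_sketch is a hypothetical stand-in written for illustration, not Snakemake's actual implementation; it assumes a single string pattern and uses only the standard library, so it runs without Snakemake installed.

```python
from itertools import product


def expand_sketch(pattern, combinator=product, **wildcards):
    """Toy stand-in for expand (hypothetical helper, single string pattern only)."""
    # Tag each value with its wildcard name, giving one list of (name, value) pairs per wildcard.
    tagged = [[(name, value) for value in values] for name, values in wildcards.items()]
    # The combinator decides which tagged values end up together in one file name.
    return [pattern.format(**dict(combo)) for combo in combinator(*tagged)]


DATASETS = ["ds1", "ds2"]
EXTENSIONS = ["txt", "csv"]

paths = expand_sketch("{dataset}/a.{extension}", dataset=DATASETS, extension=EXTENSIONS)
print(paths)  # ['ds1/a.txt', 'ds1/a.csv', 'ds2/a.txt', 'ds2/a.csv']
```

Passing zip as the second positional argument to this sketch, as in expand_sketch("{dataset}/a.{extension}", zip, dataset=DATASETS, extension=EXTENSIONS), pairs the lists by position instead and yields only ds1/a.txt and ds2/a.csv.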


Basic usage of expand

This uses expand to generate file paths with only one wildcard, dataset:

DATASETS = ["dataset1", "dataset2"]

expand("{dataset}/a.txt", dataset=DATASETS)

Results in:

["dataset1/a.txt", "dataset2/a.txt"]

expand with multiple wildcards

Adding another wildcard, extension, produces the Cartesian product of the two wildcard lists:

DATASETS = ["dataset1", "dataset2"]
EXTENSIONS = ["txt", "csv"]

expand("{dataset}/a.{extension}", dataset=DATASETS, extension=EXTENSIONS)

Results in:

["dataset1/a.txt", "dataset1/a.csv", "dataset2/a.txt", "dataset2/a.csv"]

Using lists as the first argument to expand

You can provide a list of patterns as the first argument:

DATASETS = ["dataset1", "dataset2"]
EXTENSIONS = ["txt", "csv"]

expand(["{dataset}/a.{extension}", "{dataset}/b.{extension}"], dataset=DATASETS, extension=EXTENSIONS)

Results in:

["dataset1/a.txt", "dataset1/b.txt", "dataset2/a.txt", "dataset2/b.txt",
"dataset1/a.csv", "dataset1/b.csv", "dataset2/a.csv", "dataset2/b.csv"]

Using custom functions with expand

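Change the combination behavior by passing a different combinatoric function, such as zip, as the second argument. zip pairs values by position (first dataset with first extension, second with second) instead of forming every combination: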

DATASETS = ["dataset1", "dataset2"]
EXTENSIONS = ["txt", "csv"]

expand(["{dataset}/a.{extension}", "{dataset}/b.{extension}"], zip, dataset=DATASETS, extension=EXTENSIONS)

Results in:

["dataset1/a.txt", "dataset1/b.txt", "dataset2/a.csv", "dataset2/b.csv"]
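The difference between the two combinators is easiest to see on the raw value pairs, before any pattern is filled in. The sketch below uses only Python's itertools.product and the built-in zip; the three-item list is a made-up example to show how plain zip treats lists of unequal length.

```python
from itertools import product

DATASETS = ["dataset1", "dataset2"]
EXTENSIONS = ["txt", "csv"]

# Default behavior: every dataset is combined with every extension.
all_pairs = list(product(DATASETS, EXTENSIONS))
print(all_pairs)

# zip: values are matched by position, so each dataset gets exactly one extension.
matched_pairs = list(zip(DATASETS, EXTENSIONS))
print(matched_pairs)

# Python's built-in zip stops at the shorter input without raising an error.
truncated = list(zip(["dataset1", "dataset2", "dataset3"], EXTENSIONS))
print(len(truncated))  # 2
```

Because the built-in zip drops unmatched items silently, it is worth checking that your wildcard lists have the same length before zipping them. How expand itself reacts to a length mismatch may depend on your Snakemake version.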

Masking wildcards

Using double braces {{}}, you can keep a part of the pattern as a wildcard for later resolution. expand leaves the wrapped name unsubstituted and emits it with single braces, which is useful when you want to generate file names with placeholders to be filled in at later stages of the workflow:

expand("{{dataset}}/a.{ext}", ext=EXTENSIONS)

Results in:

["{dataset}/a.txt", "{dataset}/a.csv"]

which can then be used as an input list for another wildcard expansion.
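The double-brace escape follows the same convention as Python's own str.format, so the two-stage substitution can be sketched without Snakemake. This example uses plain str.format as a stand-in for expand, and the value "dataset1" in the second pass is an arbitrary illustration.

```python
EXTENSIONS = ["txt", "csv"]

# First pass: {ext} is filled in, while {{dataset}} collapses to a literal {dataset}.
masked = ["{{dataset}}/a.{ext}".format(ext=ext) for ext in EXTENSIONS]
print(masked)  # ['{dataset}/a.txt', '{dataset}/a.csv']

# Second pass: the surviving {dataset} placeholder can now be resolved.
resolved = [pattern.format(dataset="dataset1") for pattern in masked]
print(resolved)  # ['dataset1/a.txt', 'dataset1/a.csv']
```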


Debug heavily with print statements

If your expand statements get complicated, print the output before using it in a rule. This can help you catch any unexpected combinations or errors in the list generation.
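One way to do that inspection is a small reporting helper. The summarize function below is a hypothetical helper, not part of Snakemake, and the hard-coded list stands in for whatever your expand call returns so that the sketch runs on its own.

```python
from collections import Counter

# Stand-in for the list an expand call would return (hard-coded for illustration).
paths = ["dataset1/a.txt", "dataset1/a.csv", "dataset2/a.txt", "dataset1/a.txt"]


def summarize(paths):
    """Print every generated path and return those that appear more than once."""
    print(f"{len(paths)} paths generated")
    for path in paths:
        print("  ", path)
    duplicates = sorted(p for p, count in Counter(paths).items() if count > 1)
    if duplicates:
        print("duplicates:", duplicates)
    return duplicates


duplicates = summarize(paths)
```

In a Snakefile, the simplest version of this is a plain print(expand(...)) near the top of the file, removed once the list looks right.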

You can combine wildcards with constants

If you have a constant part of a file name, include it directly in the string:

SAMPLES = ["a", "b"]

expand("project_one_sample_{sample}.fastq", sample=SAMPLES)