How to use Snakemake wildcards
Snakemake wildcards, a deep dive
Wildcards in Snakemake are key for making a pipeline's rules work across datasets and file names. In this deep dive, we'll look closer at Snakemake's wildcard functionality and usage.
How wildcards work
The basics
In Snakemake, wildcards are most often used to generalize rules so that one rule can apply to many file names. A rule with wildcards matches a whole family of file patterns instead of being tied to specific file names.
Mechanism
When a wildcard-enabled rule is included in a pipeline's DAG, Snakemake dynamically assigns values to the placeholders based on the required output. The assignment happens during the "DAG creation" phase, after Snakemake has parsed the rules during the "initialization" phase.
Wildcard placeholders can be used in rule directives (e.g. input: data/{sample}.fastq) and are evaluated during DAG creation. To access the value of a placeholder (e.g. if ({wildcards.sample}) ...), you need to use a function.
Wildcards can be constrained using regular expressions, which limits what they match.
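To make the mechanism concrete, here is a minimal Python sketch of how an output pattern can be matched against a requested file to recover wildcard values. It is a simplified model, not Snakemake's actual implementation: the helper names pattern_to_regex and match_target are made up for illustration, and the sketch assumes each wildcard name appears only once in a pattern. The one detail taken from Snakemake itself is that an unconstrained wildcard behaves like the regular expression .+.

```python
import re


def pattern_to_regex(pattern):
    """Turn a pattern such as 'results/{sample}.txt' into a compiled regex.

    Hypothetical helper: every {name} placeholder becomes a named group
    matching '.+', and all other text is matched literally.
    """
    parts = []
    pos = 0
    for placeholder in re.finditer(r"\{(\w+)\}", pattern):
        # Literal text between placeholders must match exactly.
        parts.append(re.escape(pattern[pos:placeholder.start()]))
        parts.append(f"(?P<{placeholder.group(1)}>.+)")
        pos = placeholder.end()
    parts.append(re.escape(pattern[pos:]))
    return re.compile("".join(parts))


def match_target(pattern, target):
    """Return the wildcard values for a requested file, or None if no match."""
    match = pattern_to_regex(pattern).fullmatch(target)
    return match.groupdict() if match else None


print(match_target("results/{sample}.txt", "results/A.txt"))  # {'sample': 'A'}
print(match_target("results/{sample}.txt", "data/A.fastq"))   # None
```

The important point is the direction of the lookup: the requested output file comes first, and the wildcard values fall out of matching it against a rule's output pattern.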
Examples
Basic wildcard usage
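SAMPLES = ["A", "B"]  # hypothetical sample names, for illustration

rule all:
    input:
        expand("results/{sample}.txt", sample=SAMPLES)

rule process_sample:
    input:
        "data/{sample}.fastq"
    output:
        "results/{sample}.txt"
    shell:
        "process_data {input} > {output}"
In this basic example, the target rule all requests one concrete file per entry in SAMPLES (results/A.txt and results/B.txt). A target rule has no output from which wildcard values could be inferred, so it cannot contain a bare {sample} placeholder; expand fills the placeholder in instead. Snakemake then matches each requested file against the output pattern of process_sample and assigns {sample} the part of the file name between "results/" and the extension ".txt".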
Constraining wildcards
wildcard_constraints:
    sample=r"\d+"

rule process_sample:
    input:
        "data/{sample}.fastq"
    output:
        "results/{sample}.txt"
    shell:
        "process_data {input} > {output}"
With the \d+ constraint, {sample} matches only digits, which limits the files that are processed to those whose sample names consist purely of digits.
The constraint can also be added immediately after the wildcard name, as {sample,\d+}.
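The effect of the constraint can be checked outside Snakemake with Python's re module. This is only a sketch: the regular expression below is a hand-written stand-in for the output pattern results/{sample}.txt with {sample} constrained to \d+, and sample_for and the file names are made up for illustration.

```python
import re

# Hand-written equivalent of "results/{sample}.txt" with sample constrained to \d+.
RULE_OUTPUT = re.compile(r"results/(?P<sample>\d+)\.txt")


def sample_for(path):
    """Return the sample value if the constrained pattern matches, else None."""
    match = RULE_OUTPUT.fullmatch(path)
    return match.group("sample") if match else None


# Hypothetical requested files.
for path in ["results/001.txt", "results/42.txt", "results/liver.txt", "results/7b.txt"]:
    print(path, "->", sample_for(path))
# results/001.txt -> 001
# results/42.txt -> 42
# results/liver.txt -> None
# results/7b.txt -> None
```

Files whose sample part is not purely digits simply do not match the rule, so Snakemake would need another rule to produce them.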
Multiple wildcards, and using wildcards in the shell directive
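rule process_sample:
    input:
        "data/{sample}.{extension}"
    output:
        "results/{sample}.{extension}.txt"
    shell:
        "process_data --filetype {wildcards.extension} < {input} > {output}"
You can have multiple wildcards in the same rule, and can access them from the shell directive as {wildcards.extension}. Every wildcard used in input must also appear in output, because Snakemake derives wildcard values from the requested output file; that is why {extension} is part of the output file name here.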
Using wildcard values with input functions
Sometimes, the simple {some_name} wildcard placeholder syntax is insufficient for making rules generic across file names. Input functions, which are passed a wildcards object holding the values for the current job, solve this problem.
Input functions are evaluated during the "DAG creation" phase of execution: after initialization and determination of wildcard values, but before job submission. This means input functions can access the wildcard values for an instance of a rule.
In this example, we pass a function to the input directive; Snakemake then passes a wildcards argument, which contains the values of all wildcards, to the get_sample_fastq function.
[..]
def get_sample_fastq(wildcards):
    # Assume find_fastq_path is a function that returns the FASTQ file path for any sample
    return find_fastq_path(wildcards.sample)

rule process_sample:
    input:
        get_sample_fastq
    output:
        "results/{sample}.txt"
    shell:
        "process_data {input} > {output}"
This only works if the input function returns a valid input for every sample name. If you want a rule to run for only some samples, you'll need to use a Snakemake checkpoint.
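One possible shape for find_fastq_path is a plain dictionary lookup, sketched below. Everything here is an assumption for illustration: the FASTQ_PATHS table and its paths are invented, and types.SimpleNamespace merely stands in for the wildcards object so the input function can be exercised outside Snakemake.

```python
from types import SimpleNamespace

# Hypothetical sample sheet: sample name -> FASTQ path.
FASTQ_PATHS = {
    "A": "data/run1/A_R1.fastq",
    "B": "data/run2/B_R1.fastq",
}


def find_fastq_path(sample):
    """Look up the FASTQ path for a sample, failing loudly for unknown names."""
    try:
        return FASTQ_PATHS[sample]
    except KeyError:
        raise ValueError(f"No FASTQ file registered for sample {sample!r}") from None


def get_sample_fastq(wildcards):
    # Same input function as above; only attribute access on wildcards is needed.
    return find_fastq_path(wildcards.sample)


# Outside Snakemake, a SimpleNamespace can play the role of the wildcards object.
print(get_sample_fastq(SimpleNamespace(sample="A")))  # data/run1/A_R1.fastq
```

Raising an explicit error for unknown samples makes a missing entry visible while the DAG is being built, rather than as a missing-file failure later on.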
Tips
Avoid as much ambiguity as you can in wildcards
To avoid matching the wrong files, reduce ambiguity by adding specificity to your wildcard regex. For instance, if you know the file name must include a four-digit year, constrain the wildcard to that year format (e.g. "{date,\d{4}-.*}").
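The kind of ambiguity this tip guards against can be reproduced with Python's re module. In this sketch the file name and the pattern reports/{label}-{date}.txt are hypothetical, and the two regular expressions are hand-written stand-ins for that pattern without and with the {date,\d{4}-.*} constraint.

```python
import re

# Hypothetical target file and pattern "reports/{label}-{date}.txt".
target = "reports/run-a-2021-05-01.txt"

# Unconstrained: each wildcard behaves like ".+", so the split point is a guess.
loose = re.compile(r"reports/(?P<label>.+)-(?P<date>.+)\.txt")

# Constrained: the date must start with a four-digit year, as in {date,\d{4}-.*}.
strict = re.compile(r"reports/(?P<label>.+)-(?P<date>\d{4}-.*)\.txt")

loose_values = loose.fullmatch(target).groupdict()
strict_values = strict.fullmatch(target).groupdict()

print(loose_values)   # {'label': 'run-a-2021-05', 'date': '01'}
print(strict_values)  # {'label': 'run-a', 'date': '2021-05-01'}
```

Without the constraint the greedy first wildcard swallows most of the date; with it, there is only one way to split the file name.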
You can debug wildcard values with echo
echo is a hacky but quick way to debug wildcard values. For example:
    shell:
        "echo Processing dataset: {wildcards.dataset}"