How to use Snakemake wildcards

Snakemake wildcards, a deep dive

Wildcards in Snakemake are key for making a pipeline's rules work across datasets and file names. In this deep dive, we'll look closer at Snakemake's wildcard functionality and usage.

How wildcards work

The basics

In Snakemake, wildcards are most often used to generalize rules so that one rule can handle many file names. Instead of hard-coding specific file names, a rule describes a pattern that a broad range of files can match.

Mechanism

When a rule with wildcards is included in a pipeline's DAG, Snakemake assigns values to the placeholders based on the output files that were requested. The assignment happens during the "DAG creation" phase, which follows the "initialization" phase in which Snakemake parses the rules.

Wildcard placeholders can be used directly in rule directives (e.g. input: "data/{sample}.fastq") and are filled in during DAG creation. To work with a wildcard's value in Python code (e.g. to branch on wildcards.sample), you need to use a function, as described under "Using wildcard values with input functions".
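The matching step described above can be pictured with a small plain-Python sketch. This is not Snakemake's implementation, only an illustration that assumes each placeholder behaves like a named regex group matching `.+`. The helper names `pattern_to_regex`, `infer_wildcards` and `fill` are made up for this example.

```python
import re

def pattern_to_regex(pattern):
    """Turn 'results/{sample}.txt' into a regex with one named group per placeholder.

    Simplification: every placeholder matches '.+' and may appear only once.
    """
    parts = []
    pos = 0
    for m in re.finditer(r"\{(\w+)\}", pattern):
        parts.append(re.escape(pattern[pos:m.start()]))  # literal text before the placeholder
        parts.append(f"(?P<{m.group(1)}>.+)")            # the placeholder itself
        pos = m.end()
    parts.append(re.escape(pattern[pos:]))               # trailing literal text
    return re.compile("".join(parts))

def infer_wildcards(output_pattern, requested_file):
    """Return the wildcard values that make the pattern equal the requested file, or None."""
    m = pattern_to_regex(output_pattern).fullmatch(requested_file)
    return m.groupdict() if m else None

def fill(pattern, wildcards):
    """Substitute known wildcard values into another pattern, e.g. the input."""
    return pattern.format(**wildcards)

values = infer_wildcards("results/{sample}.txt", "results/A.txt")
print(values)                               # {'sample': 'A'}
print(fill("data/{sample}.fastq", values))  # data/A.fastq
```

The direction is the important part: values are derived from the requested output, then substituted into the input.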

Wildcards can be constrained using regular expressions, which limits what they match.

Examples

Basic wildcard usage

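# Hypothetical sample names, for illustration only
SAMPLES = ["A", "B"]

rule all:
    input:
        expand("results/{sample}.txt", sample=SAMPLES)

rule process_sample:
    input:
        "data/{sample}.fastq"
    output:
        "results/{sample}.txt"
    shell:
        "process_data {input} > {output}"

In this basic example, rule all requests one concrete file per sample. A target rule cannot contain unresolved wildcards, which is why expand is used to fill them in. For each requested file, such as results/A.txt, Snakemake matches the name against the output pattern of process_sample and assigns {sample} the part of the file name between "results/" and the ".txt" extension.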
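A rule with wildcards only runs once something asks for concrete files, so it helps to see how a list of concrete targets can be built from a pattern. Inside a Snakefile, Snakemake's own expand helper does this. The sketch below is a plain-Python stand-in written for illustration: `concrete_targets` and the sample names are hypothetical.

```python
from itertools import product

def concrete_targets(pattern, **values):
    """Fill a pattern with every combination of the given wildcard values."""
    names = list(values)
    combos = product(*(values[name] for name in names))
    return [pattern.format(**dict(zip(names, combo))) for combo in combos]

samples = ["A", "B"]  # hypothetical sample names
targets = concrete_targets("results/{sample}.txt", sample=samples)
print(targets)  # ['results/A.txt', 'results/B.txt']
```

Each resulting path is then matched against a rule's output pattern to work out the wildcard values for that job.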

Constraining wildcards

wildcard_constraints:
    sample=r"\d+"

rule process_sample:
    input:
        "data/{sample}.fastq"
    output:
        "results/{sample}.txt"
    shell:
        "process_data {input} > {output}"

By defining sample=r"\d+", {sample} matches only digits, which limits the files processed to those whose sample names consist purely of digits. The raw string (r"...") keeps Python from interpreting the backslash as an escape sequence.

Constraint shorthand syntax

The constraint can also be added immediately after the wildcard name, as {sample,\d+}.
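The effect of a constraint such as \d+ can be checked outside Snakemake with Python's re module. The sketch below assumes the wildcard behaves like a regex group inside the output pattern. `sample_from` is a hypothetical helper, not part of Snakemake.

```python
import re

def sample_from(path, constraint=r".+"):
    """Extract {sample} from 'results/{sample}.txt', or return None if it does not match."""
    m = re.fullmatch(rf"results/(?P<sample>{constraint})\.txt", path)
    return m.group("sample") if m else None

# Unconstrained: any non-empty name is accepted.
print(sample_from("results/control.txt"))          # control

# Constrained to digits: numeric names match, others are rejected.
print(sample_from("results/001.txt", r"\d+"))      # 001
print(sample_from("results/control.txt", r"\d+"))  # None
```

A rejected match means the rule is not considered a producer of that file at all, so another rule can claim it.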

Multiple wildcards, and using wildcards in the shell directive

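rule process_sample:
    input:
        "data/{sample}.{extension}"
    output:
        "results/{sample}.{extension}.txt"
    shell:
        "process_data --filetype {wildcards.extension} < {input} > {output}"

You can have multiple wildcards in the same rule, and can access them from the shell directive as {wildcards.name}. Every wildcard used in input must also appear in output, because Snakemake derives the values from the requested output file. This is why {extension} is part of the output path here.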

Using wildcard values with input functions

Sometimes, the simple {some_name} wildcard placeholder syntax is insufficient for making rules generic across file names. Input functions, which receive the wildcards object of the job being built, solve this problem.

Input functions are evaluated during the "DAG creation" phase of execution: after initialization and determination of wildcard values, but before job submission. This means an input function can access the wildcard values for a specific instance of a rule.

In this example, we pass a function to the input directive; Snakemake then passes a wildcards argument, which contains the values of all wildcards, to the get_sample_fastq function.

[..]

def get_sample_fastq(wildcards):
    # Assume find_fastq_path is a function that returns the FASTQ file path for any sample
    return find_fastq_path(wildcards.sample)

rule process_sample:
    input:
        get_sample_fastq
    output:
        "results/{sample}.txt"
    shell:
        "process_data {input} > {output}"

This only works if the input function returns a valid input for every sample name that Snakemake requests. If the set of samples to run is only known after part of the workflow has executed, you'll need to use a Snakemake checkpoint.
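To make the find_fastq_path assumption concrete, here is one way such a lookup might be written and exercised outside Snakemake. The `FASTQ_PATHS` table and its paths are invented for illustration, and `SimpleNamespace` stands in for the wildcards object that Snakemake would pass.

```python
from types import SimpleNamespace

# Hypothetical sample-to-file table; in practice this might come from a sample sheet.
FASTQ_PATHS = {
    "A": "data/run1/A.fastq",
    "B": "data/run2/B.fastq",
}

def find_fastq_path(sample):
    """Return the FASTQ path for a sample, failing loudly for unknown samples."""
    try:
        return FASTQ_PATHS[sample]
    except KeyError:
        raise ValueError(f"No FASTQ file registered for sample {sample!r}") from None

def get_sample_fastq(wildcards):
    """Input function: map the job's wildcard values to its input file."""
    return find_fastq_path(wildcards.sample)

# Stand-in for the wildcards object of a job producing results/A.txt
print(get_sample_fastq(SimpleNamespace(sample="A")))  # data/run1/A.fastq
```

Raising a clear error for unknown samples makes a bad request easier to trace than a missing-file failure later in the run.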

Tips

Avoid as much ambiguity as you can in wildcards

To avoid matching the wrong files, reduce ambiguity by adding specificity to your wildcard regex. For instance, if you know the file name must start with a four-digit year, constrain the wildcard to that format (e.g. "{date,\d{4}-.*}").
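To see why ambiguity matters, consider the two-wildcard pattern data/{sample}.{extension} and a file with two dots in its name. The sketch below assumes each wildcard behaves like a greedy `.+` regex group unless constrained. `split_name` is a hypothetical helper for illustration.

```python
import re

def split_name(path, sample=r".+", extension=r".+"):
    """Match 'data/{sample}.{extension}' with optional per-wildcard constraints."""
    regex = rf"data/(?P<sample>{sample})\.(?P<extension>{extension})"
    m = re.fullmatch(regex, path)
    return m.groupdict() if m else None

# Unconstrained: the greedy match puts 'fastq' into the sample name.
print(split_name("data/tumor.fastq.gz"))
# {'sample': 'tumor.fastq', 'extension': 'gz'}

# Constraining the sample to contain no dots gives the intended split.
print(split_name("data/tumor.fastq.gz", sample=r"[^.]+"))
# {'sample': 'tumor', 'extension': 'fastq.gz'}
```

A constraint that excludes the separator character is often enough to remove this kind of ambiguity.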

You can debug wildcard values with echo

echo is a hacky but quick way to debug wildcard values. For example:

shell:
    "echo Processing dataset: {wildcards.dataset}"