Snakemake resources, a deep dive
Snakemake's "resources" determine how jobs are scheduled within a computational environment. Some resources are consistent across different execution environments, while others are specific to an executor.
In this deep dive, we'll look closer at Snakemake's "resources" functionality and usage. Because different executors have different options, we won't dive too deep into executor-specific options.
How resources work in Snakemake
- Define required computational resources like memory (mem_mb) or disk space (disk_mb).
  - In a Snakefile, use the resources rule attribute.
  - On the command line, use the --resources option.
  - In configuration files, use the resources keyword.
- You can also define custom resources for specific needs, like API call limits (see the sketch just after this list).
- Snakemake distinguishes between two types of resources:
  - Local resources are specific to individual job or subtask submissions (e.g. memory, disk space). If a job is assigned 16GB of memory, that 16GB is reserved for that particular job.
  - Global resources apply across all jobs, and are helpful for global restrictions like API limits.
- Snakemake does its best to schedule jobs within the requested resources, but it does not actually restrict a task that exceeds its allocation.
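For example, a rate-limited API can be throttled with a custom resource. This is a minimal sketch; the api_calls resource name and the query_api.py script are hypothetical:
rule query_api:
    output: "results/{sample}.json"
    resources:
        api_calls=1  # hypothetical custom resource: each job consumes one "call slot"
    shell: "python query_api.py {wildcards.sample} > {output}"
Because custom resources are global by default, running snakemake --cores 8 --resources api_calls=5 caps the workflow at five concurrent API jobs.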
Examples
Basic usage
rule align:
    input: "data/sequences.fasta"
    output: "results/aligned.fasta"
    resources: mem_mb=32000
    shell: "aligner --input {input} --output {output}"
This rule requests 32 GB of memory for a sequence alignment task.
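You can also set fallback values for rules that don't request their own resources. A minimal sketch using Snakemake's --default-resources option (the values here are illustrative):
snakemake --cores 8 --default-resources mem_mb=4000 disk_mb=2000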
Global vs. local resources
When workflows submit jobs remotely, Snakemake manages resources in two different scopes:
- Global resources apply across all jobs, even those executing on different machines.
- Local resources apply only to a single job.
By default, only mem_mb, disk_mb, and threads are considered local resources. Any other resource, including new custom resources, is considered global by default. You can change the scoping of a resource with resource_scopes.
Snakefile:
resource_scopes:
    mem_mb="global"

rule download_data:
    output: "data/raw_data.csv"
    resources:
        mem_mb=1000
    shell: "download_script {output}"
Snakemake run:
snakemake --resources mem_mb=5000
In this example, the scope of mem_mb is changed from a local to a global resource restriction. With a global limit of 5000 megabytes, only five of these 1000-megabyte download jobs will run simultaneously, even when executed remotely.
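If you'd rather not edit the Snakefile, recent Snakemake versions also let you override scopes on the command line with --set-resource-scopes; a sketch of an equivalent run:
snakemake --resources mem_mb=5000 --set-resource-scopes mem_mb=global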
Executor-specific resource settings, like spot instances
rule align:
    input: "data/sequences.fasta"
    output: "results/aligned.fasta"
    resources:
        preemptible=True,  # <------ Added this
        mem_mb=32000
    shell: "aligner --input {input} --output {output}"
This updates the alignment job from earlier to use preemptible (spot) instances for a FlowDeploy execution.
Using functions as resource values
def set_preemptible(wildcards, attempt):
    # Use spot instances for the first two attempts, then fall back to on-demand
    return attempt < 3

rule align:
    input: "data/sequences.fasta"
    output: "results/aligned.fasta"
    resources:
        preemptible=set_preemptible,
        mem_mb=32000
    retries: 3
    shell: "aligner --input {input} --output {output}"
This runs alignment jobs on preemptible (spot) instances, but retries on non-preemptible (on-demand) instances after two failures, thanks to the custom set_preemptible function.
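The same callable pattern works for numeric resources. A minimal sketch (the 8000 MB base value is illustrative) that raises the memory request on each attempt, so a job that was killed for lack of memory retries with more:
rule align:
    input: "data/sequences.fasta"
    output: "results/aligned.fasta"
    resources:
        mem_mb=lambda wildcards, attempt: 8000 * attempt  # 8 GB, then 16 GB, then 24 GB
    retries: 3
    shell: "aligner --input {input} --output {output}"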
Tips
Look at what resources your pipeline is actually consuming
Tasks are often allocated more resources than they actually need, and spot instances can terminate at a rate that makes your pipelines cheaper to run on on-demand instances. Without keeping track of resource usage, you won't know how to optimize it!
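Snakemake's built-in benchmark directive is one way to start measuring: it writes per-job runtime and memory statistics, including max RSS, to a TSV file. A minimal sketch:
rule align:
    input: "data/sequences.fasta"
    output: "results/aligned.fasta"
    benchmark: "benchmarks/align.tsv"  # records wall-clock time, max RSS, I/O, and more
    resources:
        mem_mb=32000
    shell: "aligner --input {input} --output {output}"
Comparing the recorded max RSS against the requested mem_mb shows how much headroom you're paying for.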
GPU support is still largely executor dependent
GPU support, and the ability to request specific GPUs, is still largely executor dependent, so check your executor's documentation for how (and whether) GPU requests are interpreted.
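In the meantime, a custom resource works as a portable stopgap for limiting GPU concurrency, since core Snakemake treats unknown resource names as plain counters. A sketch; the gpu resource name and the train_model command are illustrative assumptions:
rule train:
    input: "data/train.csv"
    output: "results/model.pt"
    resources:
        gpu=1  # hypothetical custom resource; used for scheduling only, not placement
    shell: "train_model --input {input} --output {output}"
Running snakemake --resources gpu=2 then limits the workflow to two concurrent GPU jobs, though actually landing jobs on GPU hardware remains the executor's job.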