Shared file systems
More about the shared, distributed file systems we like and use.
Low-to-medium scale
We use an S3-based POSIX-compliant distributed file system that's mounted via FUSE. It's great for low-to-medium scale pipelines, because it:
- has a low initial cost.
- scales like an S3 bucket.
- can be mounted on a bioinformatician's computer and used like a hard drive.
Transfer speeds for large files are similar to S3 – ~250 MB/s. Caching is configured to speed up usage without requiring an S3 call for every file operation.
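The effect of that caching can be sketched in miniature: memoizing metadata lookups means repeated operations on the same path don't each trigger a remote round trip. This is an illustration only, not the file system's actual implementation, which handles caching inside the FUSE layer:

```python
import functools
import os

# Illustrative sketch: cache stat results so repeated metadata lookups
# hit local memory instead of going back through the FUSE layer to S3.
@functools.lru_cache(maxsize=4096)
def cached_stat(path: str) -> os.stat_result:
    return os.stat(path)

cached_stat("/")  # first call does the real lookup
cached_stat("/")  # second call is served from the cache
```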
High scale
For higher scale pipelines, Lustre is preferred (FSx for Lustre on AWS). It's higher cost and harder to manage, but it incurs less latency for file I/O, doesn't use S3 bandwidth, and has a much higher maximum bandwidth.
How it's used
The shared file system is mounted on the machine running the workflow manager, as well as all cloud-based worker nodes. This means a workflow manager using the S3-based file system can be located anywhere, while a shared Lustre file system requires the workflow manager to run in the same private network as the file system.
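Since both the workflow manager and every worker need the mount, it's worth failing fast if it's missing. A minimal check, assuming a hypothetical mount point of `/mnt/shared`:

```python
import os

def assert_shared_fs(path: str) -> None:
    """Fail fast if the shared file system isn't mounted at `path`."""
    if not os.path.ismount(path):
        raise RuntimeError(f"shared file system not mounted at {path}")

# Example: run on the workflow manager and each worker at startup.
# assert_shared_fs("/mnt/shared")  # hypothetical mount point
```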
Debugging
Initial Sync Time
The first time a machine uses the S3-backed file system, it may take a minute to synchronize with the shared file system.
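If a pipeline must not start until that first sync completes, one option is to poll for a known file before proceeding. The marker path and timeout below are hypothetical placeholders:

```python
import os
import time

def wait_for_file(path: str, timeout_s: float = 120.0, poll_s: float = 1.0) -> None:
    """Block until `path` appears on the mounted file system, or time out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if os.path.exists(path):
            return
        time.sleep(poll_s)
    raise TimeoutError(f"{path} did not appear within {timeout_s}s")

# Example: wait_for_file("/mnt/shared/.sync-complete")  # hypothetical marker
```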
Apple Silicon M1 and M2 Macs
Mounting any FUSE-based shared file system on Apple Silicon Macs requires a change to the Mac's startup security settings. This applies to all S3-based file systems.
S3 File Handling
When using S3 files directly with a workflow manager in FlowDeploy Develop mode, they are downloaded through the machine running the workflow manager.
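Because every S3 input is funneled through the single machine running the workflow manager, total staging time scales with combined input size rather than worker count. A rough estimate, borrowing the illustrative ~250 MB/s figure from above:

```python
def staging_seconds(total_input_gb: float, throughput_mb_s: float = 250.0) -> float:
    """Time to pull all S3 inputs through one machine, serially."""
    return (total_input_gb * 1000) / throughput_mb_s

# 1 TB of inputs through one ~250 MB/s link takes about 4000 seconds (~67 min).
print(round(staging_seconds(1000)))  # → 4000
```

This back-of-the-envelope check is a quick way to decide whether a pipeline's inputs are small enough for this mode, or whether a shared file system is the better fit.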