Requests For Comments for the St. Jude Rust Labs project
This repo structure and contents are ~~stolen~~ borrowed from the St. Jude Cloud Team’s rfcs repo.
These RFCs are meant to act as a public archive of our design intentions and to facilitate conversation both amongst the internal team and any external parties who may wish to help shape the future of the Sprocket project. Notably, these documents are not authoritative in any fashion. They are a snapshot of design goals; some details are important to hash out ahead of time, but practical experience with an implementation or changing understanding of a problem space as learned through use of a feature will likely cause deviations from the initial plans as laid out in these RFCs.
We will not make an effort to backport changes to these documents if we feel the conversations have run their course.
Install
cargo install mdbook
Usage
mdbook build
python3 -m http.server -d book
# visit the rendered version in your browser at http://localhost:8000.
📝 License and Legal
This project is licensed as either Apache 2.0 or MIT at your discretion. Additionally, please see the disclaimer that applies to all crates and command line tools made available by St. Jude Rust Labs.
Copyright © 2023-Present St. Jude Children’s Research Hospital.
- Feature Name: sprocket-test
- Start Date: 2025-08
Summary
The Sprocket test framework enables WDL authors to easily and comprehensively validate their WDL tasks and workflows by defining lightweight unit tests that can run in CI environments. This framework is intended to be intuitive and concise.
This framework can maximize test depth without adding boilerplate by allowing users to define “test matrices”, where each WDL task or workflow is run with permutations of the provided inputs.
Motivation
Comprehensive unit testing is a key component of modern software development. Any serious project should endeavor to have a suite of tests that ensure code correctness. These tests should be lightweight enough to run during continuous integration on every set of committed changes.
n.b.: This RFC is primarily focused on CI unit testing, but enabling larger scale “end-to-end” testing is something that should be kept in mind during this design process. That said, I’m of the opinion these are separate enough use cases that they can be approached with differing APIs, and thus decisions made here should not impact any future E2E testing API too heavily.
Guide-level explanation
The Sprocket test framework is primarily specified in TOML, which is expected to be in a file of the same basename as the WDL being tested, but with the .wdl extension replaced by .toml. sprocket test does not require any special syntax or modification of actual WDL files, and any WDL workspace can begin writing tests without needing to refactor their WDL documents. Following this pattern frees the TOML from having to contain any information about where to find the entrypoint of each test. For example, all the test entrypoints (tasks or workflows) in data_structures/flag_filter.toml are expected to be defined in the WDL located at data_structures/flag_filter.wdl.
Any task or workflow defined in data_structures/flag_filter.wdl can have any number of tests associated with it in data_structures/flag_filter.toml. To write a set of tests for a task validate_string_is_12bit_int in flag_filter.wdl, the user will define an array of tables in their TOML with the header [[validate_string_is_12bit_int]]. Multiple tests can be written for the validate_string_is_12bit_int task by repeating the [[validate_string_is_12bit_int]] header. The validate_flag_filter workflow can be tested the same way, by defining a TOML array of tables headered with [[validate_flag_filter]]. These TOML headers must match an entrypoint in the corresponding WDL file. Under each of these headers will be a TOML table with all the required information for a single test.
An example TOML for specifying a suite of tests for the flag_filter.wdl document in the workflows repo would look like:
[[validate_string_is_12bit_int]]
name = "decimal_passes" # each test must have a unique identifier
[validate_string_is_12bit_int.inputs]
number = "5"
# without any assertions explicitly configured, Sprocket will consider the task executing with a 0 exit code to be a "pass" and any non-zero exit code as a "fail"
[[validate_string_is_12bit_int]]
name = "hexadecimal_passes"
[validate_string_is_12bit_int.inputs]
number = "0x900"
[validate_string_is_12bit_int.assertions]
stdout.contains = "Input number (0x900) is valid" # builtin assertion for checking STDOUT logs
[[validate_string_is_12bit_int]]
name = "too_big_hexadecimal_fails"
[validate_string_is_12bit_int.inputs]
number = "0x1000"
[validate_string_is_12bit_int.assertions]
exit_code = 42 # the task should fail for this test
stderr.contains = "Input number (0x1000) is invalid" # similar to the stdout assertion
[[validate_string_is_12bit_int]]
name = "too_big_decimal_fails"
[validate_string_is_12bit_int.inputs]
number = "4096"
[validate_string_is_12bit_int.assertions]
exit_code = 42
stderr.contains = [
"Input number (4096) interpreted as decimal",
"But number must be less than 4096!",
] # `contains` assertion can also be an array of strings
[[validate_flag_filter]] # a workflow test
name = "valid_FlagFilter_passes"
[validate_flag_filter.inputs.flags]
include_if_all = "3" # decimal
exclude_if_any = "0xF04" # hexadecimal
include_if_any = "03" # octal
exclude_if_all = "4095" # decimal
[[validate_flag_filter]]
name = "invalid_FlagFilter_fails"
[validate_flag_filter.inputs.flags]
include_if_all = "" # empty string
exclude_if_any = "this is not a number"
include_if_any = "000000000011" # binary interpreted as octal. Too many digits for octal
exclude_if_all = "4095" # this is fine
[validate_flag_filter.assertions]
should_fail = true
Hopefully, everything in the above TOML is easy enough to grok that I won’t spend time going through the specifics in much detail. The flag_filter.wdl WDL document contains a task and a workflow, both with minimal inputs and no outputs, making the tests fairly straightforward. One of Sprocket’s guiding principles is to only introduce complexity where it’s warranted, and I hope this example demonstrates a case where complexity is not warranted. Next, we will discuss features intended to support more complex test cases, though the end API exposed to users (the focus of this document) still aims to remain simple and intuitive.
Test Data
Most WDL tasks and workflows have File type inputs and outputs, so there should be an easy way to incorporate test files into the framework. This can be accomplished with a tests/fixtures/ directory in the root of the workspace which can be referred to from any TOML test. If the string $FIXTURES is found within a TOML string value within the inputs table, the correct path to the fixtures directory will be dynamically inserted at test run time. This avoids having to track relative paths from TOML that may be arbitrarily nested in relation to test data. For example, let’s assume there are test.bam, test.bam.bai, and reference.fa.gz files located within the tests/fixtures/ directory; the following TOML inputs table could be used regardless of where that actual .toml file resides within the WDL workspace:
bam = "$FIXTURES/test.bam"
bam_index = "$FIXTURES/test.bam.bai"
reference_fasta = "$FIXTURES/reference.fa.gz"
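To sketch how this substitution might work under the hood (the `FIXTURES_PLACEHOLDER` constant and `resolve_fixtures` helper below are illustrative names, not an existing Sprocket API), all that is really needed is a string replacement applied to each TOML string value in the inputs table before the inputs are handed to the engine:

use std::path::Path;

/// Hypothetical placeholder recognized inside TOML string values.
const FIXTURES_PLACEHOLDER: &str = "$FIXTURES";

/// Replace `$FIXTURES` in a single input string with the path to the
/// workspace's `tests/fixtures/` directory.
fn resolve_fixtures(value: &str, fixtures_dir: &Path) -> String {
    value.replace(FIXTURES_PLACEHOLDER, &fixtures_dir.display().to_string())
}

fn main() {
    let fixtures_dir = Path::new("/workspace/tests/fixtures");
    let bam = resolve_fixtures("$FIXTURES/test.bam", fixtures_dir);
    assert_eq!(bam, "/workspace/tests/fixtures/test.bam");
}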
Builtin Assertions
Sprocket will support a variety of common test conditions. In this document so far, we’ve seen a few of the most straightforward conditions already in the assertions table of the earlier TOML example (exit_code, stdout.contains, stderr.contains and should_fail). For the initial release of sprocket test, these builtin assertions will probably remain as a rather small and tailored set, but the implementation should make extending this set in subsequent releases simple and non-breaking. Adding new builtin assertions could be a recommended starting point for new contributors, similar to how new lint rules are fairly straightforward to add.
Some assertions that might exist at initial release:
- `exit_code = <int>` (should an array of ints be supported?)
- `should_fail = <bool>`: only available for workflow tests! Task tests should instead specify an `exit_code`
- `stdout`: a TOML table with sub-assertions for checking a task’s STDOUT log (not available for workflow tests)
  - `contains = <string | array of strings>`: strings are REs
  - `not_contains = <string | array of strings>`: strings are REs
- `stderr`: functionally equivalent to the `stdout` assertions, but runs on the STDERR log instead
- `outputs`: a TOML table populated with task or workflow output identifiers. The specific assertions available will depend on the WDL type of the specified output
  - `<WDL Boolean> = <true|false>`
  - `<WDL Int> = <TOML int>`
  - `<WDL Float> = <TOML float>`
  - `<WDL String>`
    - `contains = <string | array of strings>`: strings are REs
    - `not_contains = <string | array of strings>`: strings are REs
    - `equals = <string>`: an exact match RE
  - `<WDL File>`
    - `name = <string>`: glob pattern that should match
    - `md5 = <string>`: md5sum that the file should have
    - `blake3 = <string>`: blake3 hash that the file should have
    - `sha256 = <string>`: sha256 hash that the file should have
    - `contains = <string | array of strings>`: REs to expect within the file contents
    - `not_contains = <string | array of strings>`: inverse of `contains`
The above is probably (about) sufficient for an initial release. Thoughts about future assertions that could exist will be discussed in the “Future possibilities” section.
Custom Assertions
While the builtin assertions should try to address many common use cases, users need a way to test for things outside the scope of the builtins (especially at launch, when the builtins will be minimal). There needs to be a way for users to execute arbitrary code on the outputs of a task or workflow for validation. This will be exposed via the assertions.custom key, which will accept a name or an array of names of user-supplied executables (most commonly shell or Python scripts) which are expected to be found in a tests/custom/ directory. These executables will be invoked with a single positional argument: a path to the task or workflow’s outputs.json. Users will be responsible for parsing that JSON and performing any validation they desire. So long as the invoked executable exits with a code of zero, the test will be considered to have passed.
For further discussion of why this design was chosen, see the rationale section. The path to an outputs.json is the minimum required for this to be usable, but we could consider exposing other paths or information that may be valuable in a test context via additional arguments or environment variables.
Example
tools/picard.toml
[[merge_sam_files]]
name = "Merge works"
[merge_sam_files.inputs]
bams = [
"$FIXTURES/test1.bam",
"$FIXTURES/test2.bam",
]
prefix = "test.merged"
[merge_sam_files.assertions]
custom = "quickcheck.sh"
Sprocket will look for an executable file named quickcheck.sh in the tests/custom/ directory. That file could contain any arbitrary code, such as:
#!/bin/bash
set -euo pipefail
out_json=$1
out_bam=$(jq -r .bam "$out_json")
samtools quickcheck "$out_bam"
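For completeness, here is a minimal sketch of how Sprocket might invoke such an executable on its side. The function name and paths are illustrative, and details like environment handling and output forwarding are not settled by this RFC:

use std::path::Path;
use std::process::Command;

/// Run a user-supplied custom assertion, passing the path to the task or
/// workflow `outputs.json` as the sole positional argument. A zero exit
/// code means the assertion passed.
fn run_custom_assertion(executable: &Path, outputs_json: &Path) -> std::io::Result<bool> {
    let status = Command::new(executable)
        .arg(outputs_json)
        // STDOUT/STDERR are inherited so the user sees their script's output.
        .status()?;
    Ok(status.success())
}

fn main() -> std::io::Result<()> {
    let passed = run_custom_assertion(
        Path::new("tests/custom/quickcheck.sh"),
        Path::new("outputs.json"), // illustrative path to the run's outputs
    )?;
    println!("custom assertion passed: {passed}");
    Ok(())
}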
Test Matrices
Often, it makes sense to validate that a variety of inputs produce the same test result. While the TOML definitions shared so far are relatively concise, repeating the same test conditions for many different inputs gets repetitive, and writing redundant boilerplate can discourage testing best practices. Sprocket offers a shortcut for avoiding this boilerplate: test matrices, which provide a way to reach comprehensive test depth with minimal repetition. A test matrix is created by defining a matrix TOML array of tables for a set of test inputs. Sprocket will evaluate the Cartesian product of the tables in the matrix array and run each combination of input values, which can be leveraged to test many conditions with a single test definition.
Below, you will find an example for a bam_to_fastq task that defines 3*2*2*2*2*2*1 = 96 different permutations of the task inputs, each of which should be executed by Sprocket, using only ~30 lines of TOML.
[[bam_to_fastq]]
name = "kitchen_sink"
[[bam_to_fastq.matrix]]
bam = [
"$FIXTURES/test1.bam",
"$FIXTURES/test2.bam",
"$FIXTURES/test3.bam",
]
bam_index = [
"$FIXTURES/test1.bam.bai",
"$FIXTURES/test2.bam.bai",
"$FIXTURES/test3.bam.bai",
]
[[bam_to_fastq.matrix]]
bitwise_filter = [
{ include_if_all = "0x0", exclude_if_any = "0x900", include_if_any = "0x0", exclude_if_all = "0x0" },
{ include_if_all = "00", exclude_if_any = "0x904", include_if_any = "3", exclude_if_all = "0" },
]
[[bam_to_fastq.matrix]]
paired_end = [true, false]
[[bam_to_fastq.matrix]]
retain_collated_bam = [true, false]
[[bam_to_fastq.matrix]]
append_read_number = [true, false]
[[bam_to_fastq.matrix]]
output_singletons = [true, false]
[[bam_to_fastq.matrix]]
prefix = ["kitchen_sink_test"] # the `prefix` input will be shared by _all_ permutations of the test matrix
# this test is to ensure all the options (and combinations thereof) are valid
# so no assertions beyond a `0` exit code are needed here
This is perhaps an extreme test case, but it was contrived as a stress test of sorts for the matrix design. This specific case may be too intense to run in a CI environment, but should demonstrate the power of test matrices in aiding comprehensive testing without undue boilerplate.
(Notably, the actual bam_to_fastq task in samtools.wdl (here) does not have a bam_index input, but that was added to this example for illustrative purposes)
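To make the expansion rule concrete, here is a rough sketch of the Cartesian product as I read the example above: values within a single matrix table are zipped together by index (which is why bam and bam_index travel as a pair), and the product is taken across tables. This is illustrative Rust only, with input values simplified to strings; it is not the proposed implementation.

use std::collections::BTreeMap;

/// Each matrix table maps input names to equal-length vectors of candidate
/// values; values within one table are zipped by index, and the Cartesian
/// product is taken across tables.
fn expand_matrix(tables: &[BTreeMap<String, Vec<String>>]) -> Vec<BTreeMap<String, String>> {
    let mut permutations = vec![BTreeMap::new()];
    for table in tables {
        // Number of zipped rows in this table (assumes every vector in the
        // table has the same length).
        let rows = table.values().map(Vec::len).max().unwrap_or(0);
        let mut next = Vec::with_capacity(permutations.len() * rows);
        for partial in &permutations {
            for row in 0..rows {
                let mut inputs = partial.clone();
                for (name, values) in table {
                    inputs.insert(name.clone(), values[row].clone());
                }
                next.push(inputs);
            }
        }
        permutations = next;
    }
    permutations
}

fn main() {
    let mut pairs = BTreeMap::new();
    pairs.insert("bam".to_string(), vec!["test1.bam".to_string(), "test2.bam".to_string()]);
    pairs.insert("bam_index".to_string(), vec!["test1.bam.bai".to_string(), "test2.bam.bai".to_string()]);

    let mut flags = BTreeMap::new();
    flags.insert("paired_end".to_string(), vec!["true".to_string(), "false".to_string()]);

    // 2 (zipped bam/bam_index rows) * 2 (paired_end) = 4 permutations
    assert_eq!(expand_matrix(&[pairs, flags]).len(), 4);
}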
REVIEWERS: I can write more examples of “real” TOML test files, as presumably we will be switching the workflows repo to this framework, in which case any tests written as examples here can hopefully just be re-used with minimal modification for the production tests we want. So don’t be afraid to ask for more examples! I just didn’t want to overload this document ;D
Configuration
All of the expected paths, tests/fixtures/ and tests/custom/, will be configurable. tests/ conflicts with pytest-workflow, so users may want to rename the default directories to something like sprocket-tests/.
Test Filtering
Users will be able to annotate each test with arbitrary tags which will allow them to run subsets of the entire test suite. They will also be able to run the tests in a specific file, as opposed to the default sprocket test behavior which will be to recurse the working directory and run all found tests. This will facilitate a variety of applications, most notably restricting the run to only what the developer knows has changed and parallelizing CI runs.
We may also want to give some tags special meaning: it is common to annotate “slow” tests and to exclude them from runs by default, and we may want to reduce the friction of configuring that case.
Reference-level explanation
REVIEWERS: is this section needed?
This is the technical portion of the RFC. Explain the design in sufficient detail that:
- Its interaction with other features is clear.
- It is reasonably clear how the feature would be implemented.
- Corner cases are dissected by example.
The section should return to the examples given in the previous section, and explain more fully how the detailed proposal makes those examples work.
Drawbacks
Q: What are reasons we should not do this?
A: none! This is a great idea!
To be serious, pytest-workflow seems to be the best test framework for WDL that I’ve been able to find as a WDL author, and as a user of that framework, I think WDL could use something more tailored. I will elaborate further in Prior Art.
Rationale and alternatives
REVIEWERS: I’ve thought through quite a wide variety of implementations that have not made it into writing, and I’m not sure how valuable my musings on alternatives I didn’t like are. I can expand on this section if it would be informative.
Custom Assertions Rationale
The custom assertion design is meant to maximize flexibility without adding implementation complexity. The proposed implementation couldn’t be much simpler: invoke an arbitrary executable with a single positional argument, expect an exit code of zero, and treat anything else as a failed test.
This child process will inherit the parent process’s environment, and it will ultimately be up to test authors to ensure their test environment and dependencies are correct. This may lead to debugging difficulties, and Sprocket will be able to offer very little help with what’s going on (aside from forwarding the STDOUT and STDERR streams).
This is a large drawback of the design, but I believe the flexibility offered here is worth those pains. Users can drop shell scripts, Python scripts, bespoke Rust binaries, or anything else using any framework they want, so long as it has a +x bit and can process a positional argument.
Prior art
This document has been largely informed by my experience as a WDL author and maintainer of the stjudecloud/workflows repository. The CI of that repo uses pytest-workflow.
pytest-workflow
pytest-workflow has been a great stop-gap tool for us. It is a generalized test framework not specific to WDL, which is ultimately what makes it unwieldy for our use cases. The generality of pytest-workflow necessitates a lot of boilerplate in our tests, and that boilerplate was proving a disincentive to writing comprehensive tests. Tests were a pain to write, as a lot of redundant text had to be written to hook up inputs and outputs in a way that both pytest-workflow and a WDL engine could work with.
The WDL community should have a better solution than a generic test framework.
That said, if you are familiar with pytest-workflow, you will likely see some similarities between it and my proposal. I’ve leaned on the existing designs used in pytest-workflow, and made them more specific and ergonomic for WDL. There are 3 primary ways this RFC distinguishes itself from pytest-workflow:
- Understanding IO for WDL, eliminating boilerplate
- Matrix testing to increase test depth
- Advanced builtin assertions
The third point is more aspirational than concrete for the initial release. See future-possibilities for elaboration.
REVIEWERS: I can elaborate further if asked
wdl-ci
I found this tool while looking for existing frameworks when starting this document, which is to say it’s new to me and I have not tried running it, but it has some interesting capabilities.
This is not a unit testing framework, and looks geared towards system testing or end-to-end testing, or whatever term you want to use for ensuring consistency and reproducibility while developing a workflow. Definitely something worth revisiting when we circle back to that use case, but at the moment this is just designed for a different use than my proposal.
pytest-wdl
The last update was 4 years ago, which we on the workflows repo considered a deal breaker when we were initially shopping for CI testing. It seems very similar to pytest-workflow, at least on the surface (of course, they are both plugins to the popular pytest framework), but admittedly I have not dug very deep into this project.
CZ ID repo
This is just a WDL repo, not a full test framework, but they do have a bespoke CI/CD setup that I reviewed. It uses miniwdl under the hood and seems worth mentioning, but it is not a generalizable approach.
Future possibilities
Builtin tests
I think there’s a lot of room for growth in the builtin test conditions. This document only includes what I think is appropriate for an initial release (i.e. assertions that are relatively easy to implement), but that shouldn’t be the end of the builtin assertions. I can imagine us writing bioinformatics-specific assertions using the noodles library for testing things like “is this output file a valid BAM?”, “is this BAM coordinate sorted?”, “is the output BAI file correctly matched to the output BAM?”, and many, many more such checks.
Adoption by the WDL specification?
TOML based test definitions could be lifted over to WDL meta sections if this approach to testing proves valuable. This RFC is concerned with an external testing framework, but this possibility could be explored further down the road.
Validating other engines
First off, I don’t think this is something we want to pursue within Sprocket, but I didn’t want to omit the possibility from this document.
Supporting multiple engines/runners/environments/etc. is valuable and something many WDL authors are looking for. In the workflows repo, we currently validate our tasks with both sprocket run and miniwdl run; ideally we’d like to expand that to include others as well, but it is tricky to get running smoothly.
To be blunt, I think this is out of scope for what Sprocket should be focusing on. An existing “generalized” framework (like pytest-workflow) would be better suited for this kind of validation.
Individual test files
An annoyance for me while working on the workflows CI (with pytest-workflow) is that I often have to write individual input JSON files that are then pointed to in the test definition with a relative path. This meant opening two files to figure out what a test was doing; and the pathing was a pain due to our repo structure and the differing path resolution of Sprocket and miniwdl. This proposal aimed to keep all the relevant test information colocated in a single TOML table, but that does create a restriction where the inputs can’t be trivially used in a different context.
We could explore an alternative syntax that allows test inputs to be defined separately from the test.
Integration of custom assertions
The current proposal for custom assertions is pretty bare bones. This allows for a great deal of flexibility at very little implementation complexity, but we may want to offer tighter integration in the future. Maybe instead of invoking plain executables, we could integrate Python in some way? Calling out Python explicitly, as it is a popular (particularly among bioinformaticians) and flexible language. However environment management with Python dependencies can be a bit of a nightmare, and I’m not really sure of an ergonomic way we could integrate that.
E2E testing
As stated in the “motivation” section, this proposal is ignoring end-to-end (or E2E) tests and is really just focused on enabling unit testing for CI purposes. Perhaps some of this could be re-used for an E2E API, but I have largely ignored that aspect. (Also I have lots of thoughts about what that might look like, but for brevity will not elaborate further.)
Caching
At the time of writing, Sprocket does not yet have a call caching feature. But once that feature lands, it will prove useful for this framework as a way to reduce runtime on subsequent test runs.
- Feature Name: call-caching
- Start Date: 2025-10
Summary
The Sprocket call caching feature enables sprocket run to skip executing
tasks that have been previously executed successfully and instead reuse the
last known outputs of the task.
Motivation
Sprocket currently cannot resume a failed or canceled workflow, meaning that it must re-execute every task that completed successfully on a prior run of that workflow.
As tasks can be very expensive to execute, in terms of both compute resources and time, this can be a major barrier to using Sprocket in the bioinformatics space.
The introduction of caching task outputs so that they can be reused in lieu of re-executing a task will allow Sprocket to quickly resume a workflow run and reduce redundant execution.
Cache Key
The cache key will be a Blake3 digest derived from hashing the following:
- The WDL document URI string.
- The task name as a string.
- The sequence of (name, value) pairs that make up the task’s inputs, ordered lexicographically by name.
This implies that the task’s cache key is sensitive to the WDL document being moved, the task being renamed, or the input values to the task changing; any of the above will cause a cache miss and Sprocket will not distinguish the cause for a cache miss when the key changes.
The result is a 32 byte Blake3 digest that can be represented as a lowercase hexadecimal string (e.g. 295192ea1ec8566d563b1a7587e5f0198580cdbd043842f5090a4c197c20c67a) for the purpose of cache entry file names.
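A sketch of the key computation, assuming the blake3 crate and the internal-string hashing rules described later in this document; the function and its pre-digested string inputs are illustrative, not the actual wdl-engine code:

use blake3::Hasher;

/// Hash an internal string: a 4 byte little-endian length followed by the
/// UTF-8 bytes (matching the "Hashing Internal Strings" rules below).
fn hash_str(hasher: &mut Hasher, s: &str) {
    hasher.update(&(s.len() as u32).to_le_bytes());
    hasher.update(s.as_bytes());
}

/// Compute a task's cache key from its document URI, task name, and inputs.
/// `inputs` holds (name, value digest) pairs here purely for illustration;
/// the real implementation hashes WDL values directly.
fn cache_key(document_uri: &str, task_name: &str, inputs: &[(&str, &str)]) -> String {
    let mut hasher = Hasher::new();
    hash_str(&mut hasher, document_uri);
    hash_str(&mut hasher, task_name);

    // Inputs must be ordered lexicographically by name for a stable key.
    let mut sorted: Vec<_> = inputs.to_vec();
    sorted.sort_by_key(|(name, _)| *name);
    for (name, value) in sorted {
        hash_str(&mut hasher, name);
        hash_str(&mut hasher, value);
    }

    // 32 byte digest rendered as a lowercase hexadecimal string.
    hasher.finalize().to_hex().to_string()
}

fn main() {
    let key = cache_key(
        "file:///workspace/tools/samtools.wdl",
        "bam_to_fastq",
        &[("paired_end", "true"), ("bam", "<content-digest>")],
    );
    println!("{key}");
}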
Call Cache Directory
Once a cache key is calculated, it can be used to locate an entry within the call cache directory.
The call cache directory may be configured via sprocket.toml:
[run.task]
cache_dir = "<path_to_cache>"
The default call cache location will be the user’s cache directory joined with
./sprocket/calls.
The call cache directory will contain an empty .lock file that will be used
to acquire shared and exclusive file locks on the entire call cache; the lock
file serves to coordinate access between sprocket run and a future sprocket clean
command.
During the execution of sprocket run, only a single shared lock will be
acquired on the .lock file and kept for the entirety of the run.
The call cache will have no eviction policy, meaning it will grow unbounded. A
future sprocket clean command might give statistics of current cache sizes
with the option to clean them, if desired. The sprocket clean command would
take an exclusive lock on the .lock file to block any sprocket run
command from executing while it is operating.
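As a sketch of the locking behavior, assuming the fs2 crate for advisory file locks (the real implementation may use a different primitive), the coordination between sprocket run and a future sprocket clean could look like the following:

use std::fs::{self, OpenOptions};

use fs2::FileExt;

fn main() -> std::io::Result<()> {
    // Illustrative location; the real default is the user's cache directory
    // joined with `sprocket/calls`.
    fs::create_dir_all("calls")?;
    let lock_file = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .open("calls/.lock")?;

    // `sprocket run` takes a single shared lock on `.lock` for the whole run...
    lock_file.lock_shared()?;

    // ...while `sprocket clean` would take an exclusive lock instead, blocking
    // until no `sprocket run` holds the shared lock:
    // lock_file.lock_exclusive()?;

    // Locks are released when the handle is dropped, or explicitly:
    lock_file.unlock()?;
    Ok(())
}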
Each entry within the call cache will be a file with the same name as the task’s cache key.
During a lookup of an entry in the cache, a shared lock will be acquired on the individual entry file. During the updating of an entry in the cache, an exclusive lock will be acquired on the individual entry file.
The entry file will contain a JSON object with the following information:
{
"version": 1, // A monotonic version for the entry format.
"command": "<string-digest>", // The digest of the task's evaluated command.
"container": "<string>", // The container used by the task.
"shell": "<string>", // The shell used by the task.
"requirements": {
"<key>": "<value-digest>", // The requirement key and value digest
// ...
},
"hints": {
"<key>": "<value-digest>", // The hint key and value digest
// ...
},
"inputs": {
"<path-or-url>": "<content-digest>", // The previous backend input and its content digest
// ...
},
"exit": 0, // The last exit code of the task.
"stdout": {
"location": "<path-or-url>", // The location of the last stdout output.
"digest": "<content-digest>". // The content digest of the last stdout output.
},
"stderr": {
"location": "<path-or-url>", // The location of the last stderr output.
"digest": "<content-digest>" // The content digest of the last stderr output.
},
"work": {
"location": "<path-or-url>", // The location of the last working directory.
"digest": "<content-digest>" // The content digest of the last working directory.
}
}
Note: as a cache entry may contain absolute paths pointing at files in the
runs directory, deleting or moving a runs directory may invalidate entries
in the call cache.
See the section on cache entry digests for information on how the digests in the cache entry file are calculated.
Call Cache Hit
Checking for a cache hit acquires a shared lock on the call cache entry file.
A cache entry lookup only occurs for the first execution of the task; the call cache is skipped for subsequent retries of the task.
A call cache hit will occur if all of the following criteria are met:
- A file with the same name as the task’s cache key is present in the call cache directory and the file can be deserialized to the expected JSON object.
- The cache entry’s `version` field matches the cache version expected by Sprocket.
- The digest of the currently executing task’s evaluated command matches the cache entry’s `command` field.
- The container used by the task matches the cache entry’s `container` field.
- The shell used by the task matches the cache entry’s `shell` field.
- The digests of the task’s requirements exactly match those in the cache entry’s `requirements` field.
- The digests of the task’s hints exactly match those in the cache entry’s `hints` field.
- The digests of the task’s backend inputs exactly match those in the cache entry’s `inputs` field.
- The digest of the cache entry’s `stdout` field matches the current digest of its location.
- The digest of the cache entry’s `stderr` field matches the current digest of its location.
- The digest of the cache entry’s `work` field matches the current digest of its location.
If any of the criteria above are not met, the failing criterion is logged (e.g. “entry not present in the cache”, “command was modified”, “input was modified”, “stdout file was modified”, etc.) and it is treated as a cache miss.
Upon a call cache hit, a TaskExecutionResult will be created from the
stdout, stderr, and work fields of the cache entry and task execution
will be skipped.
Call Cache Miss
Upon a call cache miss, the task will be executed by passing the request to the task execution backend.
After the task successfully executes on its first attempt (and only the first attempt), the following occurs:
- Content digests will be calculated for `stdout`, `stderr`, and `work` of the execution result returned by the execution backend.
- An exclusive lock is acquired on the cache entry file.
- The new cache entry is JSON serialized into the cache entry file.
If a task fails to execute on its first attempt, the task’s cache entry will not be updated regardless of a successful retry.
Note: a non-zero exit code of a task’s execution is not inherently a failure as the WDL task may specify permissible non-zero exit codes.
Cache Entry Digests
A cache entry may contain three different types of digests as lowercase hexadecimal strings:
- A digest produced by hashing an internal string.
- A digest produced by hashing a WDL value.
- A content digest of a backend input.
Blake3 will be used as the hash algorithm for producing the digests.
Hashing Internal Strings
Hashing an internal string (i.e. a string used internally by the engine) will update the hasher with:
- A four byte length value in little endian order.
- The UTF-8 bytes representing the string.
Hashing WDL Values
For hashing the values of the requirements and hints sections, a Blake3 hasher will be updated as described in this section.
Compound values will recursively hash their contained values.
Hashing a None value
A None value will update the hasher with:
- A byte with a value of `0` to indicate a `None` variant.
Hashing a Boolean value
A Boolean value will update the hasher with:
- A byte with a value of `1` to indicate a `Boolean` variant.
- A byte with a value of `1` if the value is `true` or `0` if the value is `false`.
Hashing an Int value
An Int value will update the hasher with:
- A byte with a value of `2` to indicate an `Int` variant.
- An 8 byte value representing the signed integer in little endian order.
Hashing a Float value
A Float value will update the hasher with:
- A byte with a value of `3` to indicate a `Float` variant.
- An 8 byte value representing the float in little endian order.
Hashing a String value
A String value will update the hasher with:
- A byte with a value of `4` to indicate a `String` variant.
- The internal string value of the `String`.
Hashing a File value
A File value will update the hasher with:
- A byte with a value of `5` to indicate a `File` variant.
- The internal string value of the `File`.
For the purpose of hashing a File value, the contents of the file specified
by the value are not considered.
If the File is a backend input, the contents will be taken into consideration
when backend input content digests are produced.
Hashing a Directory value
A Directory value will update the hasher with:
- A byte with a value of `6` to indicate a `Directory` variant.
- The internal string value of the `Directory`.
For the purpose of hashing a Directory value, the contents of the directory
specified by the value are not considered.
If the Directory is a backend input, the contents will be taken into
consideration when backend input content digests are produced.
Hashing a Pair value
A Pair value will update the hasher with:
- A byte with a value of `7` to indicate a `Pair` variant.
- The recursive hash of the `left` value.
- The recursive hash of the `right` value.
Hashing an Array value
An Array value will update the hasher with:
- A byte with a value of `8` to indicate an `Array` variant.
- The sequence of elements contained in the array, in insertion order.
Hashing a Map value
A Map value will update the hasher with:
- A byte with a value of `9` to indicate a `Map` variant.
- The sequence of (key, value) pairs, in insertion order.
Hashing an Object value
An Object value will update the hasher with:
- A byte with a value of `10` to indicate an `Object` variant.
- The sequence of (key, value) pairs, in insertion order.
Hashing a Struct value
A Struct value will update the hasher with:
- A byte with a value of `11` to indicate a `Struct` variant.
- The sequence of (field name, value) pairs, in field declaration order.
Hashing a hints value (WDL 1.2+)
A hints value will update the hasher with:
- A byte with a value of `12` to indicate a `hints` variant.
- The sequence of (key, value) pairs, in insertion order.
Hashing an input value (WDL 1.2+)
An input value will update the hasher with:
- A byte with a value of `13` to indicate an `input` variant.
- The sequence of (key, value) pairs, in insertion order.
Hashing an output value (WDL 1.2+)
An output value will update the hasher with:
- A byte with a value of `14` to indicate an `output` variant.
- The sequence of (key, value) pairs, in insertion order.
Hashing Sequences
Hashing a sequence will update the hasher with:
- A four byte length value in little endian order.
- The hash of each element in the sequence.
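The variant rules above translate fairly directly into a recursive hashing routine. The sketch below uses a simplified stand-in enum covering only a few variants (it is not wdl-engine's value type), and it reads "recursive hash" as recursively updating the same hasher:

use blake3::Hasher;

/// A simplified stand-in for the engine's WDL value type (only a few
/// variants are shown for illustration).
enum Value {
    None,
    Boolean(bool),
    Int(i64),
    WdlString(String),
    Pair(Box<Value>, Box<Value>),
    Array(Vec<Value>),
}

/// Hash an internal string: 4 byte little-endian length, then UTF-8 bytes.
fn hash_str(hasher: &mut Hasher, s: &str) {
    hasher.update(&(s.len() as u32).to_le_bytes());
    hasher.update(s.as_bytes());
}

/// Hash a sequence: 4 byte little-endian length, then each element's hash.
fn hash_seq(hasher: &mut Hasher, values: &[Value]) {
    hasher.update(&(values.len() as u32).to_le_bytes());
    for value in values {
        hash_value(hasher, value);
    }
}

/// Recursively update the hasher according to the variant rules above.
fn hash_value(hasher: &mut Hasher, value: &Value) {
    match value {
        Value::None => {
            hasher.update(&[0]);
        }
        Value::Boolean(b) => {
            hasher.update(&[1]);
            hasher.update(&[*b as u8]);
        }
        Value::Int(i) => {
            hasher.update(&[2]);
            hasher.update(&i.to_le_bytes());
        }
        Value::WdlString(s) => {
            hasher.update(&[4]);
            hash_str(hasher, s);
        }
        Value::Pair(left, right) => {
            hasher.update(&[7]);
            hash_value(hasher, left);
            hash_value(hasher, right);
        }
        Value::Array(elements) => {
            hasher.update(&[8]);
            hash_seq(hasher, elements);
        }
    }
}

fn main() {
    let mut hasher = Hasher::new();
    hash_value(
        &mut hasher,
        &Value::Array(vec![Value::Int(42), Value::Boolean(true)]),
    );
    println!("{}", hasher.finalize().to_hex());
}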
Content Digests
As wdl-engine already calculates digests of files and directories for
uploading files to cloud storage, the call caching implementation will make use
of the existing content digest cache, with some improvements.
Keep in mind that File and Directory values may be either local file paths
(e.g. /foo/bar.txt) or remote URLs (e.g. https://example.com/bar.txt,
s3://foo/bar.txt, etc.).
Local File Digests
Calculating the content digest of a local file is as simple as feeding every
byte of the file’s contents to a Blake3 hasher; functions that mmap
large files to calculate the digest will also be utilized.
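A sketch of the local file case, assuming the blake3 crate (streaming shown here; the mmap-based helpers mentioned above would be a drop-in optimization for large files):

use std::fs::File;
use std::io;
use std::path::Path;

/// Compute the Blake3 content digest of a local file by streaming its bytes
/// into the hasher.
fn local_file_digest(path: &Path) -> io::Result<String> {
    let mut file = File::open(path)?;
    let mut hasher = blake3::Hasher::new();
    // blake3::Hasher implements `std::io::Write`, so we can stream into it.
    io::copy(&mut file, &mut hasher)?;
    Ok(hasher.finalize().to_hex().to_string())
}

fn main() -> io::Result<()> {
    println!("{}", local_file_digest(Path::new("Cargo.toml"))?);
    Ok(())
}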
Remote File Digests
A HEAD request will be made for the remote file URL.
If the remote URL is for a supported cloud storage service, the response is
checked for the appropriate metadata header (e.g. x-ms-meta-content_digest,
x-amz-meta-content-digest, or x-goog-meta-content-digest) and the header is
treated like a Content-Digest header.
Otherwise, the response must have either a Content-Digest
header or a strong ETag header. If the response does not have the
required header or if the header’s value is invalid, it is treated as a failure
to calculate the content digest.
If the HEAD request is unsuccessful and the error is considered to be a
“transient” failure (e.g. a 500 response), the HEAD request is retried
internally up to some configurable limit. If the request is unsuccessful after
exhausting the retries, it is treated as a failure to calculate the content
digest.
If a Content-Digest header was returned, the hasher is updated with:
- A `0` byte to indicate the header was `Content-Digest`.
- The algorithm string of the header.
- The sequence of digest bytes of the header.
If an ETag header was returned, the hasher is updated with:
- A `1` byte to indicate the header was `ETag`.
- The strong `ETag` header value string.
Note that Sprocket will not verify that the content digest reported by the header matches the actual content digest of the file as that requires downloading the file’s entire contents.
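A sketch of how the header-derived evidence could be folded into the hasher; the RemoteEvidence type is illustrative, standing in for whatever the HTTP layer actually parses out of the response:

use blake3::Hasher;

/// The header-derived evidence used in place of downloading remote content.
/// (Illustrative type; the real implementation parses HTTP responses.)
enum RemoteEvidence {
    /// Parsed `Content-Digest` header: algorithm name and raw digest bytes.
    ContentDigest { algorithm: String, digest: Vec<u8> },
    /// A strong `ETag` header value.
    ETag(String),
}

/// Fold header evidence into a Blake3 hasher per the rules above.
fn hash_remote_evidence(hasher: &mut Hasher, evidence: &RemoteEvidence) {
    match evidence {
        RemoteEvidence::ContentDigest { algorithm, digest } => {
            hasher.update(&[0]); // 0 byte: the header was Content-Digest
            hasher.update(algorithm.as_bytes());
            hasher.update(digest);
        }
        RemoteEvidence::ETag(value) => {
            hasher.update(&[1]); // 1 byte: the header was a strong ETag
            hasher.update(value.as_bytes());
        }
    }
}

fn main() {
    let mut hasher = Hasher::new();
    hash_remote_evidence(&mut hasher, &RemoteEvidence::ETag("\"abc123\"".to_string()));
    println!("{}", hasher.finalize().to_hex());
}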
Local Directory Digests
The content digest of a local directory is calculated by recursively walking the directory in a consistent order and updating a Blake3 hasher based on each entry of the directory.
A directory’s entry is hashed with:
- The relative path of the directory entry.
- A `0` byte if the entry is a file or `1` if it is a directory.
- If the entry is a file, the hasher is updated with the contents of the file.
Finally, a four byte (little endian) entry count value is written to the hasher before it is finalized to produce the 32 byte Blake3 content digest of the directory.
Note: it is an error if the directory contains a symbolic link to a directory that creates a cycle (i.e. to an ancestor of the directory being hashed).
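A sketch of the local directory scheme, assuming the walkdir crate for a consistently ordered walk (walkdir does not follow symlinks by default, which sidesteps the cycle case noted above); this is not the wdl-engine implementation:

use std::fs::File;
use std::io;
use std::path::Path;

use walkdir::WalkDir;

/// Compute a directory's Blake3 content digest by walking it in a consistent
/// (file-name sorted) order.
fn local_dir_digest(root: &Path) -> io::Result<String> {
    let mut hasher = blake3::Hasher::new();
    let mut count: u32 = 0;

    for entry in WalkDir::new(root).min_depth(1).sort_by_file_name() {
        let entry = entry?; // walkdir errors convert into io::Error
        let relative = entry
            .path()
            .strip_prefix(root)
            .expect("entries are always under the root");

        // The relative path of the directory entry.
        hasher.update(relative.to_string_lossy().as_bytes());
        // A 0 byte for a file, 1 for a directory.
        hasher.update(&[u8::from(entry.file_type().is_dir())]);
        // Files additionally contribute their full contents.
        if entry.file_type().is_file() {
            io::copy(&mut File::open(entry.path())?, &mut hasher)?;
        }
        count += 1;
    }

    // Finally, the little-endian entry count before finalizing.
    hasher.update(&count.to_le_bytes());
    Ok(hasher.finalize().to_hex().to_string())
}

fn main() -> io::Result<()> {
    println!("{}", local_dir_digest(Path::new("."))?);
    Ok(())
}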
Remote Directory Digests
cloud-copy has the facility to walk a “directory” cloud storage URL; it uses
the specific cloud storage API to list all objects that start with the
directory’s prefix.
The content digest of a remote directory is calculated by using cloud-copy to
walk the directory in a consistent order and then updating a Blake3 hasher
based on each entry of the directory.
A directory’s entry is hashed with:
- The relative path of the entry from the base URL.
- The 32 byte Blake3 digest of the remote file entry.
Finally, a four byte (little endian) entry count value is written to the hasher before it is finalized to produce the 32 byte Blake3 content digest of the directory.
Enabling Call Caching
A setting in sprocket.toml can control whether or not call caching is
enabled for every invocation of sprocket run:
[run.task]
cache = "off|on|explicit" # defaults to `off`
The supported values for the cache setting are:
- `off` - do not check the call cache or write new cache entries at all.
- `on` - check the call cache and write new cache entries for all tasks except those that have a `cacheable: false` hint.
- `explicit` - check the call cache and write new cache entries only for tasks that have a `cacheable: true` hint.
Sprocket will default the setting to `off` as it is safer to let users consciously opt in than to potentially serve stale results from the cache without the user’s knowledge that call caching is occurring.
Opting Out
When call caching has been enabled, users may desire to opt-out of call caching
for individual tasks or a single sprocket run invocation.
Task Opt Out
An individual task may opt out of call caching through the use of the
cacheable hint:
hints {
"cacheable": false
}
The cacheable hint defaults to false if the task.cache setting is
explicit; otherwise, the hint defaults to true.
When cacheable is false, the call cache is not checked prior to task
execution and the result of the task’s execution is not cached.
Run Opt Out
A single invocation of sprocket run may pass the --no-call-cache option.
Doing so disables the use of the call cache for that specific run, both in terms of looking up results and storing results in the cache.
Failure Modes for Sprocket
Sprocket currently uses a fail fast failure mode where Sprocket immediately
attempts to cancel any currently executing tasks and return the error it
encountered. This is also the behavior of the user invoking Ctrl-C to
interrupt evaluation.
Failing this way may cancel long-running tasks that would otherwise have succeeded and subsequently prevent caching results for those tasks.
To better support call caching, Sprocket should be enhanced to support a fail slow failure mode (the new default), with users able to configure Sprocket to use the previous fail fast behavior when desired.
With a fail slow failure mode, currently executing tasks are awaited to completion, and their successful results are cached before attempting to abort the run.
This also changes how Sprocket handles Ctrl-C. Sprocket should now support
multiple Ctrl-C invocations depending on its configured failure mode:
- If the configured failure mode is fail slow, the user invokes `Ctrl-C` and Sprocket prints a message informing the user that it is waiting on outstanding task executions to complete and to hit `Ctrl-C` again to cancel tasks instead. It then proceeds to wait for executing tasks to complete in order to cache successful results.
- The user invokes `Ctrl-C` and Sprocket prints a message informing the user that it is now canceling the executing tasks and to hit `Ctrl-C` again to immediately terminate Sprocket. It then proceeds to cancel the executing tasks and wait for the cancellations to occur.
- The user invokes `Ctrl-C` and Sprocket immediately errors with an “evaluation aborted” error message.
The failure mode can be configured via sprocket.toml:
[run]
fail = "slow|fast" # Defaults to `slow`
Provenance Tracking and Analysis Management
Table of Contents
- Summary
- Motivation
- Architecture Overview
- Database Schema
- Directory Structure
- Index Functionality
- CLI Workflow
- Server Mode
- Rationale and Alternatives
Summary
This RFC proposes a comprehensive provenance tracking and analysis management system for Sprocket built on the principle of progressive disclosure. The system automatically tracks all workflow executions in a SQLite database while maintaining a dual filesystem organization: a complete provenance record preserving every execution detail in chronological directories (runs/), and an optional user-defined logical index (index/) that symlinks outputs into domain-specific hierarchies (e.g., by project or analysis type). This approach scales from simple single-workflow use cases to production analysis management systems handling thousands of samples, providing both the auditability of complete execution history and the usability of custom organization without requiring users to choose one or maintain both manually.
Motivation
Sprocket currently focuses on executing workflows and producing outputs, but several critical aspects of production bioinformatics work remain unaddressed. The current output directory structure, organized solely by workflow name and timestamp, makes it difficult for users to find their results, particularly when managing multiple projects. While a complete provenance filesystem organized by execution time provides valuable auditability—preserving every execution detail, retry attempt, and task output in a structured hierarchy—this very completeness creates organizational complexity. Outputs are scattered across timestamped directories deep within nested task execution paths, with no logical organization by project or data type. Users need both the complete provenance record for reproducibility and a simplified, domain-specific view for everyday access. Currently, they must choose one or the other, or maintain both manually through external scripts. There is also no way to track all workflows run against a given sample or understand analysis lineage over time.
Additionally, Sprocket maintains no persistent record of execution history. Users cannot query which workflows were run, when they executed, what inputs were provided, or who submitted them. This lack of provenance tracking makes it impossible to audit past analyses or understand the evolution of results. Furthermore, there is no real-time visibility into running workflows across multiple submissions, making it difficult to monitor the overall state of an analysis pipeline or identify bottlenecks.
Existing solutions in the broader ecosystem fall into two categories. Lightweight engines handle execution well but leave tracking and organization to external tools that users must discover, install, and configure separately. Enterprise analysis management systems provide comprehensive tracking but require significant infrastructure to deploy, creating a barrier to entry that prevents users from simply trying them out.
This RFC proposes a middle path through progressive disclosure: Sprocket provides sophisticated analysis management capabilities that activate automatically as users need them, without requiring upfront infrastructure or configuration.
User Journey
A user learns about Sprocket and wants to try a yak shaving workflow. They have a yak named Fluffy who desperately needs a haircut, so they download Sprocket and run:
sprocket run yak_shaving.wdl -i defaults.json yak_name=fluffy style=mohawk
A directory called out/ is created automatically in their current working directory. Within it, they find their workflow outputs organized in out/runs/yak_shaving/<timestamp>/, where the timestamp corresponds to when they ran the workflow. A database at out/database.db silently tracks the execution, but the user doesn’t need to think about it. They have their results as returned in the output JSON. This is the Sprocket “light-disclosure” experience—working results with zero configuration.
The user finds this approach easy and begins to wonder if Sprocket can style all yaks in their herd. Beyond just performing the styling, the user hopes Sprocket can organize the yak satisfaction surveys for long-term safe keeping.
Looking through the documentation, they discover the --index-on flag and adapt their workflow:
for YAK in yaks/*; do
sprocket run yak_shaving.wdl -i defaults.json \
yak_name=$YAK style=mohawk \
--index-on "YakProject/2025/$YAK"
done
As these workflows complete, a new directory structure appears under out/index/YakProject/2025/, with each yak’s photos and satisfaction surveys organized in subdirectories by yak name. The index contains symlinks pointing back to files in out/runs/, so the historical execution record remains intact while the index provides the logical organization the user cares about. If they rerun a yak’s styling, the index automatically updates to point to the new results while the database’s index_log table preserves the complete history of what was indexed at each point in time. The entire directory structure is portable—moving the out/ directory with mv preserves all relationships. With just one additional flag, they’ve unlocked organized output management across their entire herd. This is the Sprocket “medium-disclosure” experience.
Now satisfied with their organized outputs, the user realizes they’d like real-time monitoring of running workflows and the ability to submit styling orders remotely from their laptop while execution happens on a shared server. They run:
sprocket server --port 8080
The HTTP server starts immediately, connecting to the existing database and making all historical runs queryable through a REST API. They can now submit workflows via HTTP and monitor progress through API queries. All workflows submitted through the API are tracked in the same database alongside CLI submissions, and all outputs appear in the same out/ directory structure. There’s no database setup, no configuration files to manage, no migration of historical data—they simply started the server and gained remote access and monitoring capabilities.
As the yak grooming business grows, the user’s team begins submitting hundreds of workflows per day from multiple servers. SQLite’s single-writer model handles this well initially, but they eventually want to scale to thousands of concurrent submissions with multiple Sprocket servers sharing a central database. At this point, they provision a PostgreSQL database and update their configuration:
[database]
url = "postgresql://postgres:postgres@db.example.com:5432/sprocket"
Then they run a single command to transfer their existing SQLite data:
sprocket database transfer --from ./out/database.db --to postgresql://postgres:postgres@db.example.com:5432/sprocket
This transfers all historical workflow executions and index logs to PostgreSQL. The familiar output directory structure remains unchanged—runs/ and index/ still live on the filesystem—but the database now runs on dedicated infrastructure with MVCC-based concurrency control, enabling unlimited concurrent writers across multiple Sprocket server instances. The user has progressed from zero-configuration local execution to enterprise-scale workflow orchestration, with each transition requiring configuration only when their needs demanded it. This is the Sprocket “heavy-disclosure” experience.
This progression happens naturally as the user’s needs grow. Each step builds on the previous one, with infrastructure complexity introduced only when scale demands it.
Architecture Overview
The system consists of three independent but coordinated components sharing a common output directory and database.
- CLI Execution Mode (`sprocket run`). Direct workflow execution that initializes the output directory on first use, creates the database schema via migrations, executes workflows in-process, writes execution metadata transactionally to the database, and creates index symlinks when specified via `--index-on`.
- Server Mode (`sprocket server`). An HTTP API server that initializes the output directory on first use if needed, accepts workflow submissions via REST API, executes workflows using the same engine as the CLI, provides query endpoints for run history and status, and shares the database with the CLI via SQLite WAL mode for concurrency.
- Output directory (`./out/`). Filesystem-based storage containing `database.db` (a SQLite database tracking all executions), `runs/<workflow>/<timestamp>/` (execution directories organized by workflow and timestamp), and `index/` (optional symlinked organization created via `--index-on`).
All three components use identical database schema and filesystem conventions, enabling seamless interoperability. A workflow submitted via CLI is immediately visible to the server, and vice versa.
To ensure consistent workflow execution behavior between CLI and server modes, we’ll refactor the current evaluation and execution code to use an actor-based architecture. A workflow manager actor will handle workflow lifecycle management, database updates, and index creation through message passing. The actor will spawn one or more Tokio tasks to execute workflows concurrently. For sprocket run, the actor will manage a single workflow execution task. For sprocket server, the same actor implementation will manage multiple concurrent workflow execution tasks, one per submitted workflow. This shared architecture ensures that workflow execution semantics, error handling, and database interactions remain identical regardless of submission method.
Database Schema
The provenance database will use a simple schema optimized for common queries while keeping implementation straightforward. Each output directory will be versioned to ensure compatibility between different Sprocket releases.
SQLite Configuration
Sprocket will configure SQLite with specific pragma settings to optimize for concurrent access, performance, and data integrity. These settings are divided into two categories:
Persistent settings (applied once when database is created):
pragma journal_mode = wal;
pragma synchronous = normal;
pragma temp_store = memory;
- `journal_mode = wal` enables Write-Ahead Logging, allowing multiple concurrent readers with a single writer. This setting persists across all connections and provides the foundation for CLI and server to share the database.
- `synchronous = normal` balances durability and performance in WAL mode. The database remains safe from corruption but may lose the most recent transaction in the event of a power failure or system crash.
- `temp_store = memory` stores temporary tables in memory for better performance.
Per-connection settings (applied when opening each connection):
pragma foreign_keys = on;
pragma busy_timeout = 5000;
pragma cache_size = 2000;
- `foreign_keys = on` enables foreign key constraint enforcement for referential integrity.
- `busy_timeout = 5000` configures a 5-second timeout when the database is locked. If a write cannot proceed immediately due to another concurrent write, SQLite will retry for up to 5 seconds before returning an error. This prevents spurious failures during normal concurrent access patterns.
- `cache_size = 2000` allocates approximately 8MB for SQLite’s page cache (assuming 4KB pages), improving query performance.
Metadata Table
The metadata table tracks the output directory schema version:
create table if not exists metadata (
-- Metadata key
key text primary key,
-- Metadata value
value text not null
);
-- Insert schema version
insert into metadata (key, value) values ('schema_version', '1');
Sprocket checks the schema_version on startup and automatically migrates the database schema to the current version if needed. Migrations are applied atomically to ensure database consistency.
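A sketch of that startup check, assuming the rusqlite crate; the constant and function are illustrative, and the real implementation would hand off to a migration runner rather than just report a boolean:

use rusqlite::{Connection, OptionalExtension, Result};

/// The schema version this build of Sprocket expects.
const EXPECTED_SCHEMA_VERSION: i64 = 1;

/// Read the stored schema version and decide whether migrations are needed.
fn needs_migration(conn: &Connection) -> Result<bool> {
    let stored: Option<String> = conn
        .query_row(
            "select value from metadata where key = 'schema_version'",
            [],
            |row| row.get(0),
        )
        .optional()?;

    match stored {
        Some(v) if v.parse::<i64>() == Ok(EXPECTED_SCHEMA_VERSION) => Ok(false),
        // Missing or older version: migrations must run before proceeding.
        _ => Ok(true),
    }
}

fn main() -> Result<()> {
    let conn = Connection::open("out/database.db")?;
    println!("needs migration: {}", needs_migration(&conn)?);
    Ok(())
}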
Invocations Table
The invocations table groups related workflow submissions.
create table if not exists invocations (
-- Unique invocation identifier (UUID v4)
id text primary key,
-- How workflows are submitted — `cli` or `http`
submission_method text not null,
-- User or client that created the invocation
created_by text,
-- When the invocation was created
created_at timestamp not null
);
Each sprocket run command creates its own invocation with created_by populated from the $USER environment variable or system username. A running sprocket server instance creates a single invocation at startup that is shared by all workflows submitted to that server.
Workflows Table
The workflows table tracks individual workflow executions.
create table if not exists workflows (
-- Unique run identifier (UUID v4)
id text primary key,
-- A link to the invocation that created this workflow
invocation_id text not null,
-- Workflow name extracted from WDL document
name text not null,
-- Workflow source (file path, URL, or git reference)
source text not null,
-- Current execution status — `pending`, `running`, `completed`, or `failed`
status text not null,
-- JSON-serialized workflow inputs
inputs text,
-- JSON-serialized workflow outputs (`null` until completion)
outputs text,
-- Error message if status is `failed` (`null` otherwise)
error text,
-- Relative path to execution directory from `database.db` (e.g., `runs/workflow_name/2025-11-07_143022123456`)
execution_dir text not null,
-- When run record was created
created_at timestamp not null,
-- When execution started (`null` if still pending)
started_at timestamp,
-- When execution finished, success or failure (`null` if still running)
completed_at timestamp,
foreign key (invocation_id) references invocations(id)
);
Index Log Table
The index_log table tracks the history of index symlink updates.
create table if not exists index_log (
-- Unique log entry identifier (UUID v4)
id text primary key,
-- Path within the index directory (e.g., `YakProject/2025/Fluffy/final_photo.jpg`)
index_path text not null,
-- Target path relative to `database.db` that the symlink points to (e.g., `runs/yak_shaving/2025-11-07_143022123456/calls/trim_and_style/attempts/0/work/final_photo.jpg`)
target_path text not null,
-- Foreign key to `workflows.id` (which workflow created this symlink)
workflow_id text not null,
-- When this symlink was created or updated
created_at timestamp not null,
foreign key (workflow_id) references workflows(id)
);
Each time a workflow creates or updates a symlink in the index (via --index-on), a record is inserted into this table. For workflows that update an existing index path, both the old and new symlink targets are preserved in the log, enabling complete historical tracking. Users can query this table to determine what data was indexed at any point in time by finding the most recent log entry before a given date.
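As a sketch of such a point-in-time query (assuming rusqlite and the schema above; the SQL shows one way to pick the most recent entry per index path at or before a timestamp):

use rusqlite::{Connection, Result};

/// Find the target each index path pointed to as of a given timestamp by
/// taking the most recent `index_log` entry at or before that time.
fn index_as_of(conn: &Connection, as_of: &str) -> Result<Vec<(String, String)>> {
    let mut stmt = conn.prepare(
        "select index_path, target_path
         from index_log
         where created_at <= ?1
           and created_at = (
             select max(created_at)
             from index_log as newer
             where newer.index_path = index_log.index_path
               and newer.created_at <= ?1
           )
         order by index_path",
    )?;
    let rows = stmt
        .query_map([as_of], |row| Ok((row.get(0)?, row.get(1)?)))?
        .collect::<Result<Vec<_>>>()?;
    Ok(rows)
}

fn main() -> Result<()> {
    let conn = Connection::open("out/database.db")?;
    for (index_path, target_path) in index_as_of(&conn, "2025-11-08 00:00:00")? {
        println!("{index_path} -> {target_path}");
    }
    Ok(())
}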
Concurrency
SQLite operates in WAL (Write-Ahead Logging) mode, which allows multiple concurrent readers, one writer at a time (writes are serialized), and readers to access the database during writes. This enables CLI and server to operate simultaneously on the same database without coordination. Database locks are held briefly (milliseconds per transaction), making contention unlikely for typical workflow submission rates. In the rare case of write conflicts, Sprocket will automatically retry the transaction with exponential backoff.
Directory Structure
Output Directory Layout
The output directory is the fundamental unit of organization containing all workflow executions, metadata, and indexes:
./out/
├─ database.db # SQLite provenance database
├─ database.db-shm # SQLite shared memory (WAL mode)
├─ database.db-wal # SQLite write-ahead log (WAL mode)
├─ runs/ # Workflow execution directories
│ └─ <workflow_name>/ # Workflow-specific directory
│ ├─ <timestamp>/ # Individual run (YYYY-MM-DD_HHMMSSffffff)
│ │ └─ calls/ # Task execution directories
│ │ └─ <task_call_id>/ # Task identifier (e.g., "hello-0")
│ │ ├─ attempts/ # Retry attempts directory
│ │ │ └─ <attempt_number>/ # Attempt number (0, 1, 2, ...)
│ │ │ ├─ command # Executed shell script
│ │ │ ├─ stdout # Task standard output
│ │ │ ├─ stderr # Task standard error
│ │ │ └─ work/ # Task working directory
│ │ │ └─ <output_files> # Task-generated output files
│ │ └─ tmp/ # Temporary localization files
│ └─ _latest -> <timestamp>/ # Symlink to most recent run (Unix only)
└─ index/ # Optional symlinked organization
└─ <user_defined_path>/ # Created via --index-on flag
└─ <symlinks_to_outputs> # Symlinks to files in runs/
Directory Behaviors
- On first sprocket run or sprocket server invocation, ./out/ is created if missing and database.db is initialized with schema migrations. If invoked via sprocket run, the runs/<workflow_name>/<timestamp>/ directory structure is created for execution.
- The output directory location defaults to ./out/ relative to the current working directory and is configurable via the --out-dir flag, the SPROCKET_OUTPUT_DIR environment variable, or ~/.config/sprocket/Sprocket.toml.
- The entire output directory is relocatable via mv. All paths stored in database.db are relative to the database file, enabling portability when the output directory is moved (see the sketch just below).
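As a small illustration of why relative paths make the directory relocatable, a consumer resolves any stored path against wherever database.db currently lives; the prefix below is a placeholder, not a fixed location:
-- Resolve stored relative paths against the database's current location
select '/new/location/out/' || execution_dir as absolute_execution_dir
from workflows;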
Index Functionality
The index provides a user-defined logical organization of outputs on top of the execution-oriented runs directory structure. On Windows, creating symlinks requires administrator privileges or Developer Mode (Windows 10 Insiders build 14972 or later); Windows 11 allows unprivileged symlink creation without Developer Mode. Because index symlinks are explicitly requested by the user via --index-on, failing to create them would silently leave the expected organization missing, so if Sprocket cannot create symlinks due to insufficient permissions, index creation fails with an error instructing the user to run with administrator privileges or enable Developer Mode. The _latest symlink (pointing to the most recent run) is attempted on all platforms but emits a debug-level log message on failure rather than an error: it is a convenience feature, and its absence does not prevent workflow completion or access to results.
Index Creation
Users create indexes via the --index-on flag:
sprocket run yak_shaving.wdl -i inputs.json \
--index-on "YakProject/2025/Fluffy"
This produces:
./out/
├─ runs/
│ └─ yak_shaving/
│ └─ 2025-11-07_143022123456/
│ └─ calls/
│ └─ trim_and_style/
│ └─ attempts/
│ └─ 0/
│ └─ work/
│ ├─ final_photo.jpg
│ └─ grooming_report/
│ ├─ satisfaction.html
│ └─ style_metrics.txt
└─ index/
└─ YakProject/
└─ 2025/
└─ Fluffy/
├─ outputs.json # Complete workflow outputs (all types)
├─ final_photo.jpg -> ../../../../runs/yak_shaving/2025-11-07_143022123456/calls/trim_and_style/attempts/0/work/final_photo.jpg
└─ grooming_report -> ../../../../runs/yak_shaving/2025-11-07_143022123456/calls/trim_and_style/attempts/0/work/grooming_report
Symlinking Behavior
All workflow output files and directories (as declared in the WDL workflow’s output section) are symlinked into the index path. An outputs.json file containing the complete workflow outputs (including primitive types like strings, integers, and booleans that cannot be symlinked) is also written to the index path alongside the symlinks. Task-internal files not declared as workflow outputs are not indexed. File outputs are symlinked directly to the file, while directory outputs have the directory itself symlinked rather than individual files within it.
When a workflow is re-run with the same --index-on path, the existing outputs.json is replaced and existing symlinks at that path are removed. New symlinks pointing to the latest run outputs are created. The index always reflects the most recent successful run for a given index path, with the complete history preserved in the index_log database table.
Symlinks use relative paths (e.g., ../../../../runs/...), allowing the entire ./out/ directory to be moved while preserving index functionality. If the index directory is accidentally deleted or needs to be reconstructed, users can rebuild the index from the database history:
sprocket index rebuild --out-dir ./out
This command queries the index_log table for the most recent entry for each distinct index path and recreates the corresponding symlinks and outputs.json files, restoring the index to its last known state.
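One way to express "most recent entry per index path" in SQLite is sketched below; the actual implementation may differ:
-- Latest log entry for each distinct index path; these rows drive symlink recreation
select l.index_path, l.target_path, l.workflow_id, l.created_at
from index_log l
join (
  select index_path, max(created_at) as latest
  from index_log
  group by index_path
) newest
  on newest.index_path = l.index_path
  and newest.latest = l.created_at;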
CLI Workflow
Execution Flow
sequenceDiagram
actor User
participant CLI as sprocket run
participant DB as database.db
participant FS as Filesystem
participant Engine as Workflow Engine
User->>CLI: sprocket run yak_shaving.wdl --index-on "YakProject/2025/Fluffy"
CLI->>FS: Check for ./out/, create if doesn't exist
CLI->>FS: Check for database.db, create if doesn't exist
CLI->>DB: Connect to database.db (WAL mode)
CLI->>DB: INSERT invocation (submission_method=cli, created_by=$USER)
CLI->>DB: INSERT workflow (status=pending)
CLI->>FS: Create runs/yak_shaving/2025-11-07_143022123456/
CLI->>DB: UPDATE workflow (status=running)
CLI->>Engine: Execute workflow
Engine->>FS: Write execution files (stdout, stderr, outputs)
Engine-->>CLI: Execution complete
CLI->>DB: UPDATE workflow (status=completed, outputs=...)
CLI->>FS: Create index symlinks at YakProject/2025/Fluffy/
CLI-->>User: Workflow complete
The output directory location is resolved from command-line flags (--out-dir), environment variables (SPROCKET_OUTPUT_DIR), configuration file settings, or the default ./out/. Each CLI invocation generates a UUID for the run and creates its own invocation record, with created_by populated from the $USER environment variable. Using the actor-based architecture described in the Architecture Overview, the workflow manager actor manages a single workflow execution task for the duration of the CLI command. When a workflow completes successfully, database records are updated with outputs and a completion timestamp, and index symlinks are created if --index-on was specified. Failed workflows update the database with error information but do not create index symlinks, as there are no valid outputs to index.
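The database activity in this flow amounts to a handful of small statements. The sketch below is illustrative only: UUIDs are truncated, the username is a placeholder, and column lists are abridged to the columns described in this document.
-- One invocation record per CLI command
insert into invocations (id, submission_method, created_by)
values ('660e8400-...', 'cli', 'jdoe');

-- One workflow record per run, initially pending
insert into workflows (id, name, status, invocation_id, execution_dir, created_at)
values ('550e8400-...', 'yak_shaving', 'pending', '660e8400-...',
        'runs/yak_shaving/2025-11-07_143022123456', current_timestamp);

-- Transition to running once execution starts
update workflows set status = 'running', started_at = current_timestamp
where id = '550e8400-...';

-- On success, mark completed (the workflow outputs are also recorded; omitted here)
update workflows set status = 'completed', completed_at = current_timestamp
where id = '550e8400-...';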
Server Mode
The server provides an HTTP API for submitting workflows and querying run history.
Server Configuration
Execution Flow
sequenceDiagram
actor User
participant Server as sprocket server
participant DB as database.db
participant FS as Filesystem
participant Engine as Workflow Engine
Note over User,DB: Phase 1: Server Startup
User->>Server: sprocket server --out-dir ./out
Server->>FS: Check for ./out/, create if doesn't exist
Server->>FS: Check for database.db, create if doesn't exist
Server->>DB: Connect to database.db (WAL mode)
Server->>DB: INSERT invocation (submission_method=http, created_by=$USER)
Server->>Server: Start HTTP server
Server-->>User: Server running
Note over User,FS: Phase 2: Workflow Submission & Execution
User->>Server: POST /api/workflows {source, inputs, index_path}
Server->>DB: INSERT workflow (status=pending, invocation_id=...)
Server-->>User: 201 Created {id, status}
Server->>FS: Create runs/yak_shaving/2025-11-07_143022123456/
Server->>DB: UPDATE workflow (status=running)
Server->>Engine: Execute workflow (async)
Engine->>FS: Write execution files (stdout, stderr, outputs)
User->>Server: GET /api/workflows/{id}
Server->>DB: SELECT workflow WHERE id=...
Server-->>User: 200 OK {status: running, ...}
Engine-->>Server: Execution complete
Server->>DB: UPDATE workflow (status=completed, outputs=...)
Server->>FS: Create index symlinks if index_path provided
Note over User,FS: Phase 3: Viewing Completed Results
User->>Server: GET /api/workflows/{id}
Server->>DB: SELECT workflow WHERE id=...
Server-->>User: 200 OK {status: completed, outputs: {...}}
The server resolves the output directory location from command-line flags, environment variables, configuration file settings, or the default ./out/. On startup, it initializes the output directory and database if not present, or connects to an existing database in WAL mode for concurrent access. A single invocation record is created at server startup with submission_method = 'http', and all workflows submitted to that server instance share this invocation. Using the actor-based architecture described in the Architecture Overview, the workflow manager actor spawns independent Tokio tasks for each submitted workflow, enabling concurrent execution without blocking API requests. An optional concurrency limit may be configured to control maximum parallel workflow executions based on available system resources.
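Because every workflow submitted to a server instance shares that instance's invocation record, listing the runs belonging to one server session is a single query. The sketch below reuses the example invocation id from the API responses:
-- Workflows submitted through one server instance (one shared invocation)
select id, name, status, created_at
from workflows
where invocation_id = '660e8400-e29b-41d4-a716-446655440001'
order by created_at desc;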
REST API Endpoints
POST /api/workflows
Submit a new workflow for execution.
Request:
{
"source": "https://github.com/user/repo/yak_shaving.wdl",
"inputs": {
"yak_name": "Fluffy",
"style": "mohawk"
},
"index_path": "YakProject/2025/Fluffy" // Optional
}
Response: 201 Created
{
"id": "550e8400-e29b-41d4-a716-446655440000",
"status": "pending",
"created_at": "2025-11-07T14:30:22Z"
}
GET /api/workflows
Query workflow executions with optional filters.
Request:
GET /api/workflows?status=running&name=yak_shaving&limit=50
Response: 200 OK
{
"workflows": [
{
"id": "550e8400-e29b-41d4-a716-446655440000",
"name": "yak_shaving",
"status": "running",
"invocation_id": "660e8400-e29b-41d4-a716-446655440001",
"created_at": "2025-11-07T14:30:22Z",
"started_at": "2025-11-07T14:30:23Z"
}
]
}
GET /api/workflows/{id}
Retrieve detailed information about a specific workflow execution.
Request:
GET /api/workflows/550e8400-e29b-41d4-a716-446655440000
Response: 200 OK
{
"id": "550e8400-e29b-41d4-a716-446655440000",
"name": "yak_shaving",
"source": "https://github.com/user/repo/yak_shaving.wdl",
"status": "completed",
"invocation_id": "660e8400-e29b-41d4-a716-446655440001",
"inputs": {
"yak_name": "Fluffy",
"style": "mohawk"
},
"outputs": {
"final_photo": "runs/yak_shaving/2025-11-07_143022123456/calls/trim_and_style/attempts/0/work/final_photo.jpg"
},
"execution_dir": "runs/yak_shaving/2025-11-07_143022123456",
"created_at": "2025-11-07T14:30:22Z",
"started_at": "2025-11-07T14:30:23Z",
"completed_at": "2025-11-07T14:45:10Z"
}
The CLI and server can operate simultaneously on the same output directory. A workflow submitted via sprocket run appears in GET /api/workflows queries as soon as its database transaction commits. All workflows share the same database regardless of submission method.
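For example, a single query can list recent runs across both submission methods; this is a sketch, not a prescribed report:
-- Recent runs regardless of how they were submitted (cli or http)
select w.name, w.status, i.submission_method, i.created_by, w.created_at
from workflows w
join invocations i on i.id = w.invocation_id
order by w.created_at desc
limit 50;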
Rationale and Alternatives
Why SQLite?
The choice of SQLite as the provenance database is fundamental to achieving progressive disclosure. The database must support several key scenarios:
- Users running sprocket run workflow.wdl for the first time should get working provenance tracking without installing, configuring, or even thinking about a database.
- Multiple sprocket run processes must safely write to the database simultaneously, and a running sprocket server must be able to query historical runs created by the CLI while also accepting new submissions.
- Moving the output directory with mv or rsync should preserve all functionality without database reconfiguration or path updates.
- Database operations should add negligible overhead to workflow execution, which typically takes minutes to hours.
SQLite excels at meeting these requirements because it is embedded directly in the Sprocket binary. The database file is created automatically on first use with no user action required—there are no connection strings to configure, no server processes to start, no permissions to manage. The database is a single file (database.db) within the output directory, so moving the directory preserves the database without any connection reconfiguration. This filesystem-based portability is essential to the self-contained output directory design.
SQLite’s Write-Ahead Logging (WAL) mode enables multiple concurrent readers with a single writer at a time, where reads never block writes and vice versa. This is sufficient for workflow submission patterns, where writes are infrequent—one per workflow execution, taking milliseconds—compared to workflow execution times of minutes to hours. Even in high-throughput environments, workflow submission rates rarely exceed one per second, whereas SQLite in WAL mode handles roughly 400 write transactions per second with thousands of reads on modest hardware, and standard benchmarks report around 3,600 writes per second alongside 70,000 reads per second. Operational simplicity is another major advantage: no backup strategy is needed beyond ordinary filesystem backups, there are no version compatibility issues between client and server, and there is no network latency or connection pooling to manage.
While not a major driving factor, SQLite’s ubiquity should be considered for users who want to build custom tooling on top of the provenance database. Native language bindings exist for virtually every programming language and platform, and the file format is stable and well-documented. Users can query the database directly using standard SQLite clients, build custom analysis scripts in their preferred language, or integrate the database into dashboards without needing Sprocket-specific APIs.
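As an example of the kind of ad-hoc analysis this enables, the query below (a sketch, assuming the name, started_at, and completed_at columns described above) computes average wall-clock time per workflow name directly from the provenance database:
-- Average wall-clock runtime per workflow name, in seconds
select name,
       round(avg((julianday(completed_at) - julianday(started_at)) * 86400), 1)
         as avg_seconds
from workflows
where completed_at is not null and started_at is not null
group by name;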
Future Work: PostgreSQL Support
PostgreSQL support is planned as future work to enable enterprise-scale deployments. PostgreSQL offers better concurrent write performance through MVCC (Multi-Version Concurrency Control), remote database access over the network, more sophisticated query optimization for complex queries, and proven scalability to millions of records. These capabilities become valuable when organizations need to run multiple Sprocket server instances sharing a central database, or when workflow submission rates exceed SQLite’s single-writer throughput.
However, PostgreSQL requires separate server installation and configuration, violating the zero-configuration principle that makes Sprocket accessible to new users. Users must provision database infrastructure, manage connection strings with host, port, and credentials, and coordinate backups independently of the output directory. Network dependencies introduce new failure modes, and version compatibility between PostgreSQL server and Sprocket client becomes an operational concern. For the majority of use cases—single servers with infrequent writes and simple queries—PostgreSQL’s complexity outweighs its benefits.
The implementation strategy will introduce a database abstraction layer that keeps SQLite as the default while allowing users to opt into PostgreSQL via configuration. An automatic migration tool will transfer existing SQLite databases to PostgreSQL, preserving all workflow execution history and index logs. The filesystem-based output directory structure (runs/ and index/) will remain unchanged regardless of database backend, ensuring that users can migrate from SQLite to PostgreSQL without disrupting their existing workflows or file organization. This approach maintains progressive disclosure: users start with zero-configuration SQLite and migrate to PostgreSQL only when scale demands it.
Alternative: Embedded Key-Value Stores (RocksDB, LMDB)
Embedded key-value stores like RocksDB or LMDB are appealing because they share SQLite’s zero-configuration, embedded nature while offering exceptional performance. RocksDB appears to be the strongest option, achieving ~86,000 writes per second for random overwrites and ~137,000-189,000 reads per second, while LMDB is optimized for read performance with competitive write throughput. Though RocksDB bulk insertion has shown around one million writes per second in benchmarks, workflow metadata tracking performs multiple individual transactional writes per workflow (invocation inserts, workflow record inserts and updates, index log entries), making the random-overwrite figures more representative of this use case. These systems are purpose-built for high-throughput, append-heavy workloads and would handle workflow metadata operations effortlessly. However, they were rejected for this initial implementation because they complicate data access patterns without providing performance benefits that this workload actually needs.
The performance advantage of key-value stores is irrelevant for this use case. Workflow execution takes minutes to hours, while database writes complete in milliseconds regardless of the underlying storage engine. Even at high submission rates of one workflow per second, SQLite’s throughput of hundreds to thousands of transactions per second provides ample headroom. The bottleneck is never the database—it’s the workflow execution itself.
Key-value stores also lack the familiar SQL interface that users expect when querying execution history, though this is a secondary concern compared to the implementation and maintenance complexity they introduce. If future requirements reveal that SQLite’s single-writer limitation is a genuine bottleneck (which would require sustained submission rates of hundreds per second), key-value stores could be reconsidered as an embedded, file-based alternative. However, such rates are unrealistic for workflow engines, and if they ever materialize, PostgreSQL with its MVCC support would likely be a better fit for the access patterns involved.
Alternative: Filesystem Only (No Database)
A filesystem-only approach without a database was rejected because it creates filesystem stress and operational problems, particularly in HPC environments where Sprocket is likely to be deployed. This approach would store metadata in JSON files alongside workflow execution directories (e.g., runs/<workflow>/<timestamp>/metadata.json). While this has zero dependencies beyond the filesystem, is simple to implement, and is naturally portable, it introduces several critical issues.
Storing metadata as individual files creates inode exhaustion problems common in HPC environments, where each workflow execution would generate multiple small metadata files. HPC filesystems often have strict inode quotas, and large numbers of small files create bottlenecks on metadata servers that degrade performance for all users of the shared filesystem. Queries require walking directory trees and parsing potentially thousands of JSON files, putting additional stress on filesystem metadata operations—the exact workload that HPC storage administrators actively discourage. There’s no standardized query interface and no efficient indexing for common queries like “show all running workflows.” Monitoring tools must implement custom file-walking logic, and race conditions emerge when updating metadata from multiple processes without database-level transaction guarantees.
Drafts
The following are candidate RFCs that are being rendered for easy review. They may still be revised. For more information please see the associated pull request.