Make your commandline tool workflow-friendly

May 25, 2018

Update (May 2019): A paper incorporating the below considerations is published:
Björn A Grüning, Samuel Lampa, Marc Vaudel, Daniel Blankenberg,
"Software engineering for scientific big data analysis"
GigaScience, Volume 8, Issue 5, May 2019, giz054,
https://doi.org/10.1093/gigascience/giz054


There are a number of pitfalls that can make a commandline program really hard to integrate into a workflow (or "pipeline") framework. The reason is that many workflow tools use output file paths to keep track of the state of the tasks producing those files, for example to know which tasks have finished and can be skipped upon a re-run, and which have not.

To make the interaction between workflow tools and commandline programs as smooth as possible, commandline tools should generally avoid overly clever ways of specifying output file paths. The optimum is often to be as explicit as possible, while allowing the user (and thus the workflow tool) as much flexibility and control as possible over each output file path.
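To illustrate, here is a minimal sketch of such an explicit interface, using Python's argparse. The tool name, flags, and paths are all hypothetical; the point is simply that every output path is taken verbatim from the command line, so a workflow engine can decide exactly where each file ends up:

```python
import argparse

# Hypothetical tool "mytool": every output path is given explicitly on the
# command line, rather than being derived from the input name or hard-coded.
parser = argparse.ArgumentParser(prog="mytool")
parser.add_argument("--input", required=True, help="input data file")
parser.add_argument("--output-table", required=True,
                    help="exact path to write the result table to")
parser.add_argument("--output-plot", required=True,
                    help="exact path to write the plot image to")

# Example invocation, as a workflow engine might issue it. Note that the
# engine is free to pick any names it likes, including temporary ones:
args = parser.parse_args([
    "--input", "data.csv",
    "--output-table", "results/table.tsv.tmp",
    "--output-plot", "results/plot.png.tmp",
])
print(args.output_table)
```

Because the program never invents its own output names, the workflow tool can check for exactly the files it asked for when deciding what to re-run.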

Before jumping in, a disclaimer: my experience comes primarily from integrating tools into Galaxy (though quite a few years ago), Luigi/SciLuigi, and SciPipe, so I cannot speak for all tools. Exceptions might exist, due to other ways of handling file paths. Generally though, I expect the same principles to apply to the majority of workflow tools that integrate commandline programs writing to a POSIX file system. Let's dive in:

Recommendations for commandline interface designers

  1. Optimally, allow the user to fully customize the file name of every output file the tool generates.
    • Motivation: Workflow tools are often configured to detect whether a certain output from a tool already exists, so that a halted workflow can be restarted from already finished intermediate results. This works most robustly and correctly if the workflow tool can tell the commandline program exactly which file names to write to.
  2. If for some reason you cannot reasonably take the exact file name for every output file (such as when producing a very large number of outputs), but need to take just a file name pattern, do allow the user to specify the output folder in addition to the file name pattern.
    • Motivation: If the output folder cannot be specified, output files might be written directly to the folder where the workflow is running, which can clutter that folder and even accidentally overwrite existing files if there are any name clashes.
  3. Don't put limits on which file extensions (.tsv, .png, etc.) can be used for output files.
    • Motivation: Some workflow tools append their own extension, such as .tmp, to the original file name while the file is being written, and rename it to the final file name after completion, so as to guarantee atomic writes.
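To make point 3 concrete, here is a minimal sketch (in Python, with hypothetical names) of the temporary-file-plus-rename pattern that workflow tools use. A program handed `results.tsv.tmp` as its output path must accept that name even though it does not end in `.tsv`:

```python
import os
import tempfile

def atomic_write(path: str, data: str) -> None:
    """Write data to path via a temporary ".tmp" name, then rename into place.

    The output only appears under its final name once it is complete, so a
    restarted workflow never mistakes a half-written file for a finished one.
    """
    tmp_path = path + ".tmp"          # note the extra extension: "results.tsv.tmp"
    with open(tmp_path, "w") as f:
        f.write(data)
    os.replace(tmp_path, path)        # atomic rename on POSIX file systems

# Example: write a small TSV file atomically into a scratch folder.
out_dir = tempfile.mkdtemp()
out_file = os.path.join(out_dir, "results.tsv")
atomic_write(out_file, "a\tb\n1\t2\n")
```

If the commandline program rejects output names that lack an "approved" extension, this pattern breaks, and the workflow tool loses its guarantee that existing outputs are complete.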

For the workflow tools I have the most experience with, following these rules should make integrating your commandline program considerably easier, more robust, and more reproducible.

Do you have more suggestions? Please add them in the comments section!

Update (May 26): Björn Grüning maintains a more extensive list of recommendations, including some of the above points
Update (May 26): If you're interested, here is the twitter thread that sparked the idea of this post.
Update (May 27): Titus Brown has a blog post in the works, announced here, and discussions summarized here.
Update (May 27): Just learned that Julio Merino has a post series from 2013 on CLI design.