In our work on automating machine learning computations in cheminformatics with scientific workflow tools, I have come to realize something: dynamic scheduling in scientific workflow tools is important and sometimes badly needed.
What I mean is that it should be possible to schedule new tasks during the execution of a workflow, not just during its scheduling phase.
What is striking is that far from all workflow tools allow this. Many tools completely separate the execution of a workflow into two stages:
- Scheduling tasks to be run
- Executing those tasks
The case where we really needed this was for running machine learning algorithms on data sets of various sizes. To obtain optimal models, we first optimize the cost parameter of our training step by running a parameter sweep over a set of cost values.
The performance of training with the different cost values is then evaluated, and an optimal cost is chosen. And now comes the interesting part: we want to schedule a defined workflow with this newly selected cost value.
This is not easily possible in Luigi, though, even with our SciLuigi extension, since Luigi separates scheduling from execution, and since parameters such as a cost value are initialized at scheduling time. Thus, in a SciLuigi workflow, we cannot use a value resulting from a calculation to start the next task.
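To make the problem concrete, here is a minimal sketch of a two-stage scheduler in plain Python. It deliberately does not use Luigi's or SciLuigi's API; `Task`, `schedule_sweep`, `execute` and the toy error function are all hypothetical names made up for illustration. The point is only that the best cost becomes known after the plan has already been fixed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Task:
    """Hypothetical task whose parameters are fixed when it is created."""
    name: str
    cost: float
    run: Callable[[float], float]


def toy_error(cost: float) -> float:
    # Stand-in for "train with this cost value and measure the error".
    return (cost - 1.0) ** 2


def schedule_sweep(costs: List[float]) -> List[Task]:
    # Stage 1: every task, with every parameter value, is decided here.
    return [Task(f"train_cost_{c}", c, toy_error) for c in costs]


def execute(plan: List[Task]) -> Dict[float, float]:
    # Stage 2: run the fixed plan; no new tasks can be added at this point.
    return {task.cost: task.run(task.cost) for task in plan}


plan = schedule_sweep([0.1, 1.0, 10.0])
errors = execute(plan)
best_cost = min(errors, key=errors.get)

# best_cost only exists now, after execution has finished, so a
# "train the final model" task that needs it could not have been in `plan`.
print(best_cost)
```

In a scheme like this, the only way to use `best_cost` is to start a second round of scheduling and execution by hand.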
Of course, we found a workaround for this: we created a task that takes the chosen cost value and executes a shell command to start a separate Python process running that other part of the workflow. It works, but things are not closely integrated: we get extra overhead, and the separate workflow instance creates separate logging, audit files etc.
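The shape of that workaround can be sketched roughly as follows. This is not our actual code: `SECOND_STAGE` is a hypothetical one-line stand-in for the second part of the workflow, and `launch_second_stage` is a made-up helper name. The sketch assumes only Python's standard `subprocess` module.

```python
import subprocess
import sys

# Hypothetical stand-in for the second part of the workflow. In practice this
# would be a separate workflow script that takes the cost as an argument.
SECOND_STAGE = (
    "import sys; "
    "cost = float(sys.argv[1]); "
    "print(f'training final model with cost={cost}')"
)


def launch_second_stage(best_cost: float) -> str:
    """Start a separate Python process and pass it the chosen cost value."""
    result = subprocess.run(
        [sys.executable, "-c", SECOND_STAGE, str(best_cost)],
        capture_output=True,
        text=True,
        check=True,  # raise if the child process fails
    )
    return result.stdout.strip()


output = launch_second_stage(1.0)
print(output)
```

The child process knows nothing about the parent workflow, which is exactly why logging and audit information end up split in two.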
Thus, this is something I would like to see in the next workflow system I use: the ability to schedule new tasks continuously during the execution of a workflow.
Interestingly, this is a feature that comes for free in tools that adhere to the dataflow paradigm. In most dataflow tools, you have independently running processes that receive messages with input data, and that continuously schedule new tasks as they receive messages, until the system hands them a message telling them to shut down. In other words, "dynamic scheduling" is really just how dataflow systems work, which I find interesting. I think the dataflow system Nextflow works like this (correct me if I'm wrong, Paolo! :) ). And so does my little experiment in a pure Go workflow library, which I started hacking on out of frustration with some other tools a long time ago, although that one still lacks most other popular features ;)
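As a rough illustration of the dataflow idea, here is a small sketch using Python threads and queues. It is not how Nextflow or my Go library is implemented; threads stand in for independently running processes, and `SHUTDOWN`, `evaluate`, `select_best`, `train_final` and the toy error function are all names invented for this example.

```python
import queue
import threading

SHUTDOWN = object()  # sentinel message telling a process to stop


def toy_error(cost):
    # Stand-in for "train with this cost value and measure the error".
    return (cost - 1.0) ** 2


def evaluate(inbox, outbox):
    # Runs one evaluation per incoming cost value, for as long as values arrive.
    while True:
        cost = inbox.get()
        if cost is SHUTDOWN:
            outbox.put(SHUTDOWN)
            return
        outbox.put((cost, toy_error(cost)))


def select_best(inbox, outbox):
    # Collects (cost, error) pairs and emits the best cost at shutdown.
    best = None
    while True:
        msg = inbox.get()
        if msg is SHUTDOWN:
            if best is not None:
                outbox.put(best[0])
            outbox.put(SHUTDOWN)
            return
        if best is None or msg[1] < best[1]:
            best = msg


def train_final(inbox, results):
    # This task is triggered by a value that was computed during execution.
    while True:
        cost = inbox.get()
        if cost is SHUTDOWN:
            return
        results.append(f"final model trained with cost={cost}")


costs_q, errors_q, best_q = queue.Queue(), queue.Queue(), queue.Queue()
results = []
threads = [
    threading.Thread(target=evaluate, args=(costs_q, errors_q)),
    threading.Thread(target=select_best, args=(errors_q, best_q)),
    threading.Thread(target=train_final, args=(best_q, results)),
]
for t in threads:
    t.start()
for c in [0.1, 1.0, 10.0]:
    costs_q.put(c)
costs_q.put(SHUTDOWN)
for t in threads:
    t.join()
print(results)
```

There is no separate scheduling phase here: the final training step simply runs whenever a cost value arrives on its input queue, whether that value was known up front or computed a moment ago.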
I just had not realized how important this feature could be for very common use cases.