After yet another tour through a whole stack of Python workflow systems, I still can't find one that beats @nextflowio for #bioinformatics. Here's a short thread on the fatal flaws of each:
@ApacheAirflow: popular and elegant, but it still has very poor (if any) support for HPC execution, and it has no concept of platform-native file storage (S3 on AWS, local filesystem on HPC, etc.).
@dask_dev: a lovely minimal API with tight integrations for pandas and numpy, but this comes at the cost of explicit output caching (it may or may not decide to re-run any given task) and file handling.
@Toil_GI is always the first engine I try, and some big improvements have been made lately (like migrating to Python 3). But if you run a workflow to completion and then edit the workflow, it isn't able to cache the successful tasks and must rerun them all.
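To make concrete what I mean by caching that survives an edit, here is a minimal sketch in plain Python. It is not how Toil or Nextflow implement caching; `cached_task` and `count_bases` are names I made up for illustration, and hashing the bytecode is only a crude stand-in for "has this task changed?". The idea is that the cache key covers each task's own code and arguments, so editing one task leaves the cached results of the others intact.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


def cached_task(cache_dir):
    """Hypothetical decorator: store each task result on disk, keyed by code + arguments."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    def decorator(func):
        def wrapper(*args, **kwargs):
            code = func.__code__
            # Nested code objects are skipped because their repr includes a memory
            # address, which would change the key on every run.
            consts = [c for c in code.co_consts if not hasattr(c, "co_code")]
            digest = hashlib.sha256()
            digest.update(code.co_code)
            digest.update(repr((consts, code.co_names)).encode())
            digest.update(pickle.dumps((args, sorted(kwargs.items()))))
            path = cache_dir / f"{func.__name__}-{digest.hexdigest()}.pkl"
            if path.exists():
                # Same task code and same inputs: reuse the stored result.
                return pickle.loads(path.read_bytes())
            result = func(*args, **kwargs)
            path.write_bytes(pickle.dumps(result))
            return result
        return wrapper
    return decorator


calls = []  # records real executions, so the cache hit is visible
cache = tempfile.mkdtemp()


@cached_task(cache)
def count_bases(seq):
    calls.append(seq)
    return {base: seq.count(base) for base in "ACGT"}


first = count_bases("GATTACA")   # executes the task
second = count_bases("GATTACA")  # served from the on-disk cache
```

Because the key is per task rather than per workflow, adding or changing a different task elsewhere in the script would not touch the `count_bases` entry.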
@dagsterio is a relatively new player, and it comes with a clean declarative API and some neat new features like runtime type checking. Unfortunately it isn't very portable to HPC, and lacks the ability to cache dynamic tasks (e.g. scattering over each line in a file).
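For anyone unsure what "caching dynamic tasks" looks like, here is a rough sketch using only the standard library. It is not Dagster's or Nextflow's API; `gc_content`, `cached_gc` and `scatter_gather` are hypothetical names. The number of tasks is only known once the file has been read, and each per-line result is cached by the line's content, so adding a line re-runs just one task.

```python
import hashlib
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

executed = []  # records which lines were actually computed, for the demo below


def gc_content(seq):
    """Hypothetical per-line task: fraction of G/C bases in one sequence."""
    executed.append(seq)
    return sum(seq.count(base) for base in "GC") / len(seq)


def cached_gc(seq, cache_dir):
    # Key each shard by its content, so the cache survives a change in shard count.
    path = cache_dir / hashlib.sha256(seq.encode()).hexdigest()
    if path.exists():
        return float(path.read_text())
    value = gc_content(seq)
    path.write_text(repr(value))
    return value


def scatter_gather(input_file, cache_dir):
    # The number of tasks is only known here, after reading the file at runtime.
    lines = [ln.strip() for ln in input_file.read_text().splitlines() if ln.strip()]
    with ThreadPoolExecutor() as pool:
        # Scatter one task per line, then gather results in input order.
        return list(pool.map(lambda seq: cached_gc(seq, cache_dir), lines))


work = Path(tempfile.mkdtemp())
cache_dir = work / "cache"
cache_dir.mkdir()
reads = work / "reads.txt"

reads.write_text("GGCC\nATAT\n")
first = scatter_gather(reads, cache_dir)   # two tasks run

reads.write_text("GGCC\nATAT\nGATC\n")     # one new line appears
second = scatter_gather(reads, cache_dir)  # only the new line is computed
```

After the second call, `executed` holds three entries rather than five, since the first two lines were served from the cache.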
#snakemake relies heavily on the file-dependency idiom from make, which I have never found to suit my workflows. It also makes writing workflows very unintuitive (having to reason backwards from the goal), and dynamic scatter/gather is possible but very complicated.
Also while I'm here, the reason I like @nextflowio so much is that it ticks these boxes: supports tasks that produce files but also values, portable to HPC and cloud, backed by a real programming language you can import from, caches every task, and doesn't require a static DAG.
A few more I've looked at just now: @raydistributed actually does seem to have HPC support which is rare for newer engines, but sadly it doesn't have any mechanism for caching successful tasks between re-runs of a workflow.
@MetaflowOSS has a long-running issue with storing files at all, which is vital for bioinformatics: github.com/Netflix/metafl…. It has built-in support for S3, but if you want any other kind of storage you're out of luck.