The API design of mbtools makes it well suited for workflow managers, which is particularly useful for longer analysis chains. Let's look at an example workflow. We start with the configuration:
library(mbtools)
## Also loading:
## - dada2=1.12.1
## - data.table=1.12.6
## - ggplot2=3.2.1
## - magrittr=1.5
## - phyloseq=1.28.0
## - ShortRead=1.42.0
## - yaml=2.2.0
## Found tools:
## - minimap2=2.17-r941
## - slimm=0.3.4
## - samtools=1.9
##
## Attaching package: 'mbtools'
## The following object is masked _by_ 'package:BiocGenerics':
##
## normalize
## The following object is masked from 'package:graphics':
##
## layout
ref <- system.file("extdata/genomes/zymo_mock.fna.gz", package = "mbtools")
reads <- system.file("extdata/nanopore", package = "mbtools")
conf_p <- config_preprocess(
trimLeft = 10,
maxEE = Inf,
maxN = Inf,
truncQ = 0,
out_dir = "preprocessed"
)
conf_a <- config_align(
reference = ref,
alignment_dir = "alignments",
threads = 4,
use_existing = FALSE
)
conf_c <- config_count(
reference = ref,
weights = TRUE,
maxit = 1000,
threads = 4
)
The actual workflow then looks like this:
counting <- find_read_files(reads) %>%
quality_control(min_score = 20) %>%
preprocess(conf_p) %>%
align_long_reads(conf_a) %>%
count_references(conf_c)
This looks pretty clear, but it will run the entire workflow every time we execute the code. That is particularly costly if we only want to change some parameters of the last step, for instance to check what effect weighting has on the counts.
In this case we don't need to run preprocessing and aligning again. We could implement this with several if-statements and by saving the intermediate outputs, but this makes the code much more convoluted. It also would not tell us whether the input files changed.
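To see why hand-rolled caching gets convoluted, here is a minimal sketch in base R (`run_cached` and the temp files are hypothetical, not part of mbtools). Every step needs its own existence and timestamp bookkeeping, and this would have to be repeated for each stage of the chain:

```r
# Skip a step when its cached output is newer than all of its inputs.
# This is the kind of bookkeeping a workflow manager does for us.
run_cached <- function(cache_file, input_files, recompute) {
  cached_ok <- file.exists(cache_file) &&
    all(file.exists(input_files)) &&
    file.mtime(cache_file) >= max(file.mtime(input_files))
  if (cached_ok) {
    readRDS(cache_file)  # reuse the stored intermediate result
  } else {
    result <- recompute()
    saveRDS(result, cache_file)  # store it for the next run
    result
  }
}

# Usage: the second call reuses the cache and never recomputes.
input <- tempfile(); writeLines("reads", input)
cache <- tempfile(fileext = ".rds")
x <- run_cached(cache, input, function() 42)
y <- run_cached(cache, input, function() stop("should not rerun"))
```

Even this toy version only tracks file timestamps; it knows nothing about changed parameters or changed code.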
One popular workflow manager for R is drake, which automatically tracks what needs to be recomputed and what does not. It requires only minimal changes to mbtools workflows, mostly wrapping all input and output locations with file_in and file_out.
library(drake)
##
## Attaching package: 'drake'
## The following object is masked from 'package:ShortRead':
##
## clean
## The following object is masked from 'package:S4Vectors':
##
## expand
reads <- file_in("../inst/extdata/nanopore")
plan <- drake_plan(
files = find_read_files(reads),
qa = quality_control(files, min_score = 20),
qual_plot = ggsave(file_out("quals.png"), qa$quality_plot),
processed = preprocess(files, conf_p),
aligned = align_long_reads(processed, conf_a),
counting = count_references(aligned, conf_c)
)
We can check whether our plan makes sense:
clean()
dconf <- drake_config(plan)
vis_drake_graph(dconf)
As we can see, drake knows it has to run everything. Let's execute the plan and see what happens. Note the lock_envir argument, which is currently necessary.
make(plan, lock_envir = FALSE)
## target files
## target qa
## INFO [2019-11-05 08:21:21] Running quality assay for forward reads from 3 files.
## INFO [2019-11-05 08:21:34] Median per base error is 12.589% (median score = 9.00).
## INFO [2019-11-05 08:21:34] Mean per cycle entropy is 1.578 (in [0, 2]).
## INFO [2019-11-05 08:21:34] On average we have 1000.00 +- 0.00 reads per file.
## target processed
## INFO [2019-11-05 08:21:42] Preprocessing reads for 3 single-end samples...
## INFO [2019-11-05 08:21:47] 2.99e+03/3e+03 (99.53%) reads passed preprocessing.
## target qual_plot
## Target qual_plot messages:
## Saving 7.29 x 4.51 in image
## target aligned
## INFO [2019-11-05 08:21:48] Aligning 3 samples on 4 threads. Keeping up to 100 secondary alignments.
## INFO [2019-11-05 08:21:51] Finished aligning even1.
## INFO [2019-11-05 08:21:54] Finished aligning even2.
## INFO [2019-11-05 08:21:57] Finished aligning even3.
## target counting
## INFO [2019-11-05 08:21:57] Getting reference lengths from /Library/Frameworks/R.framework/Versions/3.6/Resources/library/mbtools/extdata/genomes/zymo_mock.fna.gz...
## INFO [2019-11-05 08:21:57] Normalized IDs. Starting counting...
We can access the final output, for instance with drake's readd():
head(readd(counting))
## reference counts effective_length sample
## 1: AE004091.2 3.193363 6261124 even1
## 2: AE006468.2 26.257417 4854170 even1
## 3: AE016830.1 120.312710 3214751 even1
## 4: AE017341.1 2.296111 19049682 even1
## 5: AL009126.3 156.701221 4212326 even1
## 6: AL591824.1 168.907323 2941248 even1
Now if we check our plan…
vis_drake_graph(dconf)
We see that everything is up-to-date. So running the plan again would do nothing:
make(plan)
## All targets are already up to date.
Now let's change the weighting for counting.
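For instance, we can rebuild the counting configuration with weighting turned off (a sketch; this reuses the config_count arguments from above, changing only weights):

```r
# Rebuild the counting config without weighting. Only the `counting`
# target depends on conf_c, so only that target becomes outdated.
conf_c <- config_count(
  reference = ref,
  weights = FALSE,  # changed from TRUE
  maxit = 1000,
  threads = 4
)
```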
drake detects that we only have to rerun the counting step.
outdated(dconf)
## [1] "counting"
We can do that by running the plan again.
make(plan, lock_envir = FALSE)
## target counting
## INFO [2019-11-05 08:21:58] Getting reference lengths from /Library/Frameworks/R.framework/Versions/3.6/Resources/library/mbtools/extdata/genomes/zymo_mock.fna.gz...
## INFO [2019-11-05 08:21:58] Normalized IDs. Starting counting...