The API design of mbtools makes it well suited for workflow managers, which is particularly useful for longer analysis chains. Let's look at an example workflow. We start with the configuration:
library(mbtools)
## Also loading:
## - dada2=1.12.1
## - data.table=1.12.6
## - ggplot2=3.2.1
## - magrittr=1.5
## - phyloseq=1.28.0
## - ShortRead=1.42.0
## - yaml=2.2.0
## Found tools:
## - minimap2=2.17-r941
## - slimm=0.3.4
## - samtools=1.9
##
## Attaching package: 'mbtools'
## The following object is masked _by_ 'package:BiocGenerics':
##
## normalize
## The following object is masked from 'package:graphics':
##
## layout
ref <- system.file("extdata/genomes/zymo_mock.fna.gz", package = "mbtools")
reads <- system.file("extdata/nanopore", package = "mbtools")
conf_p <- config_preprocess(
trimLeft = 10,
maxEE = Inf,
maxN = Inf,
truncQ = 0,
out_dir = "preprocessed"
)
conf_a <- config_align(
reference = ref,
alignment_dir = "alignments",
threads = 4,
use_existing = FALSE
)
conf_c <- config_count(
reference = ref,
weights = TRUE,
maxit = 1000,
threads = 4
)
The actual workflow then looks like this:
counting <- find_read_files(reads) %>%
quality_control(min_score = 20) %>%
preprocess(conf_p) %>%
align_long_reads(conf_a) %>%
count_references(conf_c)
This looks pretty clear, but it will run the entire workflow every time we execute the code. That is particularly costly if we only want to change some parameters of the last step, for instance to check what effect weighting has on the counts.
In this case we don't need to run preprocessing and aligning again. We could implement this with several if-statements and by saving the intermediate outputs, but this makes the code much more convoluted. It also would not tell us whether the input files changed.
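To see why hand-rolled caching gets convoluted, here is a minimal sketch in base R (`run_cached` and the temp files are hypothetical, not part of mbtools). Every step needs its own existence and timestamp bookkeeping, and this would have to be repeated for each stage of the chain:

```r
# Skip a step when its cached output is newer than all of its inputs.
# This is the kind of bookkeeping a workflow manager does for us.
run_cached <- function(cache_file, input_files, recompute) {
  cached_ok <- file.exists(cache_file) &&
    all(file.exists(input_files)) &&
    file.mtime(cache_file) >= max(file.mtime(input_files))
  if (cached_ok) {
    readRDS(cache_file)  # reuse the stored intermediate result
  } else {
    result <- recompute()
    saveRDS(result, cache_file)  # store it for the next run
    result
  }
}

# Usage: the second call reuses the cache and never recomputes.
input <- tempfile(); writeLines("reads", input)
cache <- tempfile(fileext = ".rds")
x <- run_cached(cache, input, function() 42)
y <- run_cached(cache, input, function() stop("should not rerun"))
```

Even this toy version only tracks file timestamps; it knows nothing about changed parameters or changed code.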
One popular workflow manager for R is drake, which automatically tracks what needs to be recomputed and what does not. It requires only minimal changes to mbtools workflows, mostly wrapping all input and output locations with file_in and file_out.
library(drake)
##
## Attaching package: 'drake'
## The following object is masked from 'package:ShortRead':
##
## clean
## The following object is masked from 'package:S4Vectors':
##
## expand
reads <- file_in("../inst/extdata/nanopore")
plan <- drake_plan(
files = find_read_files(reads),
qa = quality_control(files, min_score = 20),
qual_plot = ggsave(file_out("quals.png"), qa$quality_plot),
processed = preprocess(files, conf_p),
aligned = align_long_reads(processed, conf_a),
counting = count_references(aligned, conf_c)
)
We can check whether our plan makes sense:
clean()
dconf <- drake_config(plan)
vis_drake_graph(dconf)
As we can see, drake knows it has to run everything. Let's execute the plan and see what happens. Note the lock_envir argument, which is currently necessary.
make(plan, lock_envir = FALSE)
## target files
## target qa
## INFO [2019-11-05 08:21:21] Running quality assay for forward reads from 3 files.
## INFO [2019-11-05 08:21:34] Median per base error is 12.589% (median score = 9.00).
## INFO [2019-11-05 08:21:34] Mean per cycle entropy is 1.578 (in [0, 2]).
## INFO [2019-11-05 08:21:34] On average we have 1000.00 +- 0.00 reads per file.
## target processed
## INFO [2019-11-05 08:21:42] Preprocessing reads for 3 single-end samples...
## INFO [2019-11-05 08:21:47] 2.99e+03/3e+03 (99.53%) reads passed preprocessing.
## target qual_plot
## Target qual_plot messages:
## Saving 7.29 x 4.51 in image
## target aligned
## INFO [2019-11-05 08:21:48] Aligning 3 samples on 4 threads. Keeping up to 100 secondary alignments.
## INFO [2019-11-05 08:21:51] Finished aligning even1.
## INFO [2019-11-05 08:21:54] Finished aligning even2.
## INFO [2019-11-05 08:21:57] Finished aligning even3.
## target counting
## INFO [2019-11-05 08:21:57] Getting reference lengths from /Library/Frameworks/R.framework/Versions/3.6/Resources/library/mbtools/extdata/genomes/zymo_mock.fna.gz...
## INFO [2019-11-05 08:21:57] Normalized IDs. Starting counting...
We can access the final output, for instance with drake's readd():
head(readd(counting))
## reference counts effective_length sample
## 1: AE004091.2 3.193363 6261124 even1
## 2: AE006468.2 26.257417 4854170 even1
## 3: AE016830.1 120.312710 3214751 even1
## 4: AE017341.1 2.296111 19049682 even1
## 5: AL009126.3 156.701221 4212326 even1
## 6: AL591824.1 168.907323 2941248 even1
Now if we check our plan…
vis_drake_graph(dconf)
We see that everything is up-to-date. So running the plan again would do nothing:
make(plan)
## All targets are already up to date.
Now let's change the weighting for counting.
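For instance, we can rebuild the counting configuration with weighting turned off (a sketch; this reuses the config_count arguments from above, changing only weights):

```r
# Rebuild the counting config without weighting. Only the `counting`
# target depends on conf_c, so only that target becomes outdated.
conf_c <- config_count(
  reference = ref,
  weights = FALSE,  # changed from TRUE
  maxit = 1000,
  threads = 4
)
```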
drake detects that we only have to rerun the counting step.
outdated(dconf)
## [1] "counting"
We can do that by running the plan again.
make(plan, lock_envir = FALSE)
## target counting
## INFO [2019-11-05 08:21:58] Getting reference lengths from /Library/Frameworks/R.framework/Versions/3.6/Resources/library/mbtools/extdata/genomes/zymo_mock.fna.gz...
## INFO [2019-11-05 08:21:58] Normalized IDs. Starting counting...