Using parameters
The workflow files may have parameters in their header that allow you to change the behavior of the document without changing its code.
For example, consider this header:
```yaml
---
title: "tiltWorkflows"
params:
  output: "output"
  file: "example.csv"
---
```
Each parameter is an element of the `params` list:

```r
params
#> $output
#> [1] "output"
#>
#> $file
#> [1] "example.csv"
```
The workflow files use those parameters in many ways, for example, to create paths:

```r
file.path(params$output, params$file)
#> [1] "output/example.csv"
```
You can set your own parameters in RStudio. Open the document and click on “Knit with Parameters …”. Change the defaults as you need, then click on “Knit”.
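You can trigger the same parameters prompt without the Knit button via `rmarkdown::render()`; a minimal sketch, assuming the workflow file is named `profile_emissions.Rmd` (as in the examples below):

```r
# Prompts for parameter values before rendering
rmarkdown::render('profile_emissions.Rmd', params = "ask")
```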
Learn more about parametrized computational documents.
Sharing your work
You can share a link to the output “.md” file of each workflow by pasting its contents into a new GitHub gist, or by creating the gist from the terminal with the gh CLI, e.g.:
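```sh
# A sketch: create a gist from the rendered output; the file name
# assumes the workflow used in the examples below
gh gist create profile_emissions.md
```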
You can also share the input/, output/, and cache/ directories by compressing them into a .zip file, uploading it to an online drive, then sharing the link.
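For example, a minimal sketch using base R’s `utils::zip()` (the archive name is arbitrary):

```r
# Compress the shared directories into a single archive
utils::zip("tilt-results.zip", files = c("input", "output", "cache"))
```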
Handling large datasets and long run times
The tiltWorkflows package handles a large *companies* dataset by splitting it into chunks that fit in your computer’s memory, and running multiple chunks in parallel across multiple cores. The entire process may still take a long time, but tiltWorkflows caches intermediate results so you can interrupt the process at any point and resume later without recomputing completed chunks.
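In case it helps to picture the caching, here is a rough sketch of the general pattern (an illustration only, not tiltWorkflows’ actual API; `expensive_computation()` is a hypothetical stand-in):

```r
# Illustration of per-chunk caching (not the tiltWorkflows API).
# A resumed run skips any chunk that already has a cached result.
process_chunk <- function(chunk, id, cache_dir = "cache") {
  dir.create(cache_dir, showWarnings = FALSE)
  path <- file.path(cache_dir, paste0("chunk-", id, ".rds"))
  if (file.exists(path)) {
    return(readRDS(path))  # reuse the cached result
  }
  result <- expensive_computation(chunk)  # hypothetical stand-in
  saveRDS(result, path)
  result
}
```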
That may be enough for you, but you can still complete your job faster by running multiple processes in a complementary way, and avoid interruptions by working on a remote computer.
Running multiple processes in a complementary way
Multiple processes can cooperate and complete all chunks faster than is possible with a single process. For example, two processes can be up to twice as fast when they run the same workflow and feed the same cache, but one of them works through the chunks in reverse order with the parameter `order = 'rev'`.
You can run the same workflow from multiple processes in a number of ways:
- Launch multiple instances of RStudio and click on Knit in each.
- Write multiple .R files and run each as a background job in RStudio:
```r
# job1.R
rmarkdown::render('profile_emissions.Rmd')

# job2.R
rmarkdown::render('profile_emissions.Rmd', params = list(order = 'rev'))
```
- From multiple instances of the Terminal, use `Rscript` with an R expression:
```sh
# Terminal 1
Rscript -e "rmarkdown::render('profile_emissions.Rmd')"

# Terminal 2
Rscript -e "rmarkdown::render('profile_emissions.Rmd', params = list(order = 'rev'))"
```
For even faster results, use the parameter `order = 'sample'` and run as many processes as you want.
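For example, following the pattern of the terminal commands above:

```sh
# Terminal 3
Rscript -e "rmarkdown::render('profile_emissions.Rmd', params = list(order = 'sample'))"
```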
No interruptions
To avoid interruptions you can use a remote computer.
I typically compute in containers from the rocker/verse image running on a Docker droplet on DigitalOcean. For interactive work I use RStudio in the container. But for non-interactive, long-running scripts I use `tmux`: I `ssh` into a terminal on the remote server, start a `tmux` session, and use as many terminals as necessary to execute and monitor jobs inside the container.
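For example, a minimal sketch of that remote setup (the host, session, and container names are placeholders):

```sh
# Connect to the remote server (placeholder host)
ssh user@my-droplet

# Start or re-attach to a tmux session so jobs survive disconnects
tmux new-session -A -s jobs

# Run a workflow inside the running container (placeholder name)
docker exec my-container \
  Rscript -e "rmarkdown::render('profile_emissions.Rmd')"
```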