Using parameters
The workflow files may have parameters in their header that allow you to change the behavior of the document without changing its code.
For example, consider this header:
```yaml
---
title: "tiltWorkflows"
params:
  output: "output"
  file: "example.csv"
---
```
Each parameter is an element of the `params` list:

```r
params
#> $output
#> [1] "output"
#>
#> $file
#> [1] "example.csv"
```
The workflow files use those parameters in many ways, for example, to create paths:

```r
file.path(params$output, params$file)
#> [1] "output/example.csv"
```
You can set your own parameters in RStudio. Open the document and click on “Knit with Parameters …”. Change the defaults as you need, then click on “Knit”.
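You can trigger the same parameters prompt without the Knit button via `rmarkdown::render()`; a minimal sketch, assuming the workflow file is named `profile_emissions.Rmd` (as in the examples below):

```r
# Prompts for parameter values before rendering
rmarkdown::render('profile_emissions.Rmd', params = "ask")
```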
Learn more about parametrized computational documents.
Sharing your work
You can share a link to the output “.md” file of each workflow by pasting its contents into a new GitHub gist, or by creating the gist from the terminal with the gh CLI, e.g.:
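```sh
# A sketch: create a gist from the rendered output; the file name
# assumes the workflow used in the examples below
gh gist create profile_emissions.md
```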
You can also share the input/, output/, and cache/ directories by compressing them into a .zip file, uploading it to an online drive, then sharing the link.
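For example, a minimal sketch using base R’s `utils::zip()` (the archive name is arbitrary):

```r
# Compress the shared directories into a single archive
utils::zip("tilt-results.zip", files = c("input", "output", "cache"))
```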
Handling large datasets and long run times
The tiltWorkflows package handles a large *companies* dataset by splitting it into chunks that fit in your computer’s memory, and running multiple chunks in parallel across multiple cores. The entire process may still take a long time, but tiltWorkflows caches intermediate results so you can interrupt the process at any point and resume later without recomputing completed chunks.
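In case it helps to picture the caching, here is a rough sketch of the general pattern (an illustration only, not tiltWorkflows’ actual API; `expensive_computation()` is a hypothetical stand-in):

```r
# Illustration of per-chunk caching (not the tiltWorkflows API).
# A resumed run skips any chunk that already has a cached result.
process_chunk <- function(chunk, id, cache_dir = "cache") {
  dir.create(cache_dir, showWarnings = FALSE)
  path <- file.path(cache_dir, paste0("chunk-", id, ".rds"))
  if (file.exists(path)) {
    return(readRDS(path))  # reuse the cached result
  }
  result <- expensive_computation(chunk)  # hypothetical stand-in
  saveRDS(result, path)
  result
}
```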
That may be enough for you, but you can still complete your job faster by running multiple processes in a complementary way, and avoid interruptions by working on a remote computer.
Running multiple processes in a complementary way
Multiple processes can cooperate and complete all chunks faster than is possible with a single process. For example, two processes can be up to twice as fast when they run the same workflow and feed the same cache, but one of them works through the chunks in reverse order with the parameter `order = 'rev'`.
You can run the same workflow from multiple processes in a number of ways:
- Launch multiple instances of RStudio and click on Knit in each.
- Write multiple .R files and run each as a background job in RStudio:
```r
# job1.R
rmarkdown::render('profile_emissions.Rmd')

# job2.R
rmarkdown::render('profile_emissions.Rmd', params = list(order = 'rev'))
```
- From multiple instances of the Terminal, use `Rscript` with an R expression:
```sh
# Terminal 1
Rscript -e "rmarkdown::render('profile_emissions.Rmd')"

# Terminal 2
Rscript -e "rmarkdown::render('profile_emissions.Rmd', params = list(order = 'rev'))"
```
For even faster results, use the parameter `order = 'sample'` and run as many processes as you want.
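For example, following the pattern of the terminal commands above:

```sh
# Terminal 3
Rscript -e "rmarkdown::render('profile_emissions.Rmd', params = list(order = 'sample'))"
```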
No interruptions
To avoid interruptions you can use a remote computer.
I typically compute in containers from the rocker/verse image running on a Docker droplet on DigitalOcean. For interactive work I use RStudio in the container. But for non-interactive, long-running scripts I use `tmux`: I `ssh` into a terminal on the remote server, start a `tmux` session, and use as many terminals as necessary to execute and monitor jobs inside the container.
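For example, a minimal sketch of that remote setup (the host, session, and container names are placeholders):

```sh
# Connect to the remote server (placeholder host)
ssh user@my-droplet

# Start or re-attach to a tmux session so jobs survive disconnects
tmux new-session -A -s jobs

# Run a workflow inside the running container (placeholder name)
docker exec my-container \
  Rscript -e "rmarkdown::render('profile_emissions.Rmd')"
```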