Improving r2dii.match

How to work with big data, and benchmarks of a more efficient version of match_name()

Mauro Lepore https://github.com/maurolepore
07-18-2020

match_name() may run out of memory if your data is too big. Most data analysis software has a limit on how much data it can handle on given hardware. If your data is too big to run match_name() on your computer, consider using only an informative subset of the data or a more powerful computer. Here are some alternatives to consider:

  1. Feed match_name() data from only one sector, or part of one sector (see filter()).

  2. Feed match_name() only the loans that make up most of the outstanding credit or credit limit. You might need just 20% of the data to capture 80% of the credit; more data might not change the overall result.

  3. Run match_name() on a powerful computer on the cloud.

Soon we’ll show examples of approaches (1) and (2) above. Until then, you may want to watch RStudio’s webinar on Working with Big Data in R.
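
In the meantime, here is a minimal sketch of approaches (1) and (2) using dplyr. The toy loanbook and its column names (id_loan, sector, loan_size_outstanding) are assumptions for illustration, not the columns match_name() requires, and the 80% threshold is simply the figure mentioned above.

```r
library(dplyr)

# Toy loanbook; the column names are hypothetical stand-ins for your own data
loanbook <- data.frame(
  id_loan = paste0("L", 1:6),
  sector = c("power", "power", "cement", "power", "cement", "power"),
  loan_size_outstanding = c(500, 350, 100, 30, 15, 5)
)

# Approach (1): keep the loans of a single sector
power_only <- loanbook %>%
  filter(sector == "power")

# Approach (2): keep the largest loans that together reach about 80% of credit
largest_loans <- loanbook %>%
  arrange(desc(loan_size_outstanding)) %>%
  mutate(
    cumulative_share = cumsum(loan_size_outstanding) / sum(loan_size_outstanding)
  ) %>%
  # Keep a loan if the loans before it cover less than 80% of the total
  filter(dplyr::lag(cumulative_share, default = 0) < 0.8)

power_only
largest_loans
```

With real data, the same filters would be applied to your loanbook before passing it to match_name(). Whether a given subset is informative enough depends on your portfolio.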

However you use match_name(), it should use as little time and memory as reasonably possible. That is our goal. Here I compare two versions of match_name(): the version in development and the version on CRAN (r2dii.match 0.0.3). Compared to the version on CRAN, the version in development uses a small fraction of the time and memory. The rest of this post shows the benchmarks.


Packages:


library(bench)
library(devtools)
library(dplyr)
library(fs)
library(ggplot2)
library(r2dii.data)

I’ll use the names devel and cran to refer to the versions of match_name() that are, respectively, in development and on CRAN (r2dii.match 0.0.3).


# The older version on CRAN
packageVersion("r2dii.match")
#> [1] '0.0.3'
# Copy of match_name on CRAN
cran <- r2dii.match::match_name

# The newer version in development
suppressMessages(devtools::load_all(fs::path_home("git", "r2dii.match")))
packageVersion("r2dii.match")
#> [1] '0.0.3.9000'
# Copy of match_name in development
devel <- r2dii.match::match_name

The two versions have different source code:


# Confirm the two versions of `match_name` are different
identical(devel, cran)
#> [1] FALSE

Compared to the version on CRAN, the version in development takes less time: in the benchmark below its median run is 167.5 ms versus 1.5 s, roughly nine times faster. It also calls the expensive garbage collector less often, and at a more economical level. (I use check = FALSE because the outputs are not identical: they differ in the order of their rows, but once the rows are sorted the same way, both outputs are equivalent.)


benchmark <- bench::mark(
  check = FALSE,
  iterations = 30,
  matched_devel = matched_devel <- devel(loanbook_demo, ald_demo),
  matched_cran = matched_cran <- cran(loanbook_demo, ald_demo)
)

# No output means that the two expressions are indeed equivalent
testthat::expect_equivalent(
  matched_devel %>% arrange(across(names(.))),
  matched_cran %>% arrange(across(names(.)))
)

benchmark
#> # A tibble: 2 x 6
#>   expression         min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 matched_devel 155.36ms  167.5ms     5.60         NA     4.67
#> 2 matched_cran     1.11s     1.5s     0.685        NA     6.07

ggplot2::autoplot(benchmark)
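
To read a comparison like this as ratios rather than absolute times, bench can report each column relative to its smallest value. A minimal sketch follows; slow_sum() and fast_sum() are hypothetical toy functions standing in for the two versions of match_name(), so the numbers it produces say nothing about r2dii.match.

```r
library(bench)

# Hypothetical stand-ins for an older and a newer version of the same function
slow_sum <- function(x) {
  total <- 0
  for (xi in x) total <- total + xi
  total
}
fast_sum <- function(x) sum(x)

x <- as.numeric(1:1e5)

toy <- bench::mark(
  fast = fast_sum(x),
  slow = slow_sum(x),
  iterations = 10
)

# relative = TRUE divides each summary column by its minimum,
# so the fastest expression has a median of 1
relative <- summary(toy, relative = TRUE)
relative[c("expression", "median", "itr/sec")]
```

The same call, summary(benchmark, relative = TRUE), should work on the benchmark object above.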

Thanks to your feedback, match_name() is becoming more efficient. We expect to release the improved version on CRAN soon.

Citation

For attribution, please cite this work as

Lepore (2020, July 18). Data science at 2DII: Improving r2dii.match. Retrieved from https://2degreesinvesting.github.io/posts/2020-07-18-improving-r2dii-match/

BibTeX citation

@misc{lepore2020improving,
  author = {Lepore, Mauro},
  title = {Data science at 2DII: Improving r2dii.match},
  url = {https://2degreesinvesting.github.io/posts/2020-07-18-improving-r2dii-match/},
  year = {2020}
}