How to work with big data, and benchmarks of a more efficient version of
match_name() may run out of memory if your data is too big. Most software for data analysis has a limit to how much data it can handle with a given hardware. If your data is too big to run
match_name() on your computer, consider using only an informative subset of data or a more powerful computer. For example, here are some alternatives to consider:
match_name() with data of only one sector, or part of one sector (see
match_name() with data of only the loans that make up most of the credit limit or outstanding credit limit. You might need just 20% of the data to capture 80% of the credit; more data might not change the overall result.
match_name() on a powerful computer on the cloud.
Soon we’ll show examples of the approaches (1) and (2) above. Until then, you may want to watch RStudio’s webinar on Working with Big Data in R.
However you use
match_name(), it should use as little time and memory as it is reasonably possible. That is our goal. Here I compare two versions of
match_name(): the version in development versus the version on CRAN (r2dii.match 0.0.3). Compared to the version on CRAN, the version in development uses a small fraction of the time and memory. The rest of this post shows the benchmarks.
library(bench) library(devtools) library(dplyr) library(fs) library(ggplot2) library(r2dii.data)
I’ll use the names devel and cran to refer to the versions of
match_name() that are, respectively, in development and on CRAN (r2dii.match 0.0.3).
# The older version on CRAN packageVersion("r2dii.match") #>  '0.0.3' # Copy of match_name on CRAN cran <- r2dii.match::match_name # The newer version in development suppressMessages(devtools::load_all(fs::path_home("git", "r2dii.match"))) packageVersion("r2dii.match") #>  '0.0.3.9000' # Copy of match_name in development devel <- r2dii.match::match_name
Both versions have different source code:
# Confirm the two versions of `match_name` are different identical(devel, cran) #>  FALSE
Compared to the version on CRAN, the version in development takes less time. It calls the expensive garbage collector fewer times, and at a more economic level. (I use
check = FALSE because the output is not identical; the two outputs differ in the order of their rows, but if we reorder the rows in the same way, both outputs are equivalent.)
benchmark <- bench::mark( check = FALSE, iterations = 30, matched_devel = matched_devel <- devel(loanbook_demo, ald_demo), matched_cran = matched_cran <- cran(loanbook_demo, ald_demo) ) # No output means that the two expressions are indeed equivalent testthat::expect_equivalent( matched_devel %>% arrange(across(names(.))), matched_cran %>% arrange(across(names(.))) ) benchmark #> # A tibble: 2 x 6 #> expression min median `itr/sec` mem_alloc `gc/sec` #> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> #> 1 matched_devel 155.36ms 167.5ms 5.60 NA 4.67 #> 2 matched_cran 1.11s 1.5s 0.685 NA 6.07 ggplot2::autoplot(benchmark)
Thanks to your feedback,
match_name() is becoming more efficient. We expect to release the improved version on CRAN soon.