How to resolve memory issues
Users of the r2dii.match package reported that their R session crashed when they fed match_name() with big data. A recent post acknowledged the issue and promised examples of how to handle big data. This article shows one approach: feed match_name() with a sequence of small chunks of the loanbook dataset.
This example uses r2dii.match plus a few optional but convenient packages, including r2dii.data for example datasets.
# Packages
library(dplyr, warn.conflicts = FALSE)
library(fs)
library(vroom)
library(r2dii.data)
library(r2dii.match)
# Example datasets from the r2dii.data package
loanbook <- loanbook_demo
ald <- ald_demo
If the entire loanbook is too large, feed match_name() with smaller chunks, so that any call to match_name(this_chunk, ald) fits in memory. More chunks take longer to run but use less memory; you'll need to experiment to find the number of chunks that works best for you.
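One rough way to pick a starting point is to scale the number of chunks to the in-memory size of the loanbook. This is only a heuristic, not something r2dii.match provides; the 50 MB budget in target_bytes below is an arbitrary assumption you should tune for your machine.
# Heuristic only: aim for chunks of at most ~50 MB in memory
target_bytes <- 50 * 1024^2
n_chunks <- max(1L, ceiling(as.numeric(object.size(loanbook)) / target_bytes))
n_chunks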
Say you try three chunks. You can take the loanbook dataset and then use mutate() to add the new column chunk, which assigns each row to one of the chunks:
chunks <- 3
chunked <- loanbook %>% mutate(chunk = as.integer(cut(row_number(), chunks)))
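If you prefer, dplyr's ntile() expresses the same idea more directly. This is an equivalent alternative, not what the code above uses.
# Alternative: ntile() also splits rows into `chunks` groups of similar size
chunked <- loanbook %>% mutate(chunk = ntile(row_number(), chunks))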
The total number of rows in the entire loanbook equals the sum of the rows across chunks.
count(loanbook)
#> # A tibble: 1 x 1
#> n
#> <int>
#> 1 320
count(chunked, chunk)
#> # A tibble: 3 x 2
#> chunk n
#> <int> <int>
#> 1 1 107
#> 2 2 106
#> 3 3 107
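You can make that invariant explicit with a quick defensive check; if chunking ever drops or duplicates rows, this stops you before the expensive matching step.
# Defensive check: chunking must neither drop nor duplicate rows
stopifnot(sum(count(chunked, chunk)$n) == nrow(loanbook))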
For each chunk you need to repeat this process:
1. Match this chunk against the entire ald dataset.
2. If this chunk matched nothing, move on to the next chunk.
3. Else, save the result to a .csv file.
# This "output" directory is temporary; you may use any folder in your computer
out <- path(tempdir(), "output")
if (!dir_exists(out)) dir_create(out)
for (i in unique(chunked$chunk)) {
# 1. Match this chunk against the entire `ald` dataset.
this_chunk <- filter(chunked, chunk == i)
this_result <- match_name(this_chunk, ald)
# 2. If this chunk matched nothing, move to the next chunk
matched_nothing <- nrow(this_result) == 0L
if (matched_nothing) next
# 3. Else, save the result to a .csv file.
vroom_write(this_result, path(out, paste0(i, ".csv")))
}
The result is one .csv file per chunk.
dir_ls(out)
#> /tmp/Rtmp2sVXnU/output/1.csv /tmp/Rtmp2sVXnU/output/2.csv
#> /tmp/Rtmp2sVXnU/output/3.csv
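With many chunks, a single failure late in the run can waste a lot of time. A hedged variant of the same loop wraps each match in tryCatch() so a bad chunk is reported and skipped instead of aborting everything; this is an optional pattern, not something match_name() requires.
for (i in unique(chunked$chunk)) {
  this_chunk <- filter(chunked, chunk == i)
  # On error, warn and return NULL so the loop moves on to the next chunk
  this_result <- tryCatch(
    match_name(this_chunk, ald),
    error = function(e) {
      warning("Chunk ", i, " failed: ", conditionMessage(e))
      NULL
    }
  )
  if (is.null(this_result) || nrow(this_result) == 0L) next
  vroom_write(this_result, path(out, paste0(i, ".csv")))
}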
You can read and combine all files in one step with vroom().
matched <- vroom(dir_ls(out))
matched
#> # A tibble: 502 x 29
#> rowid id_loan id_direct_loant… name_direct_loa… id_intermediate…
#> <dbl> <chr> <chr> <chr> <chr>
#> 1 1 L1 C294 Yuamen Xinneng … <NA>
#> 2 3 L3 C292 Yuama Ethanol L… IP5
#> 3 3 L3 C292 Yuama Ethanol L… IP5
#> 4 5 L5 C305 Yukon Energy Co… <NA>
#> 5 5 L5 C305 Yukon Energy Co… <NA>
#> 6 6 L6 C304 Yukon Developme… <NA>
#> 7 6 L6 C304 Yukon Developme… <NA>
#> 8 8 L8 C303 Yueyang City Co… <NA>
#> 9 9 L9 C301 Yuedxiu Corp One IP10
#> 10 10 L10 C302 Yuexi County AA… <NA>
#> # … with 492 more rows, and 24 more variables:
#> # name_intermediate_parent_1 <chr>, id_ultimate_parent <chr>,
#> # name_ultimate_parent <chr>, loan_size_outstanding <dbl>,
#> # loan_size_outstanding_currency <chr>,
#> # loan_size_credit_limit <dbl>,
#> # loan_size_credit_limit_currency <chr>,
#> # sector_classification_system <chr>,
#> # sector_classification_input_type <chr>,
#> # sector_classification_direct_loantaker <dbl>, fi_type <chr>,
#> # flag_project_finance_loan <chr>, name_project <lgl>,
#> # lei_direct_loantaker <lgl>, isin_direct_loantaker <lgl>,
#> # chunk <dbl>, id_2dii <chr>, level <chr>, sector <chr>,
#> # sector_ald <chr>, name <chr>, name_ald <chr>, score <dbl>,
#> # source <chr>
The matched result should be similar to that of match_name(loanbook, ald). Your next steps are documented on the website of r2dii.match.
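When the whole loanbook does fit in memory (as the demo data does), you can sanity-check the chunked workflow against the direct call. The chunked result carries the helper chunk column, and the .csv round trip may change some column types, so compare shapes rather than expecting identical objects; this check is a sketch, not part of the original workflow.
# Sanity check, feasible only when the full loanbook fits in memory
direct <- match_name(loanbook, ald)
# The chunked result has the extra helper column `chunk`
setdiff(names(matched), names(direct))
# Row counts should agree
nrow(matched) == nrow(direct)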
I tested match_name() with datasets whose size on disk (as .csv files) was 20 MB for the loanbook dataset and 100 MB for the ald dataset. Feeding match_name() with the entire loanbook crashed my R session. But feeding it with a sequence of 30 chunks ran successfully in about 25 minutes; the combined result had over 10 million rows:
  sector       data
  --------------------------------
1 automotive   [2,644,628 × 15]
2 aviation     [377,200 × 15]
3 cement       [942,526 × 15]
4 oil and gas  [1,551,805 × 15]
5 power        [7,353,772 × 15]
6 shipping     [4,194,067 × 15]
7 steel        [15 × 15]
For attribution, please cite this work as
Lepore (2020, July 31). Data science at 2DII: Using `match_name()` with large loanbooks. Retrieved from https://2degreesinvesting.github.io/posts/2020-07-31-chunk-your-data/
BibTeX citation
@misc{lepore2020using,
  author = {Lepore, Mauro},
  title = {Data science at 2DII: Using `match_name()` with large loanbooks},
  url = {https://2degreesinvesting.github.io/posts/2020-07-31-chunk-your-data/},
  year = {2020}
}