Handling a large loanbook
If your loanbook is large, or if it lacks the columns `postcode` or `country`, your computer may run out of memory and crash. This article shows an alternative way to run the first step of the matching process: suggesting matches. You’ll check the quality of your loanbook before engaging in expensive computations, then match multiple small chunks of your loanbook one by one.
Read your `loanbook` and `tilt` datasets.
library(dplyr)
library(vroom)
library(fs)
# Attach the package providing example_file(), check_loanbook(), and suggest_match()
library(tilt.company.match)

# TODO: Replace with the /path/to/your/real/tilt.csv
tilt <- vroom(example_file("demo_tilt.csv"), show_col_types = FALSE)
# TODO: Replace with the /path/to/your/real/loanbook.csv
loanbook <- vroom(example_file("demo_loanbook.csv"), show_col_types = FALSE)
Check and fix your data first – before engaging in expensive computations.
loanbook %>% check_loanbook()
#> Found duplicate(s) on columns company_name, postcode, country of the data set.
#> ✖ Found for the company Peasant Peter, postcode: 01234, country: germany
#> ℹ Please check if these duplicates are intended and have an unique id.
The columns `postcode` and `country` will save you time but aren’t crucial. If you don’t have them, fake them and move on. For example, if you lack `postcode`, fake it with `mutate(loanbook, postcode = NA)`.
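For instance, assuming your loanbook lacks both optional columns, you could fake them at once. A minimal sketch with a toy loanbook (the company names are illustrative; `NA_character_` keeps the columns character-typed, like real postcodes and country names):

```r
library(dplyr)

# A toy loanbook lacking both optional columns
loanbook <- tibble(id = 1:2, company_name = c("Peasant Peter", "Bread Bakers"))

# Fake the missing columns so the matching step can run
loanbook <- loanbook %>%
  mutate(postcode = NA_character_, country = NA_character_)
```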
More chunks take longer to run but use less memory; you’ll need to experiment to find the number of chunks that works best for you.
chunks <- 3
chunked <- loanbook %>% mutate(chunk = as.integer(cut(row_number(), chunks)))
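To see how `cut()` assigns row numbers to roughly equal chunks, here is a minimal sketch with a toy 10-row table (the column name is illustrative):

```r
library(dplyr)

chunks <- 3
toy <- tibble(id = 1:10)
# Each row gets a chunk label in 1:3, splitting the rows into
# three roughly equal, contiguous groups
chunked_toy <- toy %>% mutate(chunk = as.integer(cut(row_number(), chunks)))
count(chunked_toy, chunk)
```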
# This "output" directory is temporary; you may use any folder in your computer
out <- path(tempdir(), "output")
dir_create(out)
# In your console this shows a progress bar
pb <- progress::progress_bar$new(total = chunks)
for (i in unique(chunked$chunk)) {
pb$tick()
# 1. Match this chunk against the entire `tilt` dataset.
this_chunk <- filter(chunked, chunk == i)
this_result <- suppressMessages(suggest_match(this_chunk, tilt))
# 2. If this chunk matched nothing, move to the next chunk
matched_nothing <- nrow(this_result) == 0L
if (matched_nothing) next
# 3. Else, save the result to a .csv file.
vroom_write(this_result, path(out, paste0(i, ".csv")), delim = ",")
}
Read the output of each chunk into a single data frame.
files <- dir_ls(out)
# You may read only some files, say the first n, with: `vroom(files[1:n])`
to_edit <- vroom(files, show_col_types = FALSE)
to_edit
#> # A tibble: 18 × 16
#> id company…¹ postc…² country misc_…³ chunk compa…⁴ id_tilt compa…⁵ misc_…⁶
#> <dbl> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl> <chr> <chr>
#> 1 1 Peasant … 01234 germany A 1 peasan… 1 Peasan… A
#> 2 1 Peasant … 01234 germany A 1 peasan… 2 Peasan… Z
#> 3 1 Peasant … 01234 germany A 1 peasan… 4 Peasan… B
#> 4 2 Peasant … 01234 germany Z 1 peasan… 1 Peasan… A
#> 5 2 Peasant … 01234 germany Z 1 peasan… 2 Peasan… Z
#> 6 2 Peasant … 01234 germany Z 1 peasan… 4 Peasan… B
#> 7 3 Peasant … 11234 germany Z 1 peasan… 3 Peasan… Z
#> 8 4 Peasant … 01234 germany Z 1 peasan… 4 Peasan… B
#> 9 4 Peasant … 01234 germany Z 1 peasan… 1 Peasan… A
#> 10 4 Peasant … 01234 germany Z 1 peasan… 2 Peasan… Z
#> 11 5 Bread Ba… 23456 germany C 2 breadb… 5 The Br… C
#> 12 6 Flower P… 34567 germany Z 2 flower… 7 Flower… F
#> 13 6 Flower P… 34567 germany Z 2 flower… 6 Flower… D
#> 14 7 Screwdri… 45678 germany D 2 screwd… NA NA NA
#> 15 8 Screwdri… 45678 germany Z 2 screwd… NA NA NA
#> 16 9 John Mei… 56789 germany E 3 johnme… 8 John a… E
#> 17 10 John Mei… 55555 germany Y 3 johnme… NA NA NA
#> 18 11 John Mei… 55555 norway Y 3 johnme… NA NA NA
#> # … with 6 more variables: company_alias_tilt <chr>, postcode_tilt <lgl>,
#> # country_tilt <lgl>, similarity <dbl>, suggest_match <lgl>,
#> # accept_match <lgl>, and abbreviated variable names ¹company_name,
#> # ²postcode, ³misc_info, ⁴company_alias, ⁵company_name_tilt, ⁶misc_info_tilt
Write it back to a single .csv.
to_edit %>% vroom_write("to_edit.csv", delim = ",")
Continue with steps 2 and 3 as usual.
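As a reminder, step 2 typically amounts to reviewing the suggested matches in to_edit.csv by hand (e.g. adjusting the `accept_match` column), and step 3 to reading the edited file back in. A minimal sketch, assuming the file name written above:

```r
library(vroom)

# After manually reviewing to_edit.csv and flagging accepted matches,
# read the edited file back in for the rest of the workflow
edited <- vroom("to_edit.csv", show_col_types = FALSE)
```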