Handling a large loanbook
If your loanbook is large, or if it lacks the columns `postcode` or `country`, your computer may run out of memory and crash. This article shows an alternative way to run the first step of the matching process: suggesting matches. You’ll check the quality of your loanbook before engaging in expensive computations, then match multiple small chunks of your loanbook one by one.
Read your `loanbook` and `tilt` datasets.
library(dplyr)
library(vroom)
library(fs)
# Attach the package providing example_file(), check_loanbook(), and suggest_match()
library(tilt.company.match)

# TODO: Replace with the /path/to/your/real/tilt.csv
tilt <- vroom(example_file("demo_tilt.csv"), show_col_types = FALSE)
# TODO: Replace with the /path/to/your/real/loanbook.csv
loanbook <- vroom(example_file("demo_loanbook.csv"), show_col_types = FALSE)
Check and fix your data first – before engaging in expensive computations.
loanbook %>% check_loanbook()
#> Found duplicate(s) on columns company_name, postcode, country of the data set.
#> ✖ Found for the company Peasant Peter, postcode: 01234, country: germany
#> ℹ Please check if these duplicates are intended and have an unique id.
The columns `postcode` and `country` will save you time but aren’t crucial. If you don’t have them, fake them and move on. For example, if you lack `postcode`, fake it with `mutate(loanbook, postcode = NA)`.
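For instance, assuming your loanbook lacks both optional columns, you could fake them at once. A minimal sketch with a toy loanbook (the company names are illustrative; `NA_character_` keeps the columns character-typed, like real postcodes and country names):

```r
library(dplyr)

# A toy loanbook lacking both optional columns
loanbook <- tibble(id = 1:2, company_name = c("Peasant Peter", "Bread Bakers"))

# Fake the missing columns so the matching step can run
loanbook <- loanbook %>%
  mutate(postcode = NA_character_, country = NA_character_)
```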
More chunks take longer to run but use less memory; you’ll need to experiment to find the number of chunks that works best for you.
chunks <- 3
chunked <- loanbook %>% mutate(chunk = as.integer(cut(row_number(), chunks)))
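To see how `cut()` assigns row numbers to roughly equal chunks, here is a minimal sketch with a toy 10-row table (the column name is illustrative):

```r
library(dplyr)

chunks <- 3
toy <- tibble(id = 1:10)
# Each row gets a chunk label in 1:3, splitting the rows into
# three roughly equal, contiguous groups
chunked_toy <- toy %>% mutate(chunk = as.integer(cut(row_number(), chunks)))
count(chunked_toy, chunk)
```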
# This "output" directory is temporary; you may use any folder in your computer
out <- path(tempdir(), "output")
dir_create(out)
# In your console this shows a progress bar
pb <- progress::progress_bar$new(total = chunks)
for (i in unique(chunked$chunk)) {
pb$tick()
# 1. Match this chunk against the entire `tilt` dataset.
this_chunk <- filter(chunked, chunk == i)
this_result <- suppressMessages(suggest_match(this_chunk, tilt))
# 2. If this chunk matched nothing, move to the next chunk
matched_nothing <- nrow(this_result) == 0L
if (matched_nothing) next
# 3. Else, save the result to a .csv file.
vroom_write(this_result, path(out, paste0(i, ".csv")), delim = ",")
}
Read the output of each chunk into a single data frame.
files <- dir_ls(out)
# You may read only some files, say the first n, with: `vroom(files[1:n])`
to_edit <- vroom(files, show_col_types = FALSE)
to_edit
#> # A tibble: 18 × 16
#> id company…¹ postc…² country misc_…³ chunk compa…⁴ id_tilt compa…⁵ misc_…⁶
#> <dbl> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl> <chr> <chr>
#> 1 1 Peasant … 01234 germany A 1 peasan… 1 Peasan… A
#> 2 1 Peasant … 01234 germany A 1 peasan… 2 Peasan… Z
#> 3 1 Peasant … 01234 germany A 1 peasan… 4 Peasan… B
#> 4 2 Peasant … 01234 germany Z 1 peasan… 1 Peasan… A
#> 5 2 Peasant … 01234 germany Z 1 peasan… 2 Peasan… Z
#> 6 2 Peasant … 01234 germany Z 1 peasan… 4 Peasan… B
#> 7 3 Peasant … 11234 germany Z 1 peasan… 3 Peasan… Z
#> 8 4 Peasant … 01234 germany Z 1 peasan… 4 Peasan… B
#> 9 4 Peasant … 01234 germany Z 1 peasan… 1 Peasan… A
#> 10 4 Peasant … 01234 germany Z 1 peasan… 2 Peasan… Z
#> 11 5 Bread Ba… 23456 germany C 2 breadb… 5 The Br… C
#> 12 6 Flower P… 34567 germany Z 2 flower… 7 Flower… F
#> 13 6 Flower P… 34567 germany Z 2 flower… 6 Flower… D
#> 14 7 Screwdri… 45678 germany D 2 screwd… NA NA NA
#> 15 8 Screwdri… 45678 germany Z 2 screwd… NA NA NA
#> 16 9 John Mei… 56789 germany E 3 johnme… 8 John a… E
#> 17 10 John Mei… 55555 germany Y 3 johnme… NA NA NA
#> 18 11 John Mei… 55555 norway Y 3 johnme… NA NA NA
#> # … with 6 more variables: company_alias_tilt <chr>, postcode_tilt <lgl>,
#> # country_tilt <lgl>, similarity <dbl>, suggest_match <lgl>,
#> # accept_match <lgl>, and abbreviated variable names ¹company_name,
#> # ²postcode, ³misc_info, ⁴company_alias, ⁵company_name_tilt, ⁶misc_info_tilt
Write it back to a single .csv.
to_edit %>% vroom_write("to_edit.csv", delim = ",")
Continue with steps 2 and 3 as usual.
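As a reminder, step 2 typically amounts to reviewing the suggested matches in to_edit.csv by hand (e.g. adjusting the `accept_match` column), and step 3 to reading the edited file back in. A minimal sketch, assuming the file name written above:

```r
library(vroom)

# After manually reviewing to_edit.csv and flagging accepted matches,
# read the edited file back in for the rest of the workflow
edited <- vroom("to_edit.csv", show_col_types = FALSE)
```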