Suggest matching companies in a loanbook and tilt datasets — suggest

This function suggests that a company in your loanbook is the same as a company in the tilt dataset when the similarity between their names meets all of these conditions:

It's the highest among all other candidates.
It's above the value set in the argument suggestion_threshold.
It's the only such highest value in the group defined by a combination of company_name x postcode -- to avoid duplicates.

Usage

suggest_match(
  loanbook,
  tilt,
  eligibility_threshold = 0.75,
  suggestion_threshold = 0.9
)

Arguments

loanbook: A loanbook dataframe like demo_loanbook.
tilt: A tilt dataframe like demo_tilt.
eligibility_threshold: Minimum value of similarity to keep a candidate match. Values under it are most likely false positives and thus dropped. This drastically reduce the number of candidates you'll need to validate manually. We believe this benefit outweighs the potential loss of a few true positives.
suggestion_threshold: Value of similarity above which a match may be suggested.

Value

A dataframe with:

All the columns from the loanbook dataset.
All the columns from the tilt dataset butthe columns id, company_name, postcode and country all get the suffix "_tilt".
New columns:
- company_alias
- company_alias_tilt
- similarity
- suggest_match
- accept_match. For each company in the loanbook matching candidates are arranged by descending similarity.

Details

This function calculates the similarity between a standardized alias of the company_name from the loanbook and tilt datasets. The standardized alias makes real matches more likely by applying common best practices in names matching. Complete similarity corresponds to 1, and complete dissimilarity corresponds to 0.

The columns postcode and country affect the quality of the matches and the amount of manual-validation work ahead:

If your loanbook has both postcode and country we match companies in that specific postcode and that specific country. You will likely match companies that are really the same (true positives) because it's unlikely that two companies with similar name will be located close to each other. This will cost you the minimum amount of manual-validation work ahead.
If your loanbook lacks postcode but has country we match companies in that specific country but across every postcode. You will possibly match companies that are not really the same (false positives) but happen to have a similar name and are located in the same country. This will cost you additional manual-validation work ahead.
If your loanbook has postcode but lacks country we match companies with the same postcode but across every country. You will possibly match companies that are not really the same (false positives) but happen to have a similar name and the same postcode. This will cost you additional manual-validation work ahead.
If your loanbook lacks both postcode and country we match companies across the entire dataset. You will most likely match companies that are not really the same (false positives). This will cost you the greatest amount of additional manual-validation work ahead.

Examples

library(vroom)
loanbook <- vroom(example_file("demo_loanbook.csv"), show_col_types = FALSE)
tilt <- vroom(example_file("demo_tilt.csv"), show_col_types = FALSE)

suggest_match(loanbook, tilt)
#> # A tibble: 18 × 15
#>       id compa…¹ postc…² country misc_…³ compa…⁴ id_tilt compa…⁵ misc_…⁶ compa…⁷
#>    <dbl> <chr>   <chr>   <chr>   <chr>   <chr>     <dbl> <chr>   <chr>   <chr>  
#>  1     1 Peasan… 01234   germany A       peasan…       1 Peasan… A       peasan…
#>  2     1 Peasan… 01234   germany A       peasan…       2 Peasan… Z       peasan…
#>  3     1 Peasan… 01234   germany A       peasan…       4 Peasan… B       peasan…
#>  4     2 Peasan… 01234   germany Z       peasan…       1 Peasan… A       peasan…
#>  5     2 Peasan… 01234   germany Z       peasan…       2 Peasan… Z       peasan…
#>  6     2 Peasan… 01234   germany Z       peasan…       4 Peasan… B       peasan…
#>  7     3 Peasan… 11234   germany Z       peasan…       3 Peasan… Z       peasan…
#>  8     4 Peasan… 01234   germany Z       peasan…       4 Peasan… B       peasan…
#>  9     4 Peasan… 01234   germany Z       peasan…       1 Peasan… A       peasan…
#> 10     4 Peasan… 01234   germany Z       peasan…       2 Peasan… Z       peasan…
#> 11     5 Bread … 23456   germany C       breadb…       5 The Br… C       thebre…
#> 12     6 Flower… 34567   germany Z       flower…       7 Flower… F       flower…
#> 13     6 Flower… 34567   germany Z       flower…       6 Flower… D       flower…
#> 14     7 Screwd… 45678   germany D       screwd…      NA NA      NA      NA     
#> 15     8 Screwd… 45678   germany Z       screwd…      NA NA      NA      NA     
#> 16     9 John M… 56789   germany E       johnme…       8 John a… E       johnja…
#> 17    10 John M… 55555   germany Y       johnme…      NA NA      NA      NA     
#> 18    11 John M… 55555   norway  Y       johnme…      NA NA      NA      NA     
#> # … with 5 more variables: postcode_tilt <chr>, country_tilt <chr>,
#> #   similarity <dbl>, suggest_match <lgl>, accept_match <lgl>, and abbreviated
#> #   variable names ¹company_name, ²postcode, ³misc_info, ⁴company_alias,
#> #   ⁵company_name_tilt, ⁶misc_info_tilt, ⁷company_alias_tilt