Suggest matching companies in a loanbook and tilt datasets
suggest_match.RdThis function suggests that a company in your loanbook is the same as a
company in the tilt dataset when the similarity between their names meets
all of these conditions:
It's the highest among all other candidates.
It's above the value set in the argument
suggestion_threshold.It's the only such highest value in the group defined by a combination of
company_namexpostcode-- to avoid duplicates.
Arguments
- loanbook
A
loanbookdataframe like demo_loanbook.- tilt
A
tiltdataframe like demo_tilt.- eligibility_threshold
Minimum value of
similarityto keep a candidate match. Values under it are most likely false positives and thus dropped. This drastically reduce the number of candidates you'll need to validate manually. We believe this benefit outweighs the potential loss of a few true positives.- suggestion_threshold
Value of
similarityabove which a match may be suggested.
Value
A dataframe with:
All the columns from the
loanbookdataset.All the columns from the
tiltdataset butthe columnsid,company_name,postcodeandcountryall get the suffix "_tilt".New columns:
company_aliascompany_alias_tiltsimilaritysuggest_matchaccept_match. For each company in theloanbookmatching candidates are arranged by descendingsimilarity.
Details
This function calculates the similarity between a standardized alias of the
company_name from the loanbook and tilt datasets. The standardized
alias makes real matches more likely by applying common best practices in
names matching. Complete similarity corresponds to 1, and complete
dissimilarity corresponds to 0.
The columns postcode and country affect the quality of the matches and
the amount of manual-validation work ahead:
If your
loanbookhas bothpostcodeandcountrywe match companies in that specificpostcodeand that specificcountry. You will likely match companies that are really the same (true positives) because it's unlikely that two companies with similar name will be located close to each other. This will cost you the minimum amount of manual-validation work ahead.If your
loanbooklackspostcodebut hascountrywe match companies in that specificcountrybut across everypostcode. You will possibly match companies that are not really the same (false positives) but happen to have a similar name and are located in the samecountry. This will cost you additional manual-validation work ahead.If your
loanbookhaspostcodebut lackscountrywe match companies with the samepostcodebut across everycountry. You will possibly match companies that are not really the same (false positives) but happen to have a similar name and the same postcode. This will cost you additional manual-validation work ahead.If your
loanbooklacks bothpostcodeandcountrywe match companies across the entire dataset. You will most likely match companies that are not really the same (false positives). This will cost you the greatest amount of additional manual-validation work ahead.
Examples
library(vroom)
loanbook <- vroom(example_file("demo_loanbook.csv"), show_col_types = FALSE)
tilt <- vroom(example_file("demo_tilt.csv"), show_col_types = FALSE)
suggest_match(loanbook, tilt)
#> # A tibble: 18 × 15
#> id compa…¹ postc…² country misc_…³ compa…⁴ id_tilt compa…⁵ misc_…⁶ compa…⁷
#> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
#> 1 1 Peasan… 01234 germany A peasan… 1 Peasan… A peasan…
#> 2 1 Peasan… 01234 germany A peasan… 2 Peasan… Z peasan…
#> 3 1 Peasan… 01234 germany A peasan… 4 Peasan… B peasan…
#> 4 2 Peasan… 01234 germany Z peasan… 1 Peasan… A peasan…
#> 5 2 Peasan… 01234 germany Z peasan… 2 Peasan… Z peasan…
#> 6 2 Peasan… 01234 germany Z peasan… 4 Peasan… B peasan…
#> 7 3 Peasan… 11234 germany Z peasan… 3 Peasan… Z peasan…
#> 8 4 Peasan… 01234 germany Z peasan… 4 Peasan… B peasan…
#> 9 4 Peasan… 01234 germany Z peasan… 1 Peasan… A peasan…
#> 10 4 Peasan… 01234 germany Z peasan… 2 Peasan… Z peasan…
#> 11 5 Bread … 23456 germany C breadb… 5 The Br… C thebre…
#> 12 6 Flower… 34567 germany Z flower… 7 Flower… F flower…
#> 13 6 Flower… 34567 germany Z flower… 6 Flower… D flower…
#> 14 7 Screwd… 45678 germany D screwd… NA NA NA NA
#> 15 8 Screwd… 45678 germany Z screwd… NA NA NA NA
#> 16 9 John M… 56789 germany E johnme… 8 John a… E johnja…
#> 17 10 John M… 55555 germany Y johnme… NA NA NA NA
#> 18 11 John M… 55555 norway Y johnme… NA NA NA NA
#> # … with 5 more variables: postcode_tilt <chr>, country_tilt <chr>,
#> # similarity <dbl>, suggest_match <lgl>, accept_match <lgl>, and abbreviated
#> # variable names ¹company_name, ²postcode, ³misc_info, ⁴company_alias,
#> # ⁵company_name_tilt, ⁶misc_info_tilt, ⁷company_alias_tilt