Suggest matching companies in a loanbook
and tilt
datasets
suggest_match.Rd
This function suggests that a company in your loanbook
is the same as a
company in the tilt
dataset when the similarity
between their names meets
all of these conditions:
It's the highest among all other candidates.
It's above the value set in the argument
suggestion_threshold
.It's the only such highest value in the group defined by a combination of
company_name
xpostcode
-- to avoid duplicates.
Arguments
- loanbook
A
loanbook
dataframe like demo_loanbook.- tilt
A
tilt
dataframe like demo_tilt.- eligibility_threshold
Minimum value of
similarity
to keep a candidate match. Values under it are most likely false positives and thus dropped. This drastically reduce the number of candidates you'll need to validate manually. We believe this benefit outweighs the potential loss of a few true positives.- suggestion_threshold
Value of
similarity
above which a match may be suggested.
Value
A dataframe with:
All the columns from the
loanbook
dataset.All the columns from the
tilt
dataset butthe columnsid
,company_name
,postcode
andcountry
all get the suffix "_tilt".New columns:
company_alias
company_alias_tilt
similarity
suggest_match
accept_match
. For each company in theloanbook
matching candidates are arranged by descendingsimilarity
.
Details
This function calculates the similarity between a standardized alias of the
company_name
from the loanbook
and tilt
datasets. The standardized
alias makes real matches more likely by applying common best practices in
names matching. Complete similarity corresponds to 1
, and complete
dissimilarity corresponds to 0
.
The columns postcode
and country
affect the quality of the matches and
the amount of manual-validation work ahead:
If your
loanbook
has bothpostcode
andcountry
we match companies in that specificpostcode
and that specificcountry
. You will likely match companies that are really the same (true positives) because it's unlikely that two companies with similar name will be located close to each other. This will cost you the minimum amount of manual-validation work ahead.If your
loanbook
lackspostcode
but hascountry
we match companies in that specificcountry
but across everypostcode
. You will possibly match companies that are not really the same (false positives) but happen to have a similar name and are located in the samecountry
. This will cost you additional manual-validation work ahead.If your
loanbook
haspostcode
but lackscountry
we match companies with the samepostcode
but across everycountry
. You will possibly match companies that are not really the same (false positives) but happen to have a similar name and the same postcode. This will cost you additional manual-validation work ahead.If your
loanbook
lacks bothpostcode
andcountry
we match companies across the entire dataset. You will most likely match companies that are not really the same (false positives). This will cost you the greatest amount of additional manual-validation work ahead.
Examples
library(vroom)
loanbook <- vroom(example_file("demo_loanbook.csv"), show_col_types = FALSE)
tilt <- vroom(example_file("demo_tilt.csv"), show_col_types = FALSE)
suggest_match(loanbook, tilt)
#> # A tibble: 18 × 15
#> id compa…¹ postc…² country misc_…³ compa…⁴ id_tilt compa…⁵ misc_…⁶ compa…⁷
#> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
#> 1 1 Peasan… 01234 germany A peasan… 1 Peasan… A peasan…
#> 2 1 Peasan… 01234 germany A peasan… 2 Peasan… Z peasan…
#> 3 1 Peasan… 01234 germany A peasan… 4 Peasan… B peasan…
#> 4 2 Peasan… 01234 germany Z peasan… 1 Peasan… A peasan…
#> 5 2 Peasan… 01234 germany Z peasan… 2 Peasan… Z peasan…
#> 6 2 Peasan… 01234 germany Z peasan… 4 Peasan… B peasan…
#> 7 3 Peasan… 11234 germany Z peasan… 3 Peasan… Z peasan…
#> 8 4 Peasan… 01234 germany Z peasan… 4 Peasan… B peasan…
#> 9 4 Peasan… 01234 germany Z peasan… 1 Peasan… A peasan…
#> 10 4 Peasan… 01234 germany Z peasan… 2 Peasan… Z peasan…
#> 11 5 Bread … 23456 germany C breadb… 5 The Br… C thebre…
#> 12 6 Flower… 34567 germany Z flower… 7 Flower… F flower…
#> 13 6 Flower… 34567 germany Z flower… 6 Flower… D flower…
#> 14 7 Screwd… 45678 germany D screwd… NA NA NA NA
#> 15 8 Screwd… 45678 germany Z screwd… NA NA NA NA
#> 16 9 John M… 56789 germany E johnme… 8 John a… E johnja…
#> 17 10 John M… 55555 germany Y johnme… NA NA NA NA
#> 18 11 John M… 55555 norway Y johnme… NA NA NA NA
#> # … with 5 more variables: postcode_tilt <chr>, country_tilt <chr>,
#> # similarity <dbl>, suggest_match <lgl>, accept_match <lgl>, and abbreviated
#> # variable names ¹company_name, ²postcode, ³misc_info, ⁴company_alias,
#> # ⁵company_name_tilt, ⁶misc_info_tilt, ⁷company_alias_tilt