name_* columns
R/match_name.R
match_name.Rd
match_name() scores the match between names in a loanbook dataset (columns can be name_direct_loantaker, name_intermediate_parent* and name_ultimate_parent) with names in an asset-level dataset (column name_company). The raw names are first internally transformed, and aliases are assigned. The similarity between aliases in each of the loanbook and ald datasets is scored using stringdist::stringsim().
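As a rough illustration of the scoring step, the sketch below compares a few made-up aliases with stringdist::stringsim(), using the same method and p values as the match_name() defaults. The alias strings and the pairs data frame are assumptions for illustration only. The internal code of match_name() may build and compare aliases differently.

```r
library(stringdist)

# Made-up aliases, not taken from loanbook_demo or ald_demo
alias_loanbook <- c("alpha energy ltd", "beta power co")
alias_ald <- c("alpha energy ltd", "betta power co", "gamma steel")

# Build every loanbook/ald combination of aliases
pairs <- expand.grid(
  alias = alias_loanbook,
  alias_ald = alias_ald,
  stringsAsFactors = FALSE
)

# Jaro-Winkler similarity in [0, 1]; identical strings score 1
pairs$score <- stringsim(pairs$alias, pairs$alias_ald, method = "jw", p = 0.1)

# Keep candidate matches at or above an example threshold of 0.8
candidates <- pairs[pairs$score >= 0.8, ]
print(candidates)
```

Scoring all combinations is the simplest approach for a small example. On real data the number of pairs grows quickly, which is one reason restricting comparisons (for example with by_sector) can help.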
match_name(
  loanbook,
  ald,
  by_sector = TRUE,
  min_score = 0.8,
  method = "jw",
  p = 0.1,
  overwrite = NULL,
  ...
)
Argument | Description |
---|---|
loanbook, ald | Data frames structured like r2dii.data::loanbook_demo and r2dii.data::ald_demo. |
by_sector | Should names only be compared if companies belong to the same sector? |
min_score | A number between 0 and 1 that sets the minimum score threshold. |
method | Method for distance calculation. One of the methods accepted by stringdist::stringsim(), such as "jw" (the default). |
p | Prefix factor for Jaro-Winkler distance. The valid range for p is 0 <= p <= 0.25. |
overwrite | A data frame used to overwrite the sector and/or name columns. |
... | Arguments passed on to stringdist::stringsim(). |
A data frame with the same groups (if any) and columns as loanbook, and the additional columns:
* id_2dii - an id used internally by match_name() to distinguish companies.
* level - the level of granularity at which the loan was matched (e.g. direct_loantaker or ultimate_parent).
* sector - the sector of the loanbook company.
* sector_ald - the sector of the ald company.
* name - the name of the loanbook company.
* name_ald - the name of the ald company.
* score - the score of the match (manually set this to 1 prior to calling prioritize() to validate the match).
* source - the source of the match (equal to loanbook unless the match is from overwrite).
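The manual validation mentioned for the score column could look roughly like the sketch below. The matched data frame and its company names are made up for illustration. In practice the input would be the output of match_name(), reviewed by hand before it is passed to prioritize().

```r
# Toy stand-in for the output of match_name(); names and scores are made up
matched <- data.frame(
  name = c("Alpha Energy Ltd", "Beta Power Co"),
  name_ald = c("Alpha Energy Limited", "Betta Power Company"),
  score = c(0.95, 0.85),
  stringsAsFactors = FALSE
)

# Suppose a reviewer confirms that the first pair is a true match:
# set its score to 1 to mark the match as validated
validated <- matched
is_confirmed <- validated$name == "Alpha Energy Ltd" &
  validated$name_ald == "Alpha Energy Limited"
validated$score[is_confirmed] <- 1

print(validated)
```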
The returned rows depend on the argument min_score and on the values in the column score for each loan:
* If any row has score equal to 1, match_name() returns all rows where score equals 1, dropping all other rows.
* If no row has score equal to 1, match_name() returns all rows where score is equal to or greater than min_score.
* If there is no match, the output is a 0-row tibble with the expected column names, for type stability.
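The row rules above can be sketched in base R as follows. The helper keep_rows() and the toy scored data frame are hypothetical and not part of the package. The sketch assumes that loans are identified by an id_loan column, as in the example output.

```r
# Hypothetical helper, for illustration only: applies the row rules per loan
keep_rows <- function(scored, min_score = 0.8) {
  # TRUE for every row of a loan that has at least one perfect score
  has_perfect <- ave(scored$score, scored$id_loan, FUN = max) == 1
  # Loans with a perfect score keep only those rows;
  # other loans keep rows at or above min_score
  keep <- ifelse(has_perfect, scored$score == 1, scored$score >= min_score)
  scored[keep, , drop = FALSE]
}

# Made-up scores for three loans
scored <- data.frame(
  id_loan = c("L1", "L1", "L2", "L2", "L3"),
  name_ald = c("a", "b", "c", "d", "e"),
  score = c(1, 0.9, 0.85, 0.5, 0.3),
  stringsAsFactors = FALSE
)

result <- keep_rows(scored)
print(result)

# When nothing passes, the result has 0 rows but the same columns
no_match <- keep_rows(scored[scored$id_loan == "L3", ])
print(names(no_match))
```

Loan L1 keeps only its perfect match, L2 keeps the row that reaches min_score, and L3 contributes no rows.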
The transformation process used to compare names between loanbook and ald datasets applies best practices commonly used in name matching algorithms:
Remove special characters.
Replace language specific characters.
Abbreviate certain names to reduce their importance in the matching.
Spell out numbers to increase their importance.
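To make these steps concrete, here is a minimal sketch of such a transformation in base R. The helper to_alias_sketch(), its character mapping, and its abbreviation rules are assumptions chosen for illustration. The actual rules used internally by match_name() are not shown in this page and may differ in content and order.

```r
# Hypothetical helper, for illustration only; not the package's internal code
to_alias_sketch <- function(x) {
  x <- tolower(x)

  # Replace language-specific characters (tiny example mapping)
  x <- gsub("\u00fc", "ue", x, fixed = TRUE) # u with umlaut
  x <- gsub("\u00e9", "e", x, fixed = TRUE)  # e with acute accent

  # Spell out digits one by one so that numbers survive the cleanup below
  digits <- c("zero", "one", "two", "three", "four",
              "five", "six", "seven", "eight", "nine")
  for (i in seq_along(digits)) {
    x <- gsub(as.character(i - 1), paste0(" ", digits[i], " "), x, fixed = TRUE)
  }

  # Remove special characters: anything that is not a letter or a space
  x <- gsub("[^a-z ]", " ", x)

  # Abbreviate common terms to reduce their weight in the comparison
  x <- gsub("\\blimited\\b", "ltd", x, perl = TRUE)
  x <- gsub("\\bcompany\\b", "co", x, perl = TRUE)

  # Collapse repeated spaces and trim the ends
  trimws(gsub(" +", " ", x))
}

alias <- to_alias_sketch("M\u00fcller Power-Company 2, Limited")
print(alias)
```

This sketch spells out each digit separately, so "12" becomes "one two". A fuller implementation would likely treat multi-digit numbers as a whole.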
This function ignores but preserves existing groups.
Other main functions:
prioritize()
library(r2dii.data)
library(r2dii.match) # provides match_name()

# Small data for examples
loanbook <- head(loanbook_demo, 50)
ald <- head(ald_demo, 50)

match_name(loanbook, ald)
#> # A tibble: 2 x 28
#>   id_loan id_direct_loant… name_direct_loa… id_intermediate… name_intermedia…
#>   <chr>   <chr>            <chr>            <chr>            <chr>
#> 1 L14     C296             Yuasfnjiang Ele… NA               NA
#> 2 L15     C295             Yuanbsaoshan Po… NA               NA
#> # … with 23 more variables: id_ultimate_parent <chr>,
#> #   name_ultimate_parent <chr>, loan_size_outstanding <dbl>,
#> #   loan_size_outstanding_currency <chr>, loan_size_credit_limit <dbl>,
#> #   loan_size_credit_limit_currency <chr>, sector_classification_system <chr>,
#> #   sector_classification_input_type <chr>,
#> #   sector_classification_direct_loantaker <dbl>, fi_type <chr>,
#> #   flag_project_finance_loan <chr>, name_project <lgl>,
#> #   lei_direct_loantaker <lgl>, isin_direct_loantaker <lgl>, id_2dii <chr>,
#> #   level <chr>, sector <chr>, sector_ald <chr>, name <chr>, name_ald <chr>,
#> #   score <dbl>, source <chr>, borderline <lgl>

match_name(loanbook, ald, min_score = 0.9)
#> # A tibble: 1 x 28
#>   id_loan id_direct_loant… name_direct_loa… id_intermediate… name_intermedia…
#>   <chr>   <chr>            <chr>            <chr>            <chr>
#> 1 L14     C296             Yuasfnjiang Ele… NA               NA
#> # … with 23 more variables: id_ultimate_parent <chr>,
#> #   name_ultimate_parent <chr>, loan_size_outstanding <dbl>,
#> #   loan_size_outstanding_currency <chr>, loan_size_credit_limit <dbl>,
#> #   loan_size_credit_limit_currency <chr>, sector_classification_system <chr>,
#> #   sector_classification_input_type <chr>,
#> #   sector_classification_direct_loantaker <dbl>, fi_type <chr>,
#> #   flag_project_finance_loan <chr>, name_project <lgl>,
#> #   lei_direct_loantaker <lgl>, isin_direct_loantaker <lgl>, id_2dii <chr>,
#> #   level <chr>, sector <chr>, sector_ald <chr>, name <chr>, name_ald <chr>,
#> #   score <dbl>, source <chr>, borderline <lgl>