match_name() scores the match between names in a loanbook dataset (columns can be name_direct_loantaker, name_intermediate_parent* and name_ultimate_parent) with names in an asset-level dataset (column name_company). The raw names are first internally transformed, and aliases are assigned. The similarity between aliases in each of the loanbook and ald datasets is scored using stringdist::stringsim().

match_name(
  loanbook,
  ald,
  by_sector = TRUE,
  min_score = 0.8,
  method = "jw",
  p = 0.1,
  overwrite = NULL,
  ...
)

Arguments

loanbook, ald

data frames structured like r2dii.data::loanbook_demo and r2dii.data::ald_demo.

by_sector

Should names only be compared if companies belong to the same sector?

min_score

A number between 0 and 1 that sets the minimum score threshold. A score of 1 is a perfect match.

method

Method for distance calculation. One of c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex"). See stringdist::stringdist-metrics.

p

Prefix factor for Jaro-Winkler distance. The valid range for p is 0 <= p <= 0.25. If p = 0, the Jaro distance is returned; match_name() defaults to p = 0.1. Applies only to method = "jw".

overwrite

A data frame used to overwrite the sector and/or name columns of a particular direct loantaker or ultimate parent. To overwrite only sector, the value in the name column should be NA, and vice versa. This data frame can be used to manually match loanbook companies to ald.

...

Arguments passed on to stringdist::stringsim().
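
To get a feel for how method and p affect a score, the sketch below calls stringdist::stringsim() directly on a pair of plain strings. It only illustrates the scoring function named above; it assumes the stringdist package is installed and it does not reproduce the alias transformation that match_name() applies first.

```r
library(stringdist)

# A classic textbook pair for Jaro and Jaro-Winkler similarity.
a <- "martha"
b <- "marhta"

# With p = 0 the plain Jaro similarity is returned.
jaro <- stringsim(a, b, method = "jw", p = 0)

# With p = 0.1 (the default of match_name()) a shared prefix raises the score.
jaro_winkler <- stringsim(a, b, method = "jw", p = 0.1)

# A different metric can rank the same pair quite differently.
levenshtein <- stringsim(a, b, method = "lv")

round(c(jaro = jaro, jaro_winkler = jaro_winkler, levenshtein = levenshtein), 4)
```

Here the Jaro-Winkler score is higher than the Jaro score because both strings start with "mar". With min_score = 0.8, this pair would pass under "jw" but not under "lv".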

Value

A data frame with the same groups (if any) and columns as loanbook, and the additional columns:

  • id_2dii - an id used internally by match_name() to distinguish companies

  • level - the level of granularity that the loan was matched at (e.g. direct_loantaker or ultimate_parent)

  • sector - the sector of the loanbook company

  • sector_ald - the sector of the ald company

  • name - the name of the loanbook company

  • name_ald - the name of the ald company

  • score - the score of the match (manually set this to 1 prior to calling prioritize() to validate the match)

  • source - the source of the match (equal to loanbook unless the match is from overwrite)

The returned rows depend on the argument min_score and on the value of the column score for each loan:

  • If any row has score equal to 1, match_name() returns all rows where score equals 1, dropping all other rows.

  • If no row has score equal to 1, match_name() returns all rows where score is equal to or greater than min_score.

  • If there is no match, the output is a 0-row tibble with the expected column names, for type stability.
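
The row-selection rule can be sketched in base R. This is a minimal illustration, not the package's implementation: pick_rows() is a hypothetical helper, the candidates data frame is made-up toy data, and applying the rule per id_loan is an assumption based on the phrase "for each loan".

```r
# Hypothetical helper, not part of r2dii.match: applies the rule to the
# candidate matches of a single loan.
pick_rows <- function(candidates, min_score = 0.8) {
  # Perfect matches win: keep only them and drop everything else.
  if (any(candidates$score == 1)) {
    return(candidates[candidates$score == 1, , drop = FALSE])
  }
  # Otherwise keep every candidate at or above the threshold.
  candidates[candidates$score >= min_score, , drop = FALSE]
}

# Toy candidate matches (made-up values for illustration).
candidates <- data.frame(
  id_loan = c("L1", "L1", "L2", "L2", "L3"),
  name_ald = c("a", "b", "c", "d", "e"),
  score = c(1, 0.95, 0.85, 0.7, 0.5),
  stringsAsFactors = FALSE
)

# Assumption: the rule is applied separately to each loan.
picked <- do.call(rbind, lapply(split(candidates, candidates$id_loan), pick_rows))
rownames(picked) <- NULL
picked
```

In this toy data, loan L1 keeps only its perfect match, L2 keeps the single candidate above 0.8, and L3 contributes no rows.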

Package options

r2dii.match.sector_classifications: Allows you to use your own sector_classifications instead of the default. This feature is experimental and may be dropped and/or become a new argument to match_name().

Assigning aliases

The transformation process used to compare names between loanbook and ald datasets applies best practices commonly used in name matching algorithms:

  • Remove special characters.

  • Replace language specific characters.

  • Abbreviate certain names to reduce their importance in the matching.

  • Spell out numbers to increase their importance.
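
The steps above can be sketched in base R. This is only an illustration of the general idea: to_alias_sketch() and its small lookup tables are hypothetical, and the output is not expected to equal the aliases that match_name() computes internally.

```r
# Hypothetical helper for illustration only; not the r2dii.match implementation.
to_alias_sketch <- function(x) {
  # Replace language-specific characters (small subset, lowercase forms only).
  from <- c("\u00e4", "\u00f6", "\u00fc", "\u00e9", "\u00f1")
  to <- c("a", "o", "u", "e", "n")
  for (i in seq_along(from)) {
    x <- gsub(from[i], to[i], x, fixed = TRUE)
  }
  x <- tolower(x)

  # Remove special characters, keeping letters, digits and spaces.
  x <- gsub("[^a-z0-9 ]", "", x)

  # Abbreviate common legal-form words to reduce their weight in the score.
  x <- gsub("\\blimited\\b", "ltd", x, perl = TRUE)
  x <- gsub("\\bcorporation\\b", "corp", x, perl = TRUE)

  # Spell out digits one by one to increase their weight in the score.
  digits <- c("zero", "one", "two", "three", "four",
              "five", "six", "seven", "eight", "nine")
  for (d in 0:9) {
    x <- gsub(as.character(d), paste0(" ", digits[d + 1], " "), x, fixed = TRUE)
  }

  # Collapse repeated spaces and trim the ends.
  trimws(gsub(" +", " ", x))
}

to_alias_sketch("M\u00fcller Energie 2 Limited.")
to_alias_sketch("Alpine Knits India Pvt. Limited")
```

After such a transformation, names that differ only in case, punctuation or legal-form spelling collapse to the same alias, so their similarity score is 1.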

Handling grouped data

This function ignores but preserves existing groups.

See also

Other main functions: prioritize()

Examples

library(r2dii.data)
library(tibble)

# Small data for examples
loanbook <- head(loanbook_demo, 50)
ald <- head(ald_demo, 50)

match_name(loanbook, ald)
#> # A tibble: 2 × 28
#>   id_loan id_direct_loantaker name_direct_loa… id_intermediate… name_intermedia…
#>   <chr>   <chr>               <chr>            <chr>            <chr>
#> 1 L14     C296                Yuasfnjiang Ele… NA               NA
#> 2 L15     C295                Yuanbsaoshan Po… NA               NA
#> # … with 23 more variables: id_ultimate_parent <chr>,
#> #   name_ultimate_parent <chr>, loan_size_outstanding <dbl>,
#> #   loan_size_outstanding_currency <chr>, loan_size_credit_limit <dbl>,
#> #   loan_size_credit_limit_currency <chr>, sector_classification_system <chr>,
#> #   sector_classification_input_type <chr>,
#> #   sector_classification_direct_loantaker <dbl>, fi_type <chr>,
#> #   flag_project_finance_loan <chr>, name_project <lgl>, …

match_name(loanbook, ald, min_score = 0.9)
#> # A tibble: 1 × 28
#>   id_loan id_direct_loantaker name_direct_loa… id_intermediate… name_intermedia…
#>   <chr>   <chr>               <chr>            <chr>            <chr>
#> 1 L14     C296                Yuasfnjiang Ele… NA               NA
#> # … with 23 more variables: id_ultimate_parent <chr>,
#> #   name_ultimate_parent <chr>, loan_size_outstanding <dbl>,
#> #   loan_size_outstanding_currency <chr>, loan_size_credit_limit <dbl>,
#> #   loan_size_credit_limit_currency <chr>, sector_classification_system <chr>,
#> #   sector_classification_input_type <chr>,
#> #   sector_classification_direct_loantaker <dbl>, fi_type <chr>,
#> #   flag_project_finance_loan <chr>, name_project <lgl>, …

# Use your own `sector_classifications`
your_classifications <- tibble(
  sector = "power",
  borderline = FALSE,
  code = "3511",
  code_system = "XYZ"
)

restore <- options(r2dii.match.sector_classifications = your_classifications)

loanbook <- tibble(
  sector_classification_system = "XYZ",
  sector_classification_direct_loantaker = "3511",
  id_ultimate_parent = "UP15",
  name_ultimate_parent = "Alpine Knits India Pvt. Limited",
  id_direct_loantaker = "C294",
  name_direct_loantaker = "Yuamen Xinneng Thermal Power Co Ltd"
)

ald <- tibble(
  name_company = "alpine knits india pvt. limited",
  sector = "power",
  alias_ald = "alpineknitsindiapvt ltd"
)

match_name(loanbook, ald)
#> # A tibble: 1 × 15
#>   sector_classificat… sector_classificatio… id_ultimate_par… name_ultimate_pare…
#>   <chr>               <chr>                 <chr>            <chr>
#> 1 XYZ                 3511                  UP15             Alpine Knits India…
#> # … with 11 more variables: id_direct_loantaker <chr>,
#> #   name_direct_loantaker <chr>, id_2dii <chr>, level <chr>, sector <chr>,
#> #   sector_ald <chr>, name <chr>, name_ald <chr>, score <dbl>, source <chr>,
#> #   borderline <lgl>

# Cleanup
options(restore)