The package r2dii.match helps you to match counterparties from a loanbook to companies in a physical-asset database. Each section below shows you how.
We use the package r2dii.match to access the most important functions you’ll learn about. We also use example datasets from the package r2dii.data, and optional but convenient functions from the packages dplyr and readr.
library(dplyr, warn.conflicts = FALSE)
library(r2dii.data)
library(r2dii.match)
We need two datasets: a “loanbook” and an “asset-level dataset” (ald). These should be formatted like: loanbook_demo
and ald_demo
(from the r2dii.data package).
A note on sector classification: Matches are preferred when the sector from the loanbook
matches the sector from the ald
. The loanbook
sector is determined internally using the sector_classification_system
and sector_classification_direct_loantaker
columns. Currently, we only allow a couple specific values for sector_classification_system
:
sector_classifications$code_system %>%
unique()
#> [1] "CNB" "GICS" "ISIC" "NACE" "NAICS" "SIC"
If you would like to use a different classification system, please raise an issue in r2dii.data and we can incorporate it.
loanbook_demo
#> # A tibble: 320 x 19
#> id_loan id_direct_loant… name_direct_loa… id_intermediate… name_intermedia…
#> <chr> <chr> <chr> <chr> <chr>
#> 1 L1 C294 Yuamen Xinneng … <NA> <NA>
#> 2 L2 C293 Yuamen Changyua… <NA> <NA>
#> 3 L3 C292 Yuama Ethanol L… IP5 Yuama Inc.
#> 4 L4 C299 Yudaksel Holdin… <NA> <NA>
#> 5 L5 C305 Yukon Energy Co… <NA> <NA>
#> 6 L6 C304 Yukon Developme… <NA> <NA>
#> 7 L7 C227 Yaugoa-Zapadnay… <NA> <NA>
#> 8 L8 C303 Yueyang City Co… <NA> <NA>
#> 9 L9 C301 Yuedxiu Corp One IP10 Yuedxiu Group
#> 10 L10 C302 Yuexi County AA… <NA> <NA>
#> # … with 310 more rows, and 14 more variables: id_ultimate_parent <chr>,
#> # name_ultimate_parent <chr>, loan_size_outstanding <dbl>,
#> # loan_size_outstanding_currency <chr>, loan_size_credit_limit <dbl>,
#> # loan_size_credit_limit_currency <chr>, sector_classification_system <chr>,
#> # sector_classification_input_type <chr>,
#> # sector_classification_direct_loantaker <dbl>, fi_type <chr>,
#> # flag_project_finance_loan <chr>, name_project <lgl>,
#> # lei_direct_loantaker <lgl>, isin_direct_loantaker <lgl>
ald_demo
#> # A tibble: 17,368 x 13
#> name_company sector technology production_unit year production
#> <chr> <chr> <chr> <chr> <int> <dbl>
#> 1 aba hydropo… power hydrocap MW 2013 133340.
#> 2 aba hydropo… power hydrocap MW 2014 131582.
#> 3 aba hydropo… power hydrocap MW 2015 129824.
#> 4 aba hydropo… power hydrocap MW 2016 128065.
#> 5 aba hydropo… power hydrocap MW 2017 126307.
#> 6 aba hydropo… power hydrocap MW 2018 124549.
#> 7 aba hydropo… power hydrocap MW 2019 122790.
#> 8 aba hydropo… power hydrocap MW 2020 121032.
#> 9 aba hydropo… power hydrocap MW 2021 119274.
#> 10 aba hydropo… power hydrocap MW 2022 117515.
#> # … with 17,358 more rows, and 7 more variables: emission_factor <dbl>,
#> # country_of_domicile <chr>, plant_location <chr>, is_ultimate_owner <lgl>,
#> # is_ultimate_listed_owner <lgl>, ald_timestamp <chr>,
#> # ald_emission_factor_unit <chr>
If you want to use loanbook_demo
and ald_demo
as template to create your own datasets, do this:
# Writting to current working directory
loanbook_demo %>%
readr::write_csv(path = "loanbook_demo.csv")
ald_demo %>%
readr::write_csv(path = "ald_demo.csv")
# Reading from current working directory
your_loanbook <- readr::read_csv("your_loanbook.csv")
your_ald <- readr::read_csv("your_ald.csv")
Here we continue to use the *_demo
datasets, pretending they contain the data of your own.
# WARNING: Skip this to avoid overwriting your data with our demo data
your_loanbook <- loanbook_demo
your_ald <- ald_demo
match_name()
scores the match between names in a loanbook dataset (lbk) and names in an asset-level dataset (ald). The names come from the columns name_direct_loantaker
, name_intermediate_parent_*
and name_ultimate_parent
of the loanbook dataset, and from the column name_company
of the a asset-level dataset. There can be any number of name_intermediate_parent_*
columns, where *
indicates the level up the corporate tree from direct_loantaker
.
The raw names are internally transformed applying best-practices commonly used in name matching algorithms, such as:
The similarity is then scored between the internally-transformed names of the loanbook against the ald. (For more information on the scoring algorithm used, see: stringdist::stringsim()
).
match_name(your_loanbook, your_ald)
#> # A tibble: 408 x 28
#> id_loan id_direct_loant… name_direct_loa… id_intermediate… name_intermedia…
#> <chr> <chr> <chr> <chr> <chr>
#> 1 L1 C294 Yuamen Xinneng … <NA> <NA>
#> 2 L3 C292 Yuama Ethanol L… IP5 Yuama Inc.
#> 3 L3 C292 Yuama Ethanol L… IP5 Yuama Inc.
#> 4 L5 C305 Yukon Energy Co… <NA> <NA>
#> 5 L5 C305 Yukon Energy Co… <NA> <NA>
#> 6 L6 C304 Yukon Developme… <NA> <NA>
#> 7 L6 C304 Yukon Developme… <NA> <NA>
#> 8 L8 C303 Yueyang City Co… <NA> <NA>
#> 9 L9 C301 Yuedxiu Corp One IP10 Yuedxiu Group
#> 10 L10 C302 Yuexi County AA… <NA> <NA>
#> # … with 398 more rows, and 23 more variables: id_ultimate_parent <chr>,
#> # name_ultimate_parent <chr>, loan_size_outstanding <dbl>,
#> # loan_size_outstanding_currency <chr>, loan_size_credit_limit <dbl>,
#> # loan_size_credit_limit_currency <chr>, sector_classification_system <chr>,
#> # sector_classification_input_type <chr>,
#> # sector_classification_direct_loantaker <dbl>, fi_type <chr>,
#> # flag_project_finance_loan <chr>, name_project <lgl>,
#> # lei_direct_loantaker <lgl>, isin_direct_loantaker <lgl>, id_2dii <chr>,
#> # level <chr>, sector <chr>, sector_ald <chr>, name <chr>, name_ald <chr>,
#> # score <dbl>, source <chr>, borderline <lgl>
match_name()
defaults to scoring matches between name strings that belong to the same sector. Using by_sector = FALSE
removes this limitation – increasing computation time, and the number of potentially incorrect matches to manually validate.
match_name(your_loanbook, your_ald, by_sector = FALSE) %>%
nrow()
#> [1] 673
# Compare
match_name(your_loanbook, your_ald, by_sector = TRUE) %>%
nrow()
#> [1] 408
min_score
allows you to minimum threshold score
.
matched <- match_name(your_loanbook, your_ald, min_score = 0.9)
range(matched$score)
#> [1] 0.9058824 1.0000000
If you are happy with the matching coverage achieved, proceed to the next step. Otherwise, you can manually add matches, not found automatically by match_name()
. To do this, manually inspect the ald
and find a company you would like to match to your loanbook. Once a match is found, use excel to write a .csv file similar to overwrite_demo
, where:
level
indicates the level that the manual match should be added to (e.g. direct_loantaker
)id_2dii
is the id of the loanbook company you would like to match (from the output of match_name()
)name
is the ald company you would like to manually link tosector
optionally you can also overwrite the sector.source
this can be used later to determine where all manual matches came from.
matched <- match_name(
your_loanbook, your_ald,
min_score = 0.9, overwrite = overwrite_demo
)
#> Warning: You should only overwrite a sector at the level of the 'direct
#> loantaker' (DL). If you overwrite a sector at the level of the 'ultimate
#> parent' (UP) you consequently overwrite all children of that sector,
#> which most likely is a mistake.
Notice the warning.
Write the output of match_name()
into a .csv file with:
# Writting to current working directory
matched %>%
readr::write_csv("matched.csv")
Compare, edit, and save the data manually:
name
and name_ald
manually to determine if the match is valid. Other information can be used in conjunction with just the names to ensure the two entities match (sector, internal information on the company structure, etc.)score
value to 1
.score
value to anything other than 1
.Re-read the edited file (validated) with:
# Reading from current working directory
valid_matches <- readr::read_csv("valid_matches.csv")
The validated dataset may have multiple matches per loan. Consider the case where a loan is given to “Acme Power USA”, a subsidiary of “Acme Power Co.”. There may be both “Acme Power USA” and “Acme Power Co.” in the ald
, and so there could be two valid matches for this loan. To get the best match only, use prioritize()
– it picks rows where score
is 1 and level
per loan is of highest priority()
:
# Pretend we validated the matched dataset
valid_matches <- matched
some_interesting_columns <- vars(id_2dii, level, score)
valid_matches %>%
prioritize() %>%
select(!!!some_interesting_columns)
#> # A tibble: 217 x 3
#> id_2dii level score
#> <chr> <chr> <dbl>
#> 1 DL294 direct_loantaker 1
#> 2 DL304 direct_loantaker 1
#> 3 DL297 direct_loantaker 1
#> 4 DL287 direct_loantaker 1
#> 5 DL286 direct_loantaker 1
#> 6 DL285 direct_loantaker 1
#> 7 DL283 direct_loantaker 1
#> 8 DL282 direct_loantaker 1
#> 9 DL281 direct_loantaker 1
#> 10 DL280 direct_loantaker 1
#> # … with 207 more rows
By default, highest priority refers to the most granular match (direct_loantaker
). The default priority is set internally via prioritize_levels()
.
prioritize_level(matched)
#> [1] "direct_loantaker" "intermediate_parent_1" "ultimate_parent"
You may use a different priority. One way to do that is to pass a function to priority
. For example, use rev
to reverse the default priority.
matched %>%
prioritize(priority = rev) %>%
select(!!!some_interesting_columns)
#> # A tibble: 217 x 3
#> id_2dii level score
#> <chr> <chr> <dbl>
#> 1 UP288 ultimate_parent 1
#> 2 UP104 ultimate_parent 1
#> 3 UP83 ultimate_parent 1
#> 4 UP163 ultimate_parent 1
#> 5 UP138 ultimate_parent 1
#> 6 UP32 ultimate_parent 1
#> 7 UP81 ultimate_parent 1
#> 8 UP269 ultimate_parent 1
#> 9 UP69 ultimate_parent 1
#> 10 UP3 ultimate_parent 1
#> # … with 207 more rows