Get started (code) • tilt.company.match

This article is identical to “Get started” but the text is commented out so you can copy-paste and run it as a single .R script.
#' ---
#' title: "tilt.company.match"
#' editor_options: 
#'   chunk_output_type: console
#' output: github_document
#' ---
#' 
## ---- include = FALSE---------------------------------------------------------
knitr::opts_chunk$set(
  cache = TRUE,
  collapse = TRUE,
  comment = "#>"
)

#' 
#' This document explains how to match companies in a loanbook from a bank to
#' companies in a tilt dataset from 2 Degrees Investing Initiative.
#' 
#' ## Considerations
#' 
#' * The company's name may be different in each dataset, for example because of typos.
#' * The company's country and postcode are useful. A match by name, country, and postcode is more
#' reliable than a match by name alone.
#' * The company may not exist in the tilt dataset.
#' 
#' ## Data requirements
#' 
#' Both your loanbook and tilt dataset must meet these requirements:
#' 
#' * It's an R dataframe.
#' * It has the columns `id`, `company_name`, `postcode`, and `country`.
#' * It may or may not have other columns.
#' * The column `id` holds unique row-identifiers.
#' 
#' ## System requirements
#' 
#' * [Install R and RStudio](https://happygitwithr.com/install-r-rstudio.html).
#' * [Setup an R build toolchain](https://r-pkgs.org/setup.html#setup-tools).
#' 
#' ## 1. In R (session 1)
#' 
#' In this first R session you'll create a dataset with candidate matches for the
#' companies in your loanbook.
#' 
#' ### Install tilt.company.match
#' 
#' You can install
#' [tilt.company.match](https://github.com/2DegreesInvesting/tilt.company.match)
#' from [r-universe](https://r-universe.dev/) with:
#' 
#' ```r
#' options(repos = c("https://2degreesinvesting.r-universe.dev", getOption("repos")))
#' install.packages("tilt.company.match")
#' ```
#' 
#' Or you can install it from [GitHub](https://github.com/) with:
#' 
#' ``` r
#' # install.packages("devtools")
#' devtools::install_github("2DegreesInvesting/tilt.company.match")
#' ```
#' 
#' Call this function if things seem weird and you're not sure what's wrong or how
#' to fix it.
#' 
#' ```r
#' devtools::dev_sitrep()
#' ```
#' 
#' ### Use R packages
#' 
#' Installing tilt.company.match should automatically install other useful packages. You can use them all with:
#' 
## -----------------------------------------------------------------------------
library(dplyr, warn.conflicts = FALSE)
library(vroom, warn.conflicts = FALSE)
library(stringdist)
library(tilt.company.match)

#' 
#' ### Read the datasets
#' 
#' This example uses "demo" datasets. You should use your real `loanbook` and `tilt` datasets instead.
#' 
## -----------------------------------------------------------------------------
# TODO: Replace with the path/to/your/real/loanbook.csv
loanbook_csv <- example_file("demo_loanbook.csv")
loanbook_csv

loanbook <- vroom(loanbook_csv, show_col_types = FALSE)
loanbook

# TODO: Replace with the path/to/your/real/tilt.csv
tilt_csv <- example_file("demo_tilt.csv")
tilt_csv

tilt <- vroom(tilt_csv, show_col_types = FALSE)
tilt

#' 
#' ### Check data quality
#' 
#' Let's first check you `loanbook` is as we expect.
#' 
#' #### Expected columns
#' 
#' Do you have the expected columns `id`, `company_name`, `postcode`, and `country`?
#' 
## -----------------------------------------------------------------------------
expected <- c("id", "company_name", "postcode", "country")
loanbook %>% check_crucial_names(expected)

# Anything different throws an error
bad <- rename(loanbook, ids = id)
bad %>%
  check_crucial_names(expected) %>%
  try()

#' 
#' #### Duplicates
#' 
#' Do you have any duplicates in the column `id`?
#' 
## -----------------------------------------------------------------------------
has_no_duplicates <- identical(anyDuplicated(loanbook$id), 0L)
# If you get an error, remove the duplicates and try again
stopifnot(has_no_duplicates)

#' 
#' Do you have duplicates in `company_name`, `postcode` or `country`?
#' 
#' It's best if there is none. But if you find duplicates and they belong to
#' different companies, then you don't have to fix them.
#' 
## -----------------------------------------------------------------------------
best_without_duplicates <- c("company_name", "postcode", "country")
report_duplicates(loanbook, best_without_duplicates)

#' 
#' For example, here the column `misc_info` suggests the duplicates belong to
#' different companies, so it's OK:
#' 
## -----------------------------------------------------------------------------
loanbook %>%
  filter(company_name == "Peasant Peter") %>%
  filter(postcode == "01234")

#' 
#' #### Missing values
#' 
#' Do you have missing values (`NA`s) in non-nullable columns?
#' 
#' Non-nullable columns must not have missing values. If they do you have to remove
#' them. Missing values in other columns are fine.
#' 
## -----------------------------------------------------------------------------
non_nullable <- c("id", "company_name")
loanbook %>% abort_if_incomplete(non_nullable)

#' 
#' For example, here the non-nullable `id` column has one missing value:
#' 
## -----------------------------------------------------------------------------
bad_loanbook <- tribble(
  ~id, ~company_name, ~postcode, ~country, ~misc_info,
  NA, "John Meier's Groceries", "55555", "germany", "Y",
  11, "John Meier's Groceries", "55555", "norway", "Y"
)
bad_loanbook %>%
  abort_if_incomplete(non_nullable) %>%
  try()

fixed_loanbook <- filter(bad_loanbook, !is.na(id))
# NA's are OK in columns other than non-nullable ones
fixed_loanbook

fixed_loanbook %>% abort_if_incomplete(non_nullable)

#' 
#' ### Create a standard alias of `company_name` in both datasets
#' 
#' Use `to_alias()` to reduce the chance you'll miss a match because of spurious
#' differences in the company name between the loanbook and tilt dataset. This
#' helps you get a less noisy, more consistent version of `company_name` in each of
#' the two datasets.
#' 
## -----------------------------------------------------------------------------
loanbook_alias <- loanbook %>% mutate(company_alias = to_alias(company_name))
loanbook_alias

tilt_alias <- tilt %>% mutate(company_alias = to_alias(company_name))
tilt_alias

#' 
#' ### Match candidates
#' 
#' To inform the decision about which companies in your `loanbook` match companies in the `tilt`
#' dataset, we compare the values in the columns `postcode` and `country`:
#' 
#' * If your `loanbook` has both `postcode` and `country` we match companies in that specific `postcode` and that specific `country`. You will likely match companies that are really the same (true positives) because it's unlikely that two companies with similar name will be located close to each other. This will cost you the minimum amount of manual-validation work ahead.
#' 
## -----------------------------------------------------------------------------
lacks_none <- loanbook_alias %>%
  filter(!is.na(postcode) & !is.na(country)) %>%
  left_join(
    tilt_alias,
    by = c("country", "postcode"),
    suffix = c("", "_tilt"),
    multiple = "all"
  )

#' 
#' * If your `loanbook` lacks `postcode` but has `country` we match companies in that specific `country` but across every `postcode`. You will possibly match companies that are not really
#' the same (false positives) but happen to have a similar name and are located in
#' the same `country`. This will cost you additional manual-validation work ahead.
#' 
## -----------------------------------------------------------------------------
lacks_postcode <- loanbook_alias %>%
  filter(is.na(postcode) & !is.na(country)) %>%
  left_join(
    tilt_alias,
    by = c("country"),
    suffix = c("", "_tilt"),
    multiple = "all"
  )

#' 
#' * If your `loanbook` has `postcode` but lacks `country` we match companies with the same `postcode` but  across every `country`. You will possibly match companies that are not really
#' the same (false positives) but happen to have a similar name and the same
#' postcode. This will cost you additional manual-validation work ahead.
#' 
## -----------------------------------------------------------------------------
lacks_country <- loanbook_alias %>%
  filter(!is.na(postcode) & is.na(country)) %>%
  left_join(tilt_alias, by = c("postcode"), suffix = c("", "_tilt"))

#' 
#' * If your `loanbook` lacks both `postcode` and `country` we match companies across the entire dataset.  You will most likely match companies that are not really
#' the same (false positives). This will cost you the greatest amount of additional
#' manual-validation work ahead.
#' 
## -----------------------------------------------------------------------------
lacks_both <- loanbook_alias %>%
  filter(is.na(postcode) & is.na(country)) %>%
  mutate(postcode = "join_helper") %>%
  inner_join(
    dplyr::mutate(tilt_alias, postcode = "join_helper"),
    by = c("postcode"),
    suffix = c("", "_tilt"),
    multiple = "all"
  ) %>%
  mutate(postcode = NA_character_)

#' 
#' Having considered all cases, you can now combine them all in a single dataset:
#' 
## -----------------------------------------------------------------------------
candidates <- bind_rows(lacks_none, lacks_postcode, lacks_country, lacks_both)

candidates

#' 
#' Above each join allowed any one company in your `loanbook` to match `"all"` of
#' the potentially `multiple` companies the `tilt` dataset. Here, for example, one
#' company in our demo `loanbook` matches three candidates in our demo `tilt`
#' dataset:
#' 
## -----------------------------------------------------------------------------
candidates %>%
  filter(id == 1) %>%
  select(company_alias, id_tilt, company_alias_tilt)

#' 
#' Next, calculate the string similarity between the aliased `company_name` from
#' the loanbook and tilt datasets. Complete similarity corresponds to `1`, and
#' complete dissimilarity corresponds to `0`. For each company in the loanbook,
#' arrange matching candidates by descending similarity.
#' 
## -----------------------------------------------------------------------------
okay_candidates <- candidates %>%
  # Other parameters may perform best. See `?stringdist::stringsim`
  mutate(similarity = stringsim(
    company_alias, company_alias_tilt,
    # Good to compare human typed text that might have typos.
    method = "jw",
    p = 0.1
  )) %>%
  # Arrange matching candidates from more to less similar
  arrange(id, -similarity)

okay_candidates %>% relocate(similarity)

#' 
#' ### Pick best candidates
#' 
## -----------------------------------------------------------------------------
eligibility_threshold <- 0.75

#' 
#' Empirically we found that candidates under a `similarity` threshold of 
#' `r eligibility_threshold` are most likely false positives. Pick `similarity`
#' values above that threshold to drastically reduce the number of candidates
#' you'll need to validate manually. We believe this benefit outweighs the
#' potential loss of a few true positives.
#' 
## -----------------------------------------------------------------------------
best_candidates <- okay_candidates %>%
  filter(similarity > eligibility_threshold | is.na(similarity))

#' 
#' After picking the best candidates, some companies in your `loanbook` might no
#' longer have any candidate in the `tilt` dataset.
#' 
## -----------------------------------------------------------------------------
unmatched <- anti_join(
  okay_candidates %>% distinct(id, company_name),
  best_candidates %>% distinct(id, company_name)
)

unmatched

#' 
#' ### Suggest matches
#' 
## -----------------------------------------------------------------------------
# Decided upon extensive experience
suggestion_threshold <- 0.9

#' 
#' Later you'll need to manually decide which of all candidates if any is a true
#' match. To make that job easier, we can automatically make some suggestions in a
#' new column `suggest_match`.
#' 
#' The values of `suggest_match` are set to `TRUE` where the value of `similarity`
#' meets all of these conditions:
#' 
#' * It's the highest among all other candidates.
#' * It's above a threshold of `r suggestion_threshold`.
#' * It's the only such highest value in the group defined by a combination of `company_name` x `postcode` -- to avoid duplicates.
#' 
## -----------------------------------------------------------------------------
candidates_suggest_match <- best_candidates %>%
  # - It's the highest among all other candidates.
  group_by(id) %>%
  filter(similarity == max(similarity)) %>%
  # - It's above the threshold.
  filter(similarity > suggestion_threshold) %>%
  # - It's the only such highest value in the group defined by a combination of
  # `company_name` x `postcode` -- to avoid duplicates.
  mutate(duplicates = any(duplicated_paste(company_name, postcode))) %>%
  filter(!duplicates) %>%
  select(id, id_tilt) %>%
  mutate(suggest_match = TRUE) %>%
  ungroup()

#' 
#' In all other rows the value of `suggest_match` is automatically set to `NA`.
#' Also now create a new column `accept_match` and fill it with `NA`. Later you'll
#' edit this column.
#' 
## -----------------------------------------------------------------------------
to_edit <- best_candidates %>%
  left_join(candidates_suggest_match, by = c("id", "id_tilt")) %>%
  mutate(accept_match = NA)

to_edit %>% relocate(similarity, suggest_match)

#' 
#' Note that even a `similarity` of `1` in the same `postcode` can be a false
#' positive. For example, this is false positive:
#' 
## -----------------------------------------------------------------------------
to_edit %>%
  filter(id == 4, id_tilt == 4) %>%
  select(suggest_match, similarity, postcode, matches("misc_info"))

#' 
#' Now write the dataset `to_edit` so that you can explore it in a spreadsheet
#' editor. For example, you may write it as a .csv or .xlsx file then open it in
#' Excel or GoogleSheets.
#' 
#' ```r
#' vroom::vroom_write(to_edit, "to_edit.csv", delim = ",")
#' 
#' # Or, you can install the writexl package with: `install.packages("writexl")`
#' writexl::write_xlsx(to_edit, "to_edit.xlsx")
#' ```
#' 
#' ## 2. In a spreadsheet editor
#' 
#' ### Accept or reject matches
#' 
#' * Import the dataset `to_edit` into a spreadsheet editor like Excel or GoogleSheets.
#' * For each row decide if you want to reject or accept the suggested match ([manual decision rules](https://docs.google.com/document/d/140t0YOaTbX0Vh4Fpay8y5pEJjXXPxXbupxjoyymimRc)). By default
#' each row is rejected. To accept it type TRUE in the column `accept_match`.
#' * Save the edited file to later use it again in R, for example e.g. as "edited.csv".
#' 
#' ## 3. In R (session 2)
#' 
#' ### Use R packages and read data 
#' 
#' Restart R to ensure nothing from the previous R session affects this one.
#' 
#' Use the required packages for this section.
#' 
## -----------------------------------------------------------------------------
library(dplyr, warn.conflicts = FALSE)
library(tilt.company.match)

#' 
#' Read the "edited.csv" file, and again your loanbook.
#' 
## -----------------------------------------------------------------------------
# TODO: Replace with the path/to/your/real/edited.csv
edited_csv <- example_file("demo_matched.csv")
edited_csv

edited <- vroom(edited_csv, show_col_types = FALSE)
edited

# TODO: Replace with the path/to/your/real/loanbook.csv
loanbook_csv <- example_file("demo_loanbook.csv")
loanbook_csv

loanbook <- vroom(loanbook_csv, show_col_types = FALSE)
loanbook

#' 
#' ### Check the edited dataset
#' 
#' Manual work is prone to errors. Check the edited dataset to ensure it's correct:
#' 
#' - Use `report_no_matches()` to explore companies in the loanbook that didn't match
#' any company in the tilt dataset.
#' 
## -----------------------------------------------------------------------------
not_matched <- loanbook %>% report_no_matches(edited)
not_matched

#' 
#' - Use `check_duplicated_relation()` to check if a company
#' from loanbook has been matched to more than one company from the tilt dataset or
#' reverse.
#' 
## -----------------------------------------------------------------------------
# Good
edited %>% check_duplicated_relation()

#' 
#' With bad data you get informative errors.
#' 
## -----------------------------------------------------------------------------
# Bad: A single loanbook-company can't match multiple tilt-companies
bad_edited <- edited %>%
  mutate(accept_match = if_else(id %in% c(1, 2), TRUE, accept_match))
bad_edited %>% filter(id %in% c(1, 2))

bad_edited %>%
  check_duplicated_relation() %>%
  try()

# Bad: Multiple loanbook-companies can't match a single tilt-company
bad_edited2 <- demo_matched %>%
  filter(id_tilt == 3) %>%
  mutate(id = 12) %>%
  bind_rows(demo_matched)
bad_edited2 %>%
  filter(accept_match == TRUE & id_tilt == 3)

bad_edited2 %>%
  check_duplicated_relation() %>%
  try()

#' 
#' If your edited dataset is wrong, go back to your spreadsheet editor (i.e. repeat step 2), fix it, then check it again (i.e. repeat step 3).
#' 
#' ### Pick the matching companies
#' 
#' Once your edited dataset is correct, pick the matching companies and you're
#' done. Your final dataset will have as many rows as the number of `TRUE` values
#' in the `accept_match` columns of your edited dataset.
#' 
## -----------------------------------------------------------------------------
edited %>% count(accept_match)

final <- edited %>% filter(accept_match)
final

#' 
#' If you need to use your final dataset elsewhere, you may write it to a .csv file
#' like before with:
#' 
#' ```r
#' final %>% vroom::vroom_write("final.csv", delim = ",")
#' ```