library(levitate)

This article walks through an example of using levitate to compare text strings in the wild, and aims to give you a feel for the pros and cons of the different string similarity measures provided by the package.

levitate comes with the hotel_rooms dataset, which contains descriptions of the same hotel rooms from two different websites, Expedia and Booking.com. The list was compiled by Susan Li; all credit to her for the work.

head(hotel_rooms)
#>                                     expedia
#> 1     Standard Room, 1 King Bed, Accessible
#> 2        Grand Corner King Room, 1 King Bed
#> 3                Suite, 1 King Bed (Parlor)
#> 4       High-Floor Premium Room, 1 King Bed
#> 5              Room, 1 King Bed, Accessible
#> 6 Room, 2 Double Beds (19th to 25th Floors)
#>                                                 booking
#> 1               Standard King Roll-in Shower Accessible
#> 2                                Grand Corner King Room
#> 3                                     King Parlor Suite
#> 4                          High-Floor Premium King Room
#> 5                         King Room - Disability Access
#> 6 Two Double Beds - Location Room (19th to 25th Floors)

Let’s add columns to the dataset showing how each algorithm scores every pair of descriptions.

df <- hotel_rooms

df$lev_ratio <- lev_ratio(df$expedia, df$booking)
df$lev_partial_ratio <- lev_partial_ratio(df$expedia, df$booking)
df$lev_token_sort_ratio <- lev_token_sort_ratio(df$expedia, df$booking)
df$lev_token_set_ratio <- lev_token_set_ratio(df$expedia, df$booking)
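To get a feel for why token-based scoring can help with descriptions like these, here is a minimal base R sketch of the general token-sorting idea. The helper `sort_tokens()` is hypothetical and is not levitate's implementation; levitate's own tokenisation and scoring may differ. It only shows that sorting the words first can bring two reordered descriptions closer together in raw edit distance.

```r
# Hypothetical helper: lowercase, drop punctuation, split on spaces,
# sort the tokens, and re-join. An illustration of the general technique,
# not levitate's exact code.
sort_tokens <- function(x) {
  x <- gsub("[^a-z0-9 ]", "", tolower(x))
  tokens <- strsplit(trimws(x), " +")[[1]]
  paste(sort(tokens, method = "radix"), collapse = " ")
}

# One pair taken from head(hotel_rooms) above.
expedia <- "Suite, 1 King Bed (Parlor)"
booking <- "King Parlor Suite"

sort_tokens(expedia)
#> [1] "1 bed king parlor suite"
sort_tokens(booking)
#> [1] "king parlor suite"

# Raw edit distance before and after sorting, using base R's adist().
before <- drop(utils::adist(expedia, booking))
after <- drop(utils::adist(sort_tokens(expedia), sort_tokens(booking)))
after < before
#> [1] TRUE
```

After sorting, the Booking.com description is a suffix of the Expedia one, so the only remaining edits are the extra words.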

## A simple matching model

We can write a function to return the best match from a list of candidates.

best_match <- function(a, b, FUN) {
  # Score `a` against every candidate in `b`
  scores <- FUN(a = a, b = b)
  # Index of the highest-scoring candidate
  best <- order(scores, decreasing = TRUE)[1L]
  b[best]
}

best_match("cat", c("cot", "dog", "frog"), lev_ratio)
#> [1] "cot"
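The selection step does not depend on levitate, so it can be tried with any scoring function that takes `a` and `b`. The sketch below uses a hypothetical stand-in scorer, `toy_ratio()`, built on base R's `adist()`; it is not the formula behind `lev_ratio()`. It also swaps `order(...)[1L]` for `which.max()`, which returns the position of the first maximum.

```r
# Hypothetical stand-in scorer: 1 - edit distance / length of the longer string.
# Not levitate's lev_ratio() formula; used only to exercise best_match().
toy_ratio <- function(a, b) {
  d <- as.vector(utils::adist(a, b))
  1 - d / pmax(nchar(a), nchar(b))
}

best_match <- function(a, b, FUN) {
  scores <- FUN(a = a, b = b)
  # which.max() returns the first maximum, so ties keep the earliest candidate
  b[which.max(scores)]
}

best_match("cat", c("cot", "dog", "frog"), toy_ratio)
#> [1] "cot"
```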

We can then use this to find out which Booking.com entry each function chooses for each Expedia entry.

best_match_by_fun <- function(FUN) {
  best_matches <- character(nrow(hotel_rooms))
  for (i in seq_along(best_matches)) {
    best_matches[i] <- best_match(hotel_rooms$expedia[i], hotel_rooms$booking, FUN)
  }
  best_matches
}

df$lev_ratio_best_match <- best_match_by_fun(FUN = lev_ratio)
df$lev_partial_ratio_best_match <- best_match_by_fun(FUN = lev_partial_ratio)
df$lev_token_sort_ratio_best_match <- best_match_by_fun(FUN = lev_token_sort_ratio)
df$lev_token_set_ratio_best_match <- best_match_by_fun(FUN = lev_token_set_ratio)

We can now see what proportion of the Expedia entries each algorithm matched to the correct Booking.com entry.

message("lev_ratio(): ", sum(df$lev_ratio_best_match == df$booking) / nrow(df))
#> lev_ratio(): 0.329411764705882

message("lev_partial_ratio(): ", sum(df$lev_partial_ratio_best_match == df$booking) / nrow(df))
#> lev_partial_ratio(): 0.223529411764706

message("lev_token_sort_ratio(): ", sum(df$lev_token_sort_ratio_best_match == df$booking) / nrow(df))
#> lev_token_sort_ratio(): 0.588235294117647

message("lev_token_set_ratio(): ", sum(df$lev_token_set_ratio_best_match == df$booking) / nrow(df))
#> lev_token_set_ratio(): 0.376470588235294
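The match-then-score steps above can also be folded into one helper, which makes it easier to compare several scorers side by side. This is a sketch under assumptions: `match_accuracy()` and `edit_sim()` are hypothetical names, `edit_sim()` is a base R stand-in rather than a levitate function, and the six pairs are copied from the `head(hotel_rooms)` output rather than the full dataset, so the result will not reproduce the proportions reported above.

```r
# Six pairs copied from head(hotel_rooms); not the full dataset.
rooms <- data.frame(
  expedia = c(
    "Standard Room, 1 King Bed, Accessible",
    "Grand Corner King Room, 1 King Bed",
    "Suite, 1 King Bed (Parlor)",
    "High-Floor Premium Room, 1 King Bed",
    "Room, 1 King Bed, Accessible",
    "Room, 2 Double Beds (19th to 25th Floors)"
  ),
  booking = c(
    "Standard King Roll-in Shower Accessible",
    "Grand Corner King Room",
    "King Parlor Suite",
    "High-Floor Premium King Room",
    "King Room - Disability Access",
    "Two Double Beds - Location Room (19th to 25th Floors)"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in scorer built on base R's adist().
# With levitate loaded, lev_ratio() and friends could be passed instead.
edit_sim <- function(a, b) {
  d <- as.vector(utils::adist(a, b))
  1 - d / pmax(nchar(a), nchar(b))
}

# Proportion of `from` entries whose top-scoring candidate in `to`
# is the entry on the same row.
match_accuracy <- function(from, to, FUN) {
  picks <- vapply(
    from,
    function(x) to[which.max(FUN(a = x, b = to))],
    character(1),
    USE.NAMES = FALSE
  )
  mean(picks == to)
}

acc <- match_accuracy(rooms$expedia, rooms$booking, edit_sim)
```

Taking `mean()` of the logical vector gives the same proportion as `sum(...) / nrow(df)`, and `vapply()` replaces the explicit loop while still checking that each pick is a single string.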