levitate is based on the Python thefuzz (formerly
fuzzywuzzy) package for fuzzy string matching. An R port of this already exists, but unlike fuzzywuzzyR,
levitate is written entirely in R with no external dependencies on
reticulate or Python. It also offers a couple of extra bells and whistles in the form of vectorised functions.
View the docs at https://lewinfox.github.io/levitate/.
The name is a nod to the Levenshtein distance, a common measure of string similarity, and it was available on CRAN.
Install the released version from CRAN with install.packages("levitate").
Alternatively, you can install the development version from GitHub with devtools::install_github("lewinfox/levitate").
The edit distance is the number of insertions, deletions or substitutions needed to transform one string into another. Base R provides the
adist() function to compute this.
levitate provides lev_distance(), which is powered by the stringdist package.
    lev_distance("cat", "bat")
    #> [1] 1
    lev_distance("rat", "rats")
    #> [1] 1
    lev_distance("cat", "rats")
    #> [1] 2
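For comparison, the same three distances can be reproduced with base R alone. This is only a sketch using adist(), which computes a generalised Levenshtein distance; lev_distance() delegates to a different backend, so the two are not guaranteed to agree on every input. The names d1, d2 and d3 are illustrative.

```r
# Base R only: adist() returns a 1 x 1 matrix for two scalar strings,
# so drop() is used to reduce each result to a plain number.
d1 <- drop(adist("cat", "bat"))   # one substitution (c -> b)
d2 <- drop(adist("rat", "rats"))  # one insertion (s)
d3 <- drop(adist("cat", "rats"))  # one substitution plus one insertion

print(c(d1, d2, d3))
#> [1] 1 1 2
```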
The function can accept vectorised input. Where the inputs have a
length() greater than 1 the results are returned as a vector unless
pairwise = FALSE, in which case a matrix is returned.
    lev_distance(c("cat", "dog", "clog"), c("rat", "log", "frog"))
    #> [1] 1 1 2
    lev_distance(c("cat", "dog", "clog"), c("rat", "log", "frog"), pairwise = FALSE)
    #>      rat log frog
    #> cat    1   3    4
    #> dog    3   1    2
    #> clog   4   1    2
If either input is scalar (length 1), the result is a vector. The elements of the vector are named after the longer input (unless
useNames = FALSE).
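The difference between element-wise and all-against-all comparison can be illustrated with base R. The sketch below uses adist() as a stand-in for the distance calculation (an assumption; it is not how levitate computes its results), and the variable names all_pairs and elementwise are made up for the example.

```r
a <- c("cat", "dog", "clog")
b <- c("rat", "log", "frog")

# All-against-all: adist() returns a length(a) x length(b) matrix.
all_pairs <- adist(a, b)
dimnames(all_pairs) <- list(a, b)

# Element-wise: compare a[i] with b[i] only, which gives a plain vector.
elementwise <- mapply(function(x, y) drop(adist(x, y)), a, b, USE.NAMES = FALSE)

print(all_pairs)
print(elementwise)
```

The element-wise vector is the diagonal of the all-against-all matrix, which is a quick way to sanity-check either result.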
More useful than the edit distance,
lev_ratio() makes it easier to compare similarity across different strings. Identical strings will get a score of 1 and entirely dissimilar strings will get a score of 0.
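One common way to turn an edit distance into a 0 to 1 score is to divide by the length of the longer string and subtract from 1. The sketch below assumes that normalisation and uses base R's adist(); sim_ratio() is a hypothetical helper, not part of levitate, and the package's own scores may be computed differently.

```r
# Hypothetical helper: similarity = 1 - distance / length of the longer string.
sim_ratio <- function(a, b) {
  longest <- max(nchar(a), nchar(b))
  if (longest == 0) return(1)  # two empty strings are treated as identical
  1 - drop(adist(a, b)) / longest
}

print(sim_ratio("cat", "cat"))  # identical strings
print(sim_ratio("abc", "xyz"))  # nothing in common
print(sim_ratio("cat", "bat"))  # one edit out of three characters
```

Because the score is scaled by string length, a single edit counts for less in a long string than in a short one, which is what makes scores comparable across pairs.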
This function behaves exactly like lev_distance().
If a and b are different lengths, lev_partial_ratio() compares all the substrings of the longer string that are the same length as the shorter string and returns the highest
lev_ratio() of all of them. E.g. when comparing
"actor" and "tractor" we would compare "actor" with "tract", "racto" and
"actor" and return the highest score (in this case 1).
The inputs are tokenised and the tokens are sorted alphabetically, then the resulting strings are compared.
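A minimal sketch of the tokenise, sort and compare steps in base R follows. It assumes tokens are separated by whitespace and that case is left untouched; the package's actual tokenisation rules may differ. sim_ratio(), sort_tokens() and token_sort_ratio() are hypothetical helpers.

```r
# Hypothetical helper: similarity = 1 - distance / length of the longer string.
sim_ratio <- function(a, b) {
  longest <- max(nchar(a), nchar(b))
  if (longest == 0) return(1)
  1 - drop(adist(a, b)) / longest
}

# Split on whitespace, sort the tokens, and glue them back together.
sort_tokens <- function(s) {
  tokens <- strsplit(trimws(s), "\\s+")[[1]]
  paste(sort(tokens), collapse = " ")
}

token_sort_ratio <- function(a, b) {
  sim_ratio(sort_tokens(a), sort_tokens(b))
}

print(sort_tokens("new york mets"))
print(token_sort_ratio("new york mets", "mets new york"))
```

Sorting first means that word order no longer affects the score, so strings containing the same words in a different order compare as identical.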
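Similar to lev_token_sort_ratio(), this function breaks the input down into tokens. It then identifies any common tokens between strings and creates three new strings: x (the common tokens), y (the common tokens followed by the remaining tokens from a) and z (the common tokens followed by the remaining tokens from b),
and performs three pairwise
lev_ratio() calculations between them (x vs y, y vs z and x vs
z). The highest of those three ratios is returned.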
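The token-set comparison can be sketched in base R as follows. The sketch assumes the three strings are the sorted common tokens, the common tokens plus the tokens unique to the first input, and the common tokens plus the tokens unique to the second input, as in thefuzz. Whitespace tokenisation and the helpers sim_ratio(), tokenise() and token_set_ratio() are assumptions made for illustration.

```r
# Hypothetical helper: similarity = 1 - distance / length of the longer string.
sim_ratio <- function(a, b) {
  longest <- max(nchar(a), nchar(b))
  if (longest == 0) return(1)
  1 - drop(adist(a, b)) / longest
}

# Split on whitespace and keep each distinct token once.
tokenise <- function(s) unique(strsplit(trimws(s), "\\s+")[[1]])

token_set_ratio <- function(a, b) {
  ta <- tokenise(a)
  tb <- tokenise(b)
  join <- function(tokens) paste(sort(tokens), collapse = " ")

  x <- join(intersect(ta, tb))                   # common tokens only
  y <- trimws(paste(x, join(setdiff(ta, tb))))   # common + rest of a
  z <- trimws(paste(x, join(setdiff(tb, ta))))   # common + rest of b

  max(sim_ratio(x, y), sim_ratio(y, z), sim_ratio(x, z))
}

print(token_set_ratio("the quick fox", "fox the quick brown"))
print(token_set_ratio("cat", "dog"))
```

When one input's tokens are a subset of the other's, the common-token string matches one of the other two exactly, so the score is 1 regardless of the extra tokens.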
Results differ between levitate and
thefuzz, not least because
stringdist offers several possible similarity measures. Be careful if you are porting code that relies on hard-coded or learned cutoffs for similarity measures.