Comparison of String Distance Algorithms

For the visualization of votes in the Bundestag I had to read in handwritten protocols of the sessions. These are unfortunately riddled with typos, which is why I had to deal with different versions of one name. Because I wanted a quick solution and the effort was reasonable, I just took care of it manually. For larger projects this approach would not be feasible, though. So I had a look at what R offers for fuzzy string matching beyond good ol' Levenshtein distance and came across a rather new package answering to the name of "stringdist", maintained by Mark van der Loo. To my pleasant surprise it offers not two, not three, but a whole variety of configurable algorithms for that purpose. But I had no idea what, for example, the effective difference between a Jaccard distance and a cosine distance is. So I played around with them a bit and finally came up with the idea of something like a slope graph showing the distances for alterations of one string – in this case "Cosmo Kramer" – just to get started and to get an idea of what is going on and how the different algorithms are affected by certain alterations.

The slope graph and a few observations

The colors serve the purpose of categorizing the alterations: typo, conventional variation, unconventional variation and totally different.

I introduced the red category to get an idea of where to expect the boundary between "could be considered the same" and "is definitely something different". An interesting observation is that all algorithms manage to keep the typos clear of the red zone, which is what you would intuitively expect from a reasonable string distance algorithm.

Also note how q-gram-, Jaccard- and cosine-distance lead to virtually the same ordering for q in {2, 3}, differing only in the scale of the distance values. For q = 1 those algorithms are obviously indifferent to permutations. Jaro-Winkler again seems to care little about characters that are interspersed, placed randomly or missing, as long as the target word's characters are present in the correct order.
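
A quick check with stringdist illustrates the q = 1 indifference to permutations (the reversed string here is my own example):

    library(stringdist)

    s <- "Cosmo Kramer"
    p <- "remarK omsoC"  # the same characters in reverse order

    # For q = 1 the q-gram profile is just the character frequency table,
    # which any permutation leaves unchanged -- all three distances are 0:
    stringdist(s, p, method = "qgram",   q = 1)
    stringdist(s, p, method = "cosine",  q = 1)
    stringdist(s, p, method = "jaccard", q = 1)

    # With q = 2 the order of characters matters again:
    stringdist(s, p, method = "jaccard", q = 2)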

The different algorithms provided by stringdist

Hamming distance: Number of positions at which the symbols of the two strings differ. Only defined for strings of equal length.

distance('abcdd', 'abbcd') = 2
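
For reference, the same comparison in stringdist (the unequal-length call is my own addition):

    library(stringdist)

    # Positions 3 and 4 differ ('c' vs 'b' and 'd' vs 'c'):
    stringdist("abcdd", "abbcd", method = "hamming")  # 2

    # For strings of unequal length the distance is undefined and
    # stringdist returns Inf:
    stringdist("abc", "abcd", method = "hamming")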

Levenshtein distance: Minimal number of insertions, deletions and replacements needed for transforming string a into string b.
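
For example (these variants are mine, not taken from the slope graph):

    library(stringdist)

    # One substitution (C -> K):
    stringdist("Cosmo Kramer", "Kosmo Kramer", method = "lv")   # 1

    # One substitution plus one insertion:
    stringdist("Cosmo Kramer", "Kosmo Krammer", method = "lv")  # 2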

(Full) Damerau-Levenshtein distance: Like Levenshtein distance, but transposition of adjacent symbols is allowed.

Optimal String Alignment / restricted Damerau-Levenshtein distance: Like (full) Damerau-Levenshtein distance but each substring may only be edited once.
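
The classic example for the difference between the full and the restricted variant is transforming "ca" into "abc" – sketched here with stringdist:

    library(stringdist)

    # Full Damerau-Levenshtein: "ca" -> "ac" (transposition) -> "abc" (insertion)
    stringdist("ca", "abc", method = "dl")    # 2

    # Optimal String Alignment may not edit the transposed pair again,
    # so it needs three single edits:
    stringdist("ca", "abc", method = "osa")   # 3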

Longest Common Substring distance: Minimum number of symbols that have to be removed from the two strings until the resulting substrings are identical.

distance('ABCvDEx', 'xABCyzDE') = 5
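
Pairing up the common characters "ABCDE" while keeping their order leaves 7 + 8 - 2*5 = 5 characters unpaired; a quick check:

    library(stringdist)

    stringdist("ABCvDEx", "xABCyzDE", method = "lcs")   # 5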

q-gram distance: Sum of absolute differences between the q-gram count vectors of both strings.

Cosine distance: 1 minus the cosine similarity of both q-gram count vectors.

Jaccard distance: 1 minus the quotient of shared q-grams and all observed q-grams.
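
All three are built on the same q-gram count vectors, which can be inspected with stringdist's qgrams() function (the strings below are my own toy example):

    library(stringdist)

    x <- "Cosmo"
    y <- "Kosmo"

    # The 2-gram count vectors the three metrics operate on:
    qgrams(x = x, y = y, q = 2)

    stringdist(x, y, method = "qgram",   q = 2)  # sum of absolute count differences
    stringdist(x, y, method = "cosine",  q = 2)  # 1 - cosine of the two count vectors
    stringdist(x, y, method = "jaccard", q = 2)  # 1 - shared / observed 2-grams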

Jaro distance: The Jaro distance is computed from four values and is effectively a special case of the Jaro-Winkler distance with p = 0.

Jaro-Winkler distance: This distance is computed from five values determined by the two compared strings (A, B, m, t, l) and a prefix weight p chosen from [0, 0.25].
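
In stringdist both are available through method "jw"; with the default p = 0 you get the plain Jaro distance (the example strings are mine):

    library(stringdist)

    # p = 0: plain Jaro distance
    stringdist("Cosmo Kramer", "Kosmo Kramer", method = "jw", p = 0)

    # p = 0.1 rewards a common prefix; these two strings share none,
    # so the value stays the same:
    stringdist("Cosmo Kramer", "Kosmo Kramer", method = "jw", p = 0.1)

    # With a shared prefix, a positive p lowers the distance:
    stringdist("Cosmo Kramer", "Cosmo Cramer", method = "jw", p = 0)
    stringdist("Cosmo Kramer", "Cosmo Cramer", method = "jw", p = 0.1)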

Meaningful quantification of the difference between two strings

Which string distance to use depends on the situation. If we want to compensate for typos, then the variations of the Levenshtein distance are of good use, because they take into account the three or four usual types of typos. The metric could be improved, for example, by factoring the keyboard layout into the calculation. On an English keyboard the distance between "test" and "rest" would then be smaller than the distance between "test" and "best", for obvious reasons. This would be a top-down assessment of a string metric. The bottom-up counterpart would be to quantify the question "What would a human being (me) consider similar?" and its answer. This is naturally tough to compute – but there is one case for which it is actually possible! Check this out:

Ins’t it fnnuy taht you can raed tihs steennce eevn tohguh leettsrs ecpext for the fsrit and lsat one are perumted?

So from that perspective a good string metric would consider two strings very close if the first and last letters match and the letters in between are merely permuted. You don't have to be a genius to tell from the descriptions of the algorithms given above that none of them will perform exceptionally well here, and the ones that do are probably just immune to permutations altogether – but what the heck – I got curious how the metrics respond to permutations. One further aspect: even though human reading seems to be unimpressed by such framed permutations, ambiguous cases might arise – "ecxept"/"except" and "expcet"/"expect" – in which case the Hamming distance would (maybe) determine the interpretation. That is why I chose it for the coloring in the following plot:

[Jitter plot: string distances for permutations of "abcdef", colored by Hamming distance]


A few observations

I annotated some dots because they were sticking out. A Hamming distance of two but maximum distance for q-gram-, cosine- and Jaccard-distance with q = 3 – that is interesting. Or, for Jaro-Winkler, the maximum distance for only one permutation right next to the special case "abcdef". These cases can be regarded as something like "algorithmic blind spots".
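
One such blind spot can be reconstructed directly (my own example, not necessarily the exact dot annotated in the plot): swapping "b" and "d" in "abcdef" leaves the Hamming distance at two but destroys every 3-gram.

    library(stringdist)

    s1 <- "abcdef"
    s2 <- "adcbef"   # 'b' and 'd' swapped

    stringdist(s1, s2, method = "hamming")          # 2

    # The 3-gram sets {abc,bcd,cde,def} and {adc,dcb,cbe,bef} are disjoint,
    # so the q-gram based metrics hit their maximum:
    stringdist(s1, s2, method = "qgram",   q = 3)   # 8
    stringdist(s1, s2, method = "cosine",  q = 3)   # 1
    stringdist(s1, s2, method = "jaccard", q = 3)   # 1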

Also worth noting is how, for q-gram, cosine and Jaccard, the number of permutations with the same Hamming distance per cluster is the same. I don't think this is obvious from the definitions of the metrics.

The R code producing the distances for “Cosmo Kramer”
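
The embedded code is not reproduced here; the following is only a minimal sketch of how such a distance table can be computed with stringdist – the variant strings and the set of methods are illustrative, not the original ones.

    library(stringdist)

    target   <- "Cosmo Kramer"
    variants <- c("Cosmo Kramer", "Kosmo Kramer", "Cosmo Krammer",
                  "Csmo Kramer", "remarK omsoC", "Hello, World!")

    methods <- c("osa", "lv", "dl", "lcs", "qgram", "cosine", "jaccard", "jw")

    # One row per variant, one column per metric (q is ignored by the
    # methods that do not use it):
    d <- sapply(methods, function(m)
      stringdist(target, variants, method = m, q = 2))
    rownames(d) <- variants
    round(d, 3)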

 The R code for producing the permutations scatter plot
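
Again only a hedged sketch, not the original script: it generates all permutations of "abcdef" that keep the first and last letter fixed, computes the distances per method and colors the points by Hamming distance with ggplot2.

    library(stringdist)
    library(ggplot2)

    # All permutations of a character vector (plain recursion):
    perms <- function(v) {
      if (length(v) <= 1) return(list(v))
      do.call(c, lapply(seq_along(v), function(i)
        lapply(perms(v[-i]), function(p) c(v[i], p))))
    }

    # Keep 'a' and 'f' fixed, permute the four letters in between:
    words <- vapply(perms(c("b", "c", "d", "e")),
                    function(p) paste(c("a", p, "f"), collapse = ""), "")

    methods <- c("osa", "lv", "dl", "lcs", "qgram", "cosine", "jaccard", "jw")

    df <- do.call(rbind, lapply(methods, function(m)
      data.frame(method  = m,
                 word    = words,
                 dist    = stringdist("abcdef", words, method = m, q = 3),
                 hamming = stringdist("abcdef", words, method = "hamming"))))

    ggplot(df, aes(x = method, y = dist, colour = factor(hamming))) +
      geom_jitter(width = 0.2, height = 0) +
      labs(colour = "Hamming\ndistance")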

 


(original article published on www.joyofdata.de)

13 thoughts on "Comparison of String Distance Algorithms"

  1. Super useful :) Your Cosmo Kramer sample code made my day!

    Cheers,

    Prachi

  2. Really nice of you to give us some practical info on how to implement the different techniques!

  3. Hi Raffael,

    Really good article for people who want to understand different matching algorithms.

    Thanks for all your hard work.

    Thanks,

    Jayesh Wairale

  4. Wow, thanks for this really interesting post!

    Helped a lot :)

  5. Pingback: Jaro Winkler – String Similarity Measurement for short strings | datafireball

  6. Pingback: Fuzzy String Search in a DB | Literate Java

  7. Hi Raffael, nice post!

    Very nice and short summary of the metrics. I also like how you use the ‘grams’ function to define your own distance functions: this was exactly the use case I had in mind when I wrote it :).

    Your visualizations look really pretty. Just as a small suggestion, you could consider using a diverging color scale for the jitter plot, so larger Hamming distances get 'higher' color values. For example, by

    library(RColorBrewer)

    pal = brewer.pal(n=6,name = 'Blues')

    and in the last line use scale_color_manual(values=pal) .

    If you’re ok with it, I will add a reference to this blog in the package’s help file so users can click through. I think it could be a great help to users who start out with this material.

    Cheers,

    Mark

     

    • Hello Mark,

      I am glad you enjoyed this view on fuzzy string matching with your package!

      Thanks also for your suggestion regarding the scatter plot. Actually I also gave continuous color scales using ColorBrewer a try. But I just couldn't discern the different Hamming distances anymore, especially for the lower values. But those were the ones I was most curious about. It is "interesting" that two strings are metrically far apart even though just two symbols are switched.

      Of course you can publish a link in the help files … actually that would be really awesome!

      BTW editrules + visualizations looks pretty interesting! I’ll have a closer look at it.

      Cheers

      Raffael
