Record Linkage Algorithms in F# – Extensions to Jaro-Winkler Distance (Part 3)

While writing the previous article on tokenized matching, I realized I had left out some important background information on Jaro-Winkler distance.

The Vector Triangle Inequality

First, there’s something important to know about the Jaro-Winkler distance: it’s not a metric distance and so does not obey the triangle inequality. That is, if you found the JW distance between strings A and B, and then found the JW distance between strings B and C, those results would tell you nothing about the JW distance between strings A and C. This may not seem like a big deal, but it means Jaro-Winkler distance can’t be used to embed strings in a metric space, and so it’s a poor algorithm choice for many types of clustering. This will be an important point in future articles.
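
If you want to convince yourself of this on your own data, a check is easy to write. Here’s a minimal sketch, assuming the jaroWinkler function from the earlier articles in this series is in scope; it converts the similarity into a distance and tests a triple of strings against the triangle inequality:

    // Minimal sketch: treat (1 - similarity) as a distance and test the
    // triangle inequality directly. Assumes jaroWinkler : string -> string -> float
    // from the earlier articles is in scope.
    let jwDistance a b = 1.0 - jaroWinkler a b

    // For a true metric, d(a, c) <= d(a, b) + d(b, c) must hold for every triple.
    let violatesTriangleInequality a b c =
        jwDistance a c > jwDistance a b + jwDistance b c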

Second, it can be very helpful to extend the results of Jaro-Winkler based on the nature of your own data and your use of the algorithm. To better support my own use case, I’ve made changes that put the emphasis on better token alignment.

    let jaroWinklerMI (t1:string) (t2:string) =
        // Optimizations for easy-to-calculate cases
        if t1.Length = 0 || t2.Length = 0 then 0.0
        elif t1 = t2 then 1.0
        else
            // Even more weight for the first char
            let score = jaroWinkler t1 t2
            let p = 0.2 // percentage of score from new metric
            let b = if t1.[0] = t2.[0] then 1.0 else 0.0
            ((1.0 - p) * score) + (p * b)

Beyond the shortcuts for empty strings and strings that are exactly equal, you can see here that I weight the first character even more heavily. This is because my data is very initial-heavy.

To compensate for the frequent use of middle initials, I count the Jaro-Winkler distance as 80% of the score, while the remaining 20% is based entirely on whether the first characters match. The value of p here was determined by heavy experimentation and hair pulling. Before making this extension, initials would frequently align incorrectly.
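
As a rough illustration (the exact numbers depend on the underlying jaroWinkler from the earlier articles, so treat these as hypothetical calls rather than published results), the bonus term b contributes the full 0.2 when the first characters agree and nothing when they don’t:

    // Hypothetical initial-vs-name comparisons: when the first characters
    // match, b = 1.0 and the pair picks up the full 20% bonus; when they
    // don't, that slice of the score is simply zero.
    printfn "%f" (jaroWinklerMI "J" "John")   // first chars match:  b = 1.0
    printfn "%f" (jaroWinklerMI "E" "John")   // first chars differ: b = 0.0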

    let scoreNamePairs (t1:string) (t2:string) =
        // Raise jaro to a power in order to over-weight better matches
        jaroWinklerMI t1 t2 ** 2.0

I also take the square of the result of jaroWinklerMI to weight better matches even more heavily. I found that in doing this I was able to get much more reliable matching. To understand how this works, consider what squaring does to a score between 0 and 1.

As you already know, multiplying a number greater than 0 but less than 1 by itself gives you a smaller number. As you might intuit, though, the smaller the number, the greater the proportional reduction: anything less than 1 takes a hit, but worse matches get dragged down significantly more.
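
In place of the original plot, here is the same relationship in numbers; a quick sketch you can paste into F# Interactive:

    // The squaring effect in numbers: strong matches are barely touched,
    // weak matches are crushed.
    [ 0.95; 0.8; 0.6; 0.4 ]
    |> List.iter (fun s -> printfn "%.2f -> %.4f" s (s ** 2.0))
    // 0.95 -> 0.9025
    // 0.80 -> 0.6400
    // 0.60 -> 0.3600
    // 0.40 -> 0.1600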

Initially I was frustrated by bad alignments which would sometimes be chosen over better ones when two or more tokens were both fairly close, but not great. After seeing a variation on this squaring technique used for matrix convergence the thought occurred to me: why not see if it helps with token alignment? After implementing this I saw a huge improvement in results: incorrect alignments completely disappeared!
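
To see why squaring helps the alignments, here is a small worked example with made-up pair scores (purely illustrative, not from my data): one candidate alignment contains a near-exact match plus a weak one, the other is uniformly mediocre. Summing raw scores picks the wrong alignment by a hair; summing squared scores picks the right one.

    // Illustrative, made-up scores for two candidate token alignments.
    let correctAlignment  = [ 0.95; 0.55 ]  // one excellent pair, one weak pair
    let mediocreAlignment = [ 0.76; 0.75 ]  // everything fairly close, but not great

    let rawTotal     (scores: float list) = List.sum scores
    let squaredTotal (scores: float list) = scores |> List.sumBy (fun s -> s ** 2.0)

    printfn "raw:     %.4f vs %.4f" (rawTotal correctAlignment)     (rawTotal mediocreAlignment)
    printfn "squared: %.4f vs %.4f" (squaredTotal correctAlignment) (squaredTotal mediocreAlignment)
    // raw:     1.5000 vs 1.5100  -- the mediocre alignment sneaks ahead
    // squared: 1.2050 vs 1.1401  -- the correct alignment wins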

It’s often surprising where inspiration will come from.

Edit: The above code and its composition with Gale-Shapley are now available in my GitHub repository.

Comments

  1. Have you compared results from using JW distance vs. a conventional edit distance like Levenshtein?
    I.e., is the assumption valid that errors are the fault of human transcription and therefore should be scored thusly? (Sorry, I just said “thusly”… :)) YMMV, I guess, based on your database.
    Seems like you really need edit/transform-distance, JW-like distance, bag of tokens, bag of soundex tokens, proper-name tables (Bob=Robert) etc., and tune the weights on each of these until you get your secret sauce.

    • Oh yes, we put in a lot of effort there, but we found that Jaro-Winkler performed much better after the above enhancements. I think this is because we’re comparing dirty name data (which is what J-W was designed for at the Census) and not just general English.
