While writing the previous article on tokenized matching I realized I left out some important background information on Jaro-Winkler distance.
First, there’s something important to know about Jaro-Winkler distance: it’s not a metric and so does not obey the triangle inequality. That is, knowing the JW distance between strings A and B, and the JW distance between strings B and C, gives you no guaranteed bound on the JW distance between strings A and C. This may not seem like a big deal, but it means Jaro-Winkler distance can’t be used to embed strings in a metric space, which makes it a poor algorithm choice for many types of clustering. This will be an important point in future articles.
Second, it can be very helpful to extend Jaro-Winkler based on the nature of your own data and your use of the algorithm. To better support my own use case I’ve made changes that put the emphasis on better token alignment.
let jaroWinklerMI (t1:string) (t2:string) =
    // Optimizations for easy-to-calculate cases
    if t1.Length = 0 || t2.Length = 0 then 0.0
    elif t1 = t2 then 1.0
    else
        // Even more weight for the first char
        let score = jaroWinkler t1 t2
        let p = 0.2 // percentage of score from new metric
        let b = if t1.[0] = t2.[0] then 1.0 else 0.0
        ((1.0 - p) * score) + (p * b)
Beyond the shortcuts for empty strings and exact matches, you can see here that I weight the first character even more heavily. This is because my data is very initial-heavy.
To compensate for the frequent use of middle initials I count Jaro-Winkler distance as 80% of the score, while the remaining 20% is fully based on the first character matching. The value of p here was determined by the results of heavy experimentation and hair pulling. Before making this extension initials would frequently align incorrectly.
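To make the blending step concrete, here’s a small sketch. The 0.7 scores passed in below are hypothetical stand-ins for what the real jaroWinkler function would return; the point is just to show how the first-character bonus separates two candidates that would otherwise tie:

```fsharp
// A minimal sketch of the blending step inside jaroWinklerMI.
// jwScore is a hypothetical base Jaro-Winkler score in [0, 1].
let blend (jwScore: float) (firstCharsMatch: bool) =
    let p = 0.2 // percentage of the score from the first-char check
    let b = if firstCharsMatch then 1.0 else 0.0
    ((1.0 - p) * jwScore) + (p * b)

// Two candidate alignments with the same base score no longer tie:
let matched   = blend 0.7 true   // 0.8 * 0.7 + 0.2 = 0.76
let unmatched = blend 0.7 false  // 0.8 * 0.7 + 0.0 = 0.56
```

With p = 0.2 a first-character mismatch costs a flat 0.2, which is exactly what pushes an initial toward the token it should pair with.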
let scoreNamePairs (t1:string) (t2:string) =
    // Raise jaro to a power in order to over-weight better matches
    jaroWinklerMI t1 t2 ** 2.0
I also take the square of the result of jaroWinklerMI to weight better matches even more heavily. I found that in doing this I was able to get much more reliable matching. To understand how this works take a gander at this plot.
As you already know, multiplying any number greater than 0 but less than 1 by itself will give you a smaller number. Moreover, as you might intuit, the smaller the number, the greater the proportional reduction. As you can see here, anything less than 1 takes a hit, but worse matches get dragged down significantly more.
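A quick numeric check makes the squaring effect obvious (these inputs are illustrative, not scores from my data):

```fsharp
// Squaring scores in (0, 1): good matches lose a little,
// poor matches lose proportionally much more.
let sharpen (score: float) = score ** 2.0

let strong = sharpen 0.9  // 0.81, roughly a 10% reduction
let weak   = sharpen 0.5  // 0.25, a 50% reduction
```

The gap between a strong and a weak match widens from 0.4 to 0.56, which is what makes the best alignment stand out.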
Initially I was frustrated by bad alignments which would sometimes be chosen over better ones when two or more tokens were both fairly close, but not great. After seeing a variation on this squaring technique used for matrix convergence the thought occurred to me: why not see if it helps with token alignment? After implementing this I saw a huge improvement in results: incorrect alignments completely disappeared!
It’s often surprising where inspiration will come from.
Edit: The above code and its composition with Gale-Shapley is now available in my github repository.
Tags: Algorithms, F#, fsharp, Jaro, Jaro-Winkler, Machine Learning, Record Linkage
Have you compared results from using JW distance vs. a conventional edit-distance like Levenshtein?
I.e., is the assumption valid that errors are the fault of human transcription and therefore should be scored thusly? (Sorry, I just said “thusly”…:)) YMMV, I guess, based on your database.
Seems like you really need edit/transform-distance, JW-like distance, bag of tokens, bag of soundex tokens, proper names tables (Bob=Robert) etc., and tune the weights on each of these until you get your secret sauce.
Oh yes, we put a lot of effort into that comparison, but we found that Jaro-Winkler performed much better after the above enhancements. I think this is because we’re comparing dirty name data (which is what J-W was designed for at the census) and not just general English.