Web Intelligence,
Natural Language Processing Group,
Department of Computer Science,
University of Sheffield,
Regent Court, 211 Portobello Street, Sheffield, S1 4DP,
Tel:+44(0)114-2228000 Fax:+44(0)114-22.21810
SimMetrics is an open source library of similarity, dissimilarity and string metrics, including edit distance, Levenshtein, Gotoh, Soundex and other similarity measures. SimMetrics is developed by Reverend Sam Chapman, who works at the Department of Computer Science, Regent Court, 211 Portobello Street, Sheffield, S1 4DP, United Kingdom. His research covers Information Integration, Record Linkage, Deduplication, Information Fusion, Databases, Merge/Purge, Referential Integrity for the Semantic Web, Natural Language Processing (NLP), Data Mining, Information Extraction, Human Computer Interaction, Summarisation, Text Clustering, Machine Learning, Inductive Wrappers, Text Processing, Multi Document Summarisation, Machine Translation, Standards and Agents. This page details distance, similarity and dissimilarity metrics for string comparison. These are used to compare the similarity between strings, i.e. concepts, and are applied within the Semantic Web.
SimMetrics - What is it?

SimMetrics is an open source, extensible library of similarity and distance metrics, e.g. Levenshtein distance, L2 distance, cosine similarity and Jaccard similarity. It provides float-based similarity measures between strings as well as each metric's typical unnormalised output.

It is intended for researchers in information integration (II) and other related fields. It includes a range of similarity measures from a variety of communities, including statistics, DNA analysis, artificial intelligence, information retrieval, and databases.

Further details on the individual string or similarity metrics are discussed here.

SimMetrics can be downloaded from SourceForge.

SimMetrics - Why is it?

This library has been developed to provide a consistent interface layer to similarity measures that act in a normalised manner, allowing comparison and composition of metrics, whilst still allowing use of each basic algorithm's original output.

All metrics can work on a simple basis whereby they take two strings and return a similarity measure from 0.0 to 1.0, 0.0 being entirely different, 1.0 being identical.
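
As an illustration of what a normalised 0.0 to 1.0 output looks like, here is a minimal Python sketch that scales Levenshtein edit distance by the length of the longer string. The function names and the choice of normalisation are assumptions made for this example; they are not taken from the SimMetrics API, and the library may normalise differently.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (the unnormalised output)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Scale the distance into [0.0, 1.0]; 1.0 means identical.

    Assumption for this sketch: divide by the longer string's length,
    which is the largest distance two strings of these lengths can have.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein_distance(a, b) / longest
```

With this scaling, `levenshtein_similarity("abc", "abc")` is 1.0 and `levenshtein_similarity("abc", "xyz")` is 0.0, while `levenshtein_distance` still exposes the raw edit count.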

The metrics have been optimised for fast processing and include methods that provide timing estimates.

Any metric that uses a cost function allows that function to be added to or modified, so custom metrics can be developed (cost functions are detailed in the descriptions of the various string metrics).
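
The idea of a swappable cost function can be sketched as follows. This is a hypothetical Python illustration, not the SimMetrics interface: `edit_distance`, `unit_cost`, `case_insensitive_cost` and the `gap_cost` parameter are all names invented for the example.

```python
from typing import Callable

# A cost function maps a pair of characters to a substitution cost.
CostFunction = Callable[[str, str], float]


def unit_cost(x: str, y: str) -> float:
    """Standard Levenshtein substitution cost: 0 for a match, 1 otherwise."""
    return 0.0 if x == y else 1.0


def case_insensitive_cost(x: str, y: str) -> float:
    """Custom cost: characters differing only in case are free to substitute."""
    return 0.0 if x.lower() == y.lower() else 1.0


def edit_distance(a: str, b: str,
                  substitution_cost: CostFunction = unit_cost,
                  gap_cost: float = 1.0) -> float:
    """Edit distance with a pluggable substitution cost (hypothetical sketch)."""
    previous = [j * gap_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        current = [i * gap_cost]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + gap_cost,                       # deletion
                current[j - 1] + gap_cost,                    # insertion
                previous[j - 1] + substitution_cost(ca, cb),  # substitution
            ))
        previous = current
    return previous[-1]
```

Passing `case_insensitive_cost` changes the metric without touching the algorithm: "Smith" and "smith" are one edit apart under `unit_cost` and zero edits apart under the custom cost.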

This standardised interface based approach allows a combination of techniques rather than inconsistent strategies that do not 'map'.
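
Because every metric shares the same signature and output range, metrics can be combined directly. The sketch below, in Python, averages two normalised metrics with weights. All names here (`token_jaccard`, `bigram_jaccard`, `combine`) are assumptions for illustration and do not come from SimMetrics.

```python
from typing import Callable, Sequence, Set, Tuple

# Shared interface: two strings in, a similarity in [0.0, 1.0] out.
Metric = Callable[[str, str], float]


def jaccard(x: Set[str], y: Set[str]) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not x and not y:
        return 1.0
    return len(x & y) / len(x | y)


def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lower-cased whitespace-separated tokens."""
    return jaccard(set(a.lower().split()), set(b.lower().split()))


def bigram_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lower-cased character bigrams."""
    def bigrams(s: str) -> Set[str]:
        s = s.lower()
        return {s[i:i + 2] for i in range(len(s) - 1)}
    return jaccard(bigrams(a), bigrams(b))


def combine(weighted_metrics: Sequence[Tuple[Metric, float]]) -> Metric:
    """Build a new metric as the weighted mean of normalised metrics."""
    total = sum(weight for _, weight in weighted_metrics)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    def combined(a: str, b: str) -> float:
        return sum(weight * metric(a, b)
                   for metric, weight in weighted_metrics) / total

    return combined
```

The result of `combine([(token_jaccard, 1.0), (bigram_jaccard, 1.0)])` is itself a metric with the same signature and the same 0.0 to 1.0 range, which is only possible because the inputs are normalised.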

Similar projects: SecondString provides a large collection of string metrics, but its outputs are unnormalised, which makes composition of metrics harder.

SimMetrics - Credits

SimMetrics was developed by Sam Chapman of the Natural Language Processing Group at the University of Sheffield.

This work was carried out within the AKT project, sponsored by the UK Engineering and Physical Sciences Research Council (grant GR/N15764/01), and the Dot.Kom project, sponsored by the EU IST as part of Framework V (grant IST-2001-34038).

This work is now released to the open source community and has benefited from the work of various developers and researchers.

I would welcome collaborations and outside development on this open source project. If you want to help, or simply leave a comment, please email me.

SimMetrics - More Details