Betsy Co-ed? Betz ee kowed?

The webmaster of a Betws-y-Coed tourist website found that, during 2011, people had typed the name of this Welsh village into search engines in 364 different ways and had still found the site. That is a testament not only to the abilities of some search engines, but also to our creativity when spelling place names.

It raises the question of how many searches using yet other spellings failed to find the site at all.

Betws-y-Coed: “prayer house in the wood”

The spellings ranged from the correct spelling without its hyphens to Betsycoyd, with every possible combination in between. The people who found the site using the correct spelling were in a significant minority.

Betws-y-Coed is a Welsh name meaning “prayer house in the wood”, and the non-speaker’s unfamiliarity with Welsh and its pronunciation will certainly be a major cause of this difficulty.  A Welsh speaker once told me that Welsh is easy because words are written as they are pronounced – but we’d need the rules of pronunciation to get a grip on this.  Who else knew, for example, that w and y are vowels in Welsh?

Much confusion is also due to the way the English are usually taught to pronounce this place’s name – as Betsy Co-ed. It’s properly pronounced something closer to Better sickoy (ˈbɛtʊs ə ˈkɔɨd), though this transliteration would probably be a hindrance rather than a help in finding any relevant site!

Spelling place names correctly

The people looking for the tourist information were clearly not inhabitants, and when locals fill in their details they have a better chance of getting it right – or at least getting it wrong less often and in less wide-ranging ways. But the challenge of spelling place names correctly is faced on a daily basis in call centres when they have to enter details as spoken to them by people from all over the country or all over the world.

It illustrates why an on-the-fly validation system for addresses is essential to good-quality data. With an immediate validation system, if a match cannot be found, a dialogue is possible with the customer to get the correct spelling and to make the match.
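As a rough illustration of what such on-the-fly validation might look like, the sketch below compares whatever has been typed against a list of known place names and offers the closest candidates, so that the operator can confirm the spelling with the customer there and then. It uses Python's standard `difflib` module. The `GAZETTEER` list, the `normalise` and `suggest` helpers and the 0.6 cutoff are all assumptions made for the example; a real system would work from an authoritative address file and a more refined matching method.

```python
import difflib

# Hypothetical gazetteer for illustration only; a real system would load
# an authoritative place-name or address reference file.
GAZETTEER = ["Betws-y-Coed", "Caerdydd", "Cardiff", "Llanfairpwll",
             "Den Haag", "Amsterdam"]


def normalise(name):
    """Lower-case the name and drop everything except letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def suggest(entered, cutoff=0.6, limit=3):
    """Return gazetteer names whose normalised form resembles the input.

    The cutoff (0 to 1) is an assumed similarity threshold, not a tuned value.
    """
    keys = {}
    for place in GAZETTEER:
        keys.setdefault(normalise(place), place)
    hits = difflib.get_close_matches(normalise(entered), list(keys),
                                     n=limit, cutoff=cutoff)
    return [keys[h] for h in hits]


if __name__ == "__main__":
    # An empty list is the cue to ask the customer to spell the name out.
    for typed in ["Betsycoyd", "Betws y Coed", "Xqzzv"]:
        print(typed, "->", suggest(typed))
```

The point of the design is the empty result: when nothing in the reference list is close enough, the customer is still on the line and can be asked.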

Validation after the event, when a dialogue with the customer is no longer possible, cannot be as successful. The city of The Hague in the Netherlands not only has many foreign names (L’Aia, La Haye, La Haya…) but also two acceptable local names – ’s-Gravenhage and Den Haag. I analysed about 1 million address records from 75 different data files for this city and found that the name had been written in 57 different ways, 53 of them incorrect, and that the incorrect versions accounted for over 50% of all occurrences.
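A count of this kind can be sketched in a few lines of Python. The example below is not the analysis described above, only a minimal illustration of the idea: tally every distinct form found in a city field and work out what share of the records uses a form that is not on an accepted list. The `ACCEPTED` set, the `variant_report` function and the sample values are invented for the example, and which forms count as correct is itself an assumption.

```python
from collections import Counter

# Assumption: the forms treated as correct for this illustration.
ACCEPTED = {"Den Haag", "'s-Gravenhage"}


def variant_report(city_values, accepted=ACCEPTED):
    """Summarise how many distinct spellings occur and how many are wrong."""
    # Only surrounding whitespace is ignored; case and punctuation
    # differences count as separate forms.
    counts = Counter(value.strip() for value in city_values)
    total = sum(counts.values())
    wrong_records = sum(n for form, n in counts.items()
                        if form not in accepted)
    return {
        "distinct_forms": len(counts),
        "incorrect_forms": sum(1 for form in counts if form not in accepted),
        "incorrect_share": wrong_records / total if total else 0.0,
    }


if __name__ == "__main__":
    # Made-up sample values, not taken from any real data file.
    sample = ["Den Haag", "DEN HAAG", "The Hague",
              "'s-Gravenhage", "Den Haag ", "s Gravenhage"]
    print(variant_report(sample))
```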

Huge issue for the accuracy and quality of data

Even a city as easily spelt as Amsterdam occurred in the files in 39 different forms. The nub is, though, that all of these files had been cleansed and validated after collection, so these occurrences had all been missed – a huge issue for the accuracy and quality of the data. Validation at the collection stage would have reduced this variation significantly.

I wonder how the Betws-y-Coed issue compares with the experience of those searching for Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch.

  • Dylan Jones

    AT LAST – A blog post about Betws-y-Coed! Good on you Graham.

    Growing up in Wales, I used to laugh when I heard my English relatives trying to pronounce the names of local villages and famous landmarks nearby, like Betws-y-Coed. I can imagine how many challenges it causes for data quality tools trying to match “Cardiff” with “Caerdydd” and “Llanfair PG” with “Llanfairpwll”.

    The problem is that English people in Wales will often use the Anglicised name of a village whereas native Welsh speakers (the majority in many areas) are staunchly proud of their heritage and will use Welsh for the entire address.

    Great example of how local knowledge is critical for data quality. And as your friend rightly points out, it’s easy to pronounce any Welsh name, if you know how to pronounce Welsh names!

  • James Williams

    Ironically for a place name which causes so much trouble for data stewards, Betws-y-Coed is an anagram of “Cowed Bytes”…