Sinhala Spell Checker for Firefox

Posted in Hacks, Open Source, Sinhala by Sandaruwan Gunathilake on August 29th, 2009

Update: I’ve changed the addon’s word list to the “UCSC/LTRL Sinhala Corpus Beta”, which is much more accurate. The updated version is on the addons site. I’ll combine both word lists in the next version.

The Sinhala language has been used on computers for a long time. In the beginning, this was done with simple ASCII fonts that replaced the English glyphs with Sinhala letters. Sinhala Unicode came into play around 2004. Around that time, we built a search engine that converted ASCII text to Unicode so that it could search Sinhala text written in any font. Actually, that’s how Paradox Software started. There were a few roadblocks with rendering Unicode fonts, and people weren’t exactly using them. However, those problems have been solved in newer releases, and Sinhala Unicode is used extensively today.

You might have seen that there is an English-to-Sinhala dictionary developed by the UCSC Language Lab. It’s released under the GPL as a Firefox addon. Today, I extracted the words from the addon’s database and built a spell checker for Firefox.

The following Python code extracts the words from the SQLite database:

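#!/usr/bin/env python3
import re
import sqlite3

# Open the dictionary database shipped with the addon.
conn = sqlite3.connect('en-si.db')
c = conn.cursor()
c.execute("select * from dict")
# Write one word per line, encoded as UTF-8.
with open("words", "w", encoding="utf-8") as out:
  for row in c:
    # Split the second column on spaces and "|" to get individual words.
    words = re.split(r"[ |]", row[1])
    for i in words:
      out.write(i + "\n")
conn.close()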

After that, simply running “cat words | uniq | sort > words.sorted” produced a sorted list of unique words. The “affixcompress” tool that comes with hunspell generated the affix rules file, and I added some rules of my own to handle common mistakes.
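Note that uniq only collapses adjacent duplicate lines, so it needs sorted input to remove every duplicate. As a rough sketch, the same deduplicate-and-sort step could also be done in Python. The function name unique_sorted_words and the sample words below are made up for illustration and are not part of the addon's build. It also sorts by Unicode code point, which may differ from the locale-aware order the sort command uses.

```python
def unique_sorted_words(lines):
    """Return the distinct non-empty words from lines, sorted by code point."""
    # A set drops every duplicate, whether or not the copies are adjacent.
    words = {line.strip() for line in lines}
    # Discard the blank entry that empty or whitespace-only lines produce.
    words.discard("")
    return sorted(words)


if __name__ == "__main__":
    # Hypothetical sample input; the real input would be the "words" file.
    sample = ["සිංහල", "අකුරු", "සිංහල", "", "අකුරු", "වචන"]
    for word in unique_sorted_words(sample):
        print(word)
```

To run it on the extracted list, the lines of the “words” file (opened with encoding="utf-8") can be passed in place of the sample list.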

[Screenshots: Sinhala Spell Checker Screenshot 1, Sinhala Spell Checker Screenshot 2]

Install the addon from here. Once it is installed, you can right-click a text box, enable spell checking, and select Sinhala as the language.

(Don’t ask me how සොක්කා got recommended for මඤ්ඤොක්කා)

Download the final source here

11 Responses to “Sinhala Spell Checker for Firefox”

  1. ප්‍රවීන් ඉන්ද්‍රනාම Says:

    Excellent! The need for something like this is timely and immense.
    Once again, my respects for creating this.

    August 29th, 2009 at 5:31 pm

  2. සමින්ද Says:

    Hi,

    I was about to create this addon when I came across your addon. It seems to be nicely done. Great work machan. Keep it up.

    Saminda

    August 29th, 2009 at 8:43 pm

  3. Lahiru Says:

    Great work macho!!! :)

    August 29th, 2009 at 8:48 pm

  4. Kapila Withanage Says:

    Wonderful… many thanks… a much-needed tool

    August 29th, 2009 at 10:20 pm

  5. -බින්කු- Says:

    A very valuable piece of work. Thank you for bringing something like this to us.

    May you have the strength and wisdom to do more and more good work like this.

    August 30th, 2009 at 2:41 am

  6. Asanka Wasala Says:

    Good Work.

    hope you still remember me :)

    Good news,

    Three of us from LTRL/UCSC (& UL) just finished writing an open source Sinhala speller! Maybe you could use our algorithm to improve the accuracy :) Give us a couple more days to release it to the public!

    August 30th, 2009 at 5:11 am

  7. Sandaruwan Gunathilake Says:

    Hey Asanka,

    Yes, of course, I remember you :)

    Nice to hear that you guys are developing a spell checking algorithm. I simply created rules to check for additions using “ispilli/papili” and replacements using “ණ/න/ළ/ල”, etc.

    ~ Sandaruwan

    August 30th, 2009 at 10:56 am

  8. Caolán McNamara Says:

    There is another (very short) list of Sinhala words at http://sinhala.cvs.sourceforge.net/viewvc/*checkout*/sinhala/sinhala/spell/aspell/si.wl I see that at least some of them aren’t covered by the current extension. Maybe it’s worth folding those words in as well.

    November 27th, 2009 at 2:57 pm

  9. Sandaruwan Gunathilake Says:

    Caolán,

    Most of these words seem to be names. I’m not sure whether those are relevant. The other words seem to be there already.

    ~ Sandaruwan

    November 27th, 2009 at 3:15 pm

  10. cds Says:

    Awesome stuff!

    December 23rd, 2009 at 9:12 am

  11. Andras Says:

    Just a remark. Instead of

    “cat words | uniq | sort > words.sorted”

    you should first sort and then uniq, or else the words won’t be unique.
    Both steps can also be combined, like:

    “sort -u words > words.sorted”

    Regards :)

    February 7th, 2010 at 12:17 pm
