H-h-hey hey!

Hey people! Welcome to my blog, photo site, and who knows what else. I’m still getting set up, so expect more stuff to be added to the site in the future. I’m not sure what you’re supposed to write in a first blog post, so I’ll start off talking about TypeThai, which I hope to push out to public beta once I (a) finish re-designing the site, (b) get a couple more basic features in there, and (c) learn how to write an installer file.

For those who haven’t heard me ranting about it, TypeThai is a Thai input method (IME) that I am developing for OS X using Apple’s InputMethodKit framework, and (eventually) for other operating systems. The goal is for users to type a loose phonetic transcription of a Thai word and then select from a set of candidate words in Thai script. For example, you could type “sawatdii” or “sawasdii” and see “สวัสดี” as the output. In fact, I just typed that using the alpha version of the program!

1. First Approaches – Frameworks EVERYWHERE

The first big obstacle that I encountered while trying to get TypeThai off the ground was deciding which framework to use. Finite-state transducers (FSTs) have a few strong points in their favor: I have experience working with them for linguistics applications; they are good at mapping between two forms, for example “sawatdii” and “สวัสดี”; and they are generally pretty fast.

The problem (at least for me) was that I had no idea which FST framework to use. I started off using XFST because I was familiar with it from my graduate work, and because it is basically super well put together. It’s very easy to script with, and it’s fast, efficient, and awesome. There are lots of other choices too: SFST, FOMA, and HFST.

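XFST: It is very easy to use, and simple for linguists. Its rule syntax is clear and straightforward; a contextual replace rule has the general shape X -> Y || L _ R, where L and R are the left and right contexts.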

This rule basically says that whenever an X is input, then a Y is output if it occurs in the context you specified. If you leave out the context operator, then it just swaps them everywhere. Even better is that XFST lets you write out scripts and compile them so it’s easy to spot check errors and try alternate ways of doing things, and if you want to use a dictionary of words built up from their morphemes, that’s really easy to do with XFST’s LEXC syntax.
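As a rough illustration of the behaviour described above (and not XFST itself), here is a minimal Python sketch of a context-conditioned rewrite. The helper name `make_rule` is made up for this post, and regular-expression lookarounds only approximate how a real transducer applies a rule.

```python
import re


def make_rule(x, y, left="", right=""):
    """Build a function that rewrites x as y only between left and right.

    Hypothetical helper: with both contexts empty, x is replaced everywhere,
    which mirrors leaving the context out of the rule.
    """
    parts = []
    if left:
        # Lookbehind: the left context must precede x but is not consumed.
        parts.append(f"(?<={re.escape(left)})")
    parts.append(re.escape(x))
    if right:
        # Lookahead: the right context must follow x but is not consumed.
        parts.append(f"(?={re.escape(right)})")
    pattern = re.compile("".join(parts))
    # A function replacement keeps y literal (no backslash interpretation).
    return lambda text: pattern.sub(lambda match: y, text)


t_to_d_between_a = make_rule("t", "d", left="a", right="a")
t_to_d_everywhere = make_rule("t", "d")
```

Here `t_to_d_between_a("stata")` only touches the second “t”, while `t_to_d_everywhere("stata")` changes both.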

FOMA: My understanding of FOMA is that it is basically an attempt to open-source the XFST syntax. It can read files written in XFST’s syntax (with occasional errors) and perform a lot of the operations that XFST can, although some of the more esoteric of XFST’s operations are left out.

SFST: This allows for some pretty cool stuff! Firstly, this is the first framework in this list that allows weighted FSTs (WFSTs). Weighting means you can rank the outputs. Unfortunately, as far as I can tell, it is difficult to attach weights in a pre-written script, or to attach them to rules. So, for example, you can’t have a rule that says “change X to Y in this context and add 3 to the weight”. I’m not sure why this is, but it’s a darn shame. As far as I can tell, the only way to add weights is to find a specific arc in the transducer and then add weight to it using the C API. This is a real drawback, since it’s quite hard to know which arc you want out of a set of thousands. On top of that, if the transducer has already been minimized, the arc may have been merged with another arc that gives the same output but is not linked to the specific rule that you wanted to weight.
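To make the “rule plus weight” wish concrete, here is a small Python sketch of what ranking by weight could look like. It is not how any of these frameworks work internally: `WeightedRule` and `apply_rules` are hypothetical names, each rule is treated as optional, and I am assuming that a lower total weight means a better candidate.

```python
from typing import NamedTuple


class WeightedRule(NamedTuple):
    source: str    # substring to rewrite
    target: str    # what it becomes
    weight: float  # cost added whenever the rule fires (assumed: lower is better)


def apply_rules(text, rules):
    """Return (output, total_weight) pairs, cheapest first.

    Every rule is optional, so both the rewritten and the untouched form
    survive as candidates; only the cheapest path to each output is kept.
    """
    candidates = {text: 0.0}
    for rule in rules:
        updated = dict(candidates)
        for output, weight in candidates.items():
            if rule.source in output:
                rewritten = output.replace(rule.source, rule.target)
                total = weight + rule.weight
                if total < updated.get(rewritten, float("inf")):
                    updated[rewritten] = total
        candidates = updated
    return sorted(candidates.items(), key=lambda pair: pair[1])


ranked = apply_rules("sawatdii", [WeightedRule("t", "s", 3.0)])
```

With that single made-up rule, “sawatdii” stays at weight 0 and “sawasdii” is ranked below it at weight 3.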

Luckily SFST makes up for this with really powerful variables. In SFST you can define sets that match themselves in rules. Let’s say you want to convert lowercase vowels to uppercase vowels. In XFST this would take 5 rules (or 1 long, complex rule):

SFST lets you do this quickly and easily with range variables:

SFST does lots of other cool stuff, but its basic rule syntax is not as streamlined as XFST’s, so while you save time with the cool variables, you lose time writing extra rules for the basics.
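The vowel example can be mimicked in plain Python to show the difference in spirit. This is only an analogy, not XFST or SFST syntax: the first function spells out one replacement per vowel, and the second pairs two sets position by position, which is roughly what a range variable buys you.

```python
LOWER_VOWELS = "aeiou"
UPPER_VOWELS = "AEIOU"


def uppercase_vowels_one_by_one(text):
    """Five separate replacements, one per vowel."""
    for lower, upper in (("a", "A"), ("e", "E"), ("i", "I"), ("o", "O"), ("u", "U")):
        text = text.replace(lower, upper)
    return text


# Pair the two sets position by position: a->A, e->E, and so on.
VOWEL_TABLE = str.maketrans(LOWER_VOWELS, UPPER_VOWELS)


def uppercase_vowels_paired(text):
    """One mapping over the whole set, so each vowel matches its own partner."""
    return text.translate(VOWEL_TABLE)
```

Both functions turn “graham” into “grAhAm”; the second just says it once instead of five times.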

HFST: “Why can’t we all just get along?” is HFST’s motto. It brings together all of the above frameworks, leaves room for others to be added, and gets them to play (relatively) nicely together. With HFST you can use functions from each of the above languages, compile their various scripts, and convert between formats. It also supports WFSTs. That said, it forces you to break up your FST definition into components. I separated mine into the parts that worked best in each language and then used a makefile to crush them all together with HFST. Unfortunately, this meant I had to define all my variables and many rules twice to get the benefits of each scripting language.

2. Vowels

OK, back to TypeThai, and Thai language. It was easy enough to make a rough mapping of all the consonant and vowel combinations, but the next step was a little more difficult. In Thai, vowels can appear anywhere around the consonant. They can appear before, after, above, or below the initial consonant in a syllable. In the romanization of Thai, the vowels always occur after the initial consonant and before the final consonant.

As I mentioned earlier, FSTs are great at converting from one thing to another, but they do it in a very straightforward fashion. Imagine going through a word and transliterating it one letter at a time, without ever looking forward or backward, and once you’ve changed something you can’t change it again. Let’s say you want to write my name “Graham” in Thai. It is transliterated as เกรม (greem), so we start with the first letter “g” and change it to ก, then we move to the next letter “r” and change it to ร (giving us กร), then we hit the vowel “ee” and change it to เ, giving us กรเ. But that’s wrong: it should be เกร, not กรเ, and remember we can’t move backwards in our output…
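The walk-through above can be sketched in a few lines of Python. This is a toy, not TypeThai’s actual transducer: the mapping table only covers the letters in “greem”, and the function name is made up for illustration.

```python
# Toy mapping covering only the letters needed for "greem".
MAPPING = {"g": "ก", "r": "ร", "ee": "เ", "m": "ม"}


def transliterate_one_pass(roman):
    """Walk left to right, emit a Thai character per chunk, never go back."""
    output = []
    i = 0
    while i < len(roman):
        # Try the two-letter chunk first so "ee" is read as one vowel.
        for length in (2, 1):
            chunk = roman[i:i + length]
            if chunk in MAPPING:
                output.append(MAPPING[chunk])
                i += len(chunk)
                break
        else:
            raise ValueError(f"no mapping for {roman[i]!r}")
    return "".join(output)


naive = transliterate_one_pass("greem")
```

The result is กรเม, with the vowel stuck after the consonants, rather than the correct เกรม.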

This is where SFST’s complex rules come in very handy. We can define the set of vowels that need to appear before the consonant as a template with an empty space (represented by a hyphen, “-”). Then, we just make a rule that says that any consonant followed by a vowel with a hyphen should actually be output as that vowel followed by that consonant.

Here the equals sign ensures that the vowels and consonants match themselves when you swap. Without the equals sign, any VC combination could change to any other VC combination!

Not quite what we want…
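In regular-expression terms, the “match themselves” behaviour is a lot like a backreference. The sketch below is a Python analogy under a few assumptions of mine: a tiny made-up inventory of consonants and pre-posed vowels, a single initial consonant (no clusters), and the hyphen placeholder sitting right after the vowel.

```python
import re
from itertools import product

# Tiny illustrative inventories, not the full Thai sets.
CONSONANTS = "กรม"
PREPOSED_VOWELS = "เแโ"

# Group 1 is the consonant, group 2 the vowel; the hyphen is the placeholder.
SWAP = re.compile(f"([{CONSONANTS}])([{PREPOSED_VOWELS}])-")


def move_vowels(text):
    """Emit the same vowel and the same consonant, in swapped order."""
    return SWAP.sub(r"\2\1", text)


def unconstrained_outputs():
    """Without agreement, any vowel could pair with any consonant."""
    return {vowel + consonant for vowel, consonant in product(PREPOSED_VOWELS, CONSONANTS)}
```

`move_vowels("กเ-")` gives exactly เก, whereas the unconstrained version would allow all nine vowel and consonant pairings for that same input.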

3. Into the future!

I’ve got a lot of plans for what to do in the future, but can’t say for sure… The first thing is to finish off automatic reduplication. Reduplication is a process where a word or part of a word (e.g. a syllable) is repeated to add some grammatical meaning. In Thai, the character ‘ๆ’ is used to avoid re-writing reduplicated syllables in words like ‘สิ่งใดๆ’ (singdaidai - ‘anything’), where the syllable ‘dai’ is repeated.

Currently, you can type an upper-case ‘R’ to tell TypeThai that you want to use the reduplication character, but it would be easier for users if the program recognized reduplication automatically, so you can just type what you hear. I’m planning to handle this in a pre-processor using simple regular expressions, since that should be faster and easier than incorporating the functionality into the transducer.
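Here is a first rough sketch of what such a pre-processor might look like in Python. It is a guess at the approach rather than TypeThai’s code: the syllable shape (consonants, then vowels, then optional consonants) is a crude assumption, and it would wrongly flag any word that merely happens to repeat a syllable.

```python
import re

CONSONANT = "[b-df-hj-np-tv-z]"
VOWEL = "[aeiou]"

# A crude romanized syllable, immediately followed by an exact copy of itself.
REPEATED_SYLLABLE = re.compile(f"({CONSONANT}+{VOWEL}+{CONSONANT}*)\\1")


def mark_reduplication(roman):
    """Replace the second copy of a doubled syllable with the 'R' marker."""
    return REPEATED_SYLLABLE.sub(r"\1R", roman)


marked = mark_reduplication("singdaidai")
```

This turns “singdaidai” into “singdaiR” before it reaches the transducer, and leaves input like “sawatdii” alone because a doubled vowel on its own is not treated as a syllable.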

Thanks for reading this!! Hope it was interesting, and not too long. If you enjoyed it, come back later for more‽

