February 28, 2014

No Transliterations

I’ve received several requests to add transliterations to this website.

A transliteration is simply a way of writing Hindi words using the English alphabet, e.g. “aap kaise hain?”.

I entertained these requests very seriously, but I have finally concluded that I will not add transliterations.

Now, I reserve the right to change my mind in the future, of course. I take the suggestions and comments of people who use this site seriously, and I appreciate feedback. I spent a lot of time experimenting with code for performing automatic transliteration, and I thought about these issues a lot.  I would like to explain my rationale for the sake of those people who are interested in transliteration.

1. Automatic Transliteration is Difficult

There are thousands of words on this site written in Devanagari. Manually updating every word to include a parallel transliteration is totally impractical. So, the only way to add transliterations is to use automatic transliteration. In other words, I would have to write software that transliterates the Devanagari text on my website.

There are three general approaches to automatic transliteration:

  • Use a unique symbol for each Devanagari symbol
  • Use a rule-based algorithm
  • Use a statistical algorithm

I’ll explain why none of these is really a good option.

Unique Symbols

The simplest approach is to assign a unique symbol to each Devanagari letter. Some transliteration schemes, like ITRANS, do this. ITRANS has a different purpose, though: it is intended to encode Devanagari, not necessarily to be highly legible. Consider an example: the city “चंडीगढ़” is typically written “Chandigarh”. In ITRANS, this is written “cha.nDiiga.Dh”, which is not easily legible. It is also aesthetically displeasing; the mixture of capital and lowercase letters with intervening symbols like “.” makes it look odd. This is not meant as a criticism of ITRANS; its goals are different. If I add transliterations to the site, the purpose is to add legible text for people who can’t read Devanagari.

Now, I could simply choose different symbols. However, this would still produce confusing transliterations, due to the phenomenon of “schwa deletion” in Hindi, among other factors. For instance, “करना” would be transliterated as “karana”, not as “karna”.
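To make the problem concrete, here is a minimal Python sketch of a symbol-by-symbol scheme. It is only an illustration: the function name and the tiny letter tables are assumptions made for this example, and a real scheme would have to cover the whole script. Each consonant receives its default vowel “a” unless a vowel sign or a virama follows it.

```python
# Illustrative tables only (an assumption for this sketch); a real scheme
# would need every consonant, vowel, and vowel sign.
CONSONANTS = {"क": "k", "र": "r", "न": "n", "म": "m", "ल": "l"}
MATRAS = {"ा": "ā", "ि": "i", "ी": "ī", "े": "ē", "ो": "ō"}
VIRAMA = "्"  # marks a consonant that has no vowel


def naive_transliterate(word):
    """Map each symbol independently, always writing the default vowel."""
    out = []
    for i, ch in enumerate(word):
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            nxt = word[i + 1] if i + 1 < len(word) else ""
            # No vowel sign and no virama follows: write the default vowel.
            if nxt not in MATRAS and nxt != VIRAMA:
                out.append("a")
        elif ch in MATRAS:
            out.append(MATRAS[ch])
        elif ch != VIRAMA:
            out.append(ch)  # pass anything unknown through unchanged
    return "".join(out)


print(naive_transliterate("करना"))  # karanā
```

The output keeps the middle “a” that nobody pronounces, which is exactly the kind of result that confuses a reader who expects “karna”.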

Rule-Based Approach

The next approach is to use a rule-based algorithm. I wrote such an algorithm, and it actually works pretty well. Here is a small sample:

हालांकि यह निराशा की बात तो है लेकिन आश्चर्य की बात नहीं कि इतने कम विद्यालय हिंदी सिखाते हैं. अंग्रेजी के अलावा अमेरिका में सबसे अधिक बोली जाने वाली भाषा स्पेनिश है, और फ्रेंच भाषा ने अंग्रेजी भाषा पर बहुत प्रभाव डाला है, इस लिए आश्चर्य की बात नहीं है कि ये दो भाषाएं सबसे अधिक सिखाई जाने वाली भाषाएं हैं. अमेरिकन जनता यूरोपीय भाषाओं से जितना परिचित है हिंदी से उतना नहीं.

[hālānki yah nirāshā kī bāt tō hai lēkin āshchary kī bāt nahīn ki itnē kam vidyālay hindī sikhātē hain. angrējī kē alāvā amērikā mēn sabsē adhik bōlī jānē vālī bhāshā spēnish hai, aur frēnch bhāshā nē angrējī bhāshā par bahut prabhāv ḍālā hai, is liē āshchary kī bāt nahīn hai ki yē dō bhāshāēn sabsē adhik sikhāī jānē vālī bhāshāēn hain. amērikan jantā yūrōpīy bhāshāōn sē jitnā parichit hai hindī sē utnā nahīn]

However, I don’t want to use it for several reasons.

First, such algorithms are fairly complex, which means they are difficult to develop and maintain. Here are some of the tricky details that the algorithm has to deal with:

Schwa Deletion

For instance, the algorithm must determine when to add the default vowel (अ), and when to suppress it. In the word “करना”, people generally do not transliterate the middle vowel, because it is not pronounced. Thus, most people write “karna”. There is a simple rule that can predict this: since there are vowels on both sides, this vowel is suppressed.

However, this rule fails on a word like अर्थव्यवस्था, which is generally transliterated “arthvyavastha”. The rule will suppress the second “a”: “arthvyavstha”. We could modify the rule so that it doesn’t suppress vowels followed by a conjunct. However, this will again produce an undesirable transliteration: “arthavyavastha”. The rule cannot possibly recognize the fact that अर्थव्यवस्था is a compound word, and thus is an exception to the normal rule.
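The following Python sketch shows what such a rule might look like. It is not the algorithm discussed in this post; the function names and the small letter tables are assumptions made for illustration. It implements the conjunct-aware variant: a default vowel is dropped only when the syllables immediately before and after it both carry a vowel. It also drops a word-final default vowel, which is a separate, standard assumption about Hindi pronunciation.

```python
# Illustrative tables only (an assumption for this sketch).
CONSONANTS = {"क": "k", "र": "r", "न": "n", "म": "m",
              "थ": "th", "व": "v", "य": "y", "स": "s"}
VOWELS = {"अ": "a", "आ": "ā"}
MATRAS = {"ा": "ā", "ि": "i", "ी": "ī"}
VIRAMA = "्"


def syllabify(word):
    """Split a word into [consonant, vowel, is_default_vowel] units."""
    units = []
    for ch in word:
        if ch in CONSONANTS:
            units.append([CONSONANTS[ch], "a", True])  # default vowel for now
        elif ch in VOWELS:
            units.append(["", VOWELS[ch], False])
        elif ch in MATRAS and units:
            units[-1][1:] = [MATRAS[ch], False]  # explicit vowel sign
        elif ch == VIRAMA and units:
            units[-1][1:] = ["", False]  # first half of a conjunct
        else:
            raise ValueError(f"letter not in the sketch tables: {ch!r}")
    return units


def transliterate(word):
    units = syllabify(word)
    last = len(units) - 1
    for i, unit in enumerate(units):
        if not unit[2]:
            continue  # only default vowels can be suppressed
        if i == last and i > 0:
            unit[1] = ""  # word-final default vowel is silent
        elif 0 < i < last and units[i - 1][1] and units[i + 1][1]:
            unit[1] = ""  # vowels on both sides: suppress
    return "".join(c + v for c, v, _ in units)


print(transliterate("करना"))          # karnā
print(transliterate("अर्थव्यवस्था"))  # arthavyavasthā
```

The first word comes out as expected. The second keeps the “a” after “th”, because the rule has no way of knowing that the word is a compound.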

Mismatch of Phonemes

As one example, consider the transliteration of “व”. This is a phoneme in Hindi. In other words, it represents a set of sounds (“w” and “v”), not a single sound. Hindi speakers don’t differentiate these sounds, so it doesn’t matter. But the English alphabet does differentiate them, so what do we do? Defining rules for this case is very tricky too. For instance, we want to write “wala” for “वाला”, but “vah” for “वह”.

In Hindi, there are four sounds that the English alphabet represents with “t”. Likewise, there are four sounds that the English alphabet represents with “d”. My original idea was to use diacritical marks like dots and lines to distinguish these letters. Thus, “ḍ” for “ड” but “d” for “द”. I would use “h” to represent aspiration, e.g. “ḍh” for “ढ”. However, this is not natural (Hindi speakers don’t actually write transliterations this way), and it requires the reader to be familiar with my fairly arbitrary conventions.
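A short Python sketch shows what is lost without the diacritics. The table below follows the dot-and-“h” convention described above; the function names are assumptions made for this example. Once the dots are removed, each plain spelling stands for two different letters.

```python
import unicodedata

# Dental letters first, then retroflex letters (marked with a dot below).
T_AND_D = {
    "त": "t", "थ": "th", "द": "d", "ध": "dh",
    "ट": "ṭ", "ठ": "ṭh", "ड": "ḍ", "ढ": "ḍh",
}


def strip_diacritics(text):
    """Remove combining marks, e.g. the dot below in 'ṭ'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def collisions(table):
    """Group the Devanagari letters by their plain-ASCII spelling."""
    groups = {}
    for letter, roman in table.items():
        groups.setdefault(strip_diacritics(roman), []).append(letter)
    return groups


for plain, letters in collisions(T_AND_D).items():
    print(plain, letters)
```

Eight letters collapse into four spellings (“t”, “th”, “d”, “dh”), so a reader of the plain version cannot tell a dental consonant from a retroflex one.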

Statistical Algorithms

The next type of algorithm uses statistical methods, such as “conditional random fields” or “maximum entropy models”, to derive a transliteration scheme from a large parallel corpus of Hindi words and their respective transliterations. Once a model is constructed from the corpus, an algorithm can select the transliteration that has the highest probability of being correct based on certain features of the Hindi word. This is impractical for my purposes, since I don’t have such a corpus, and this is far too much effort; I’m not trying to earn a PhD in natural language processing or artificial intelligence, folks!
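To show why the corpus matters so much, here is a deliberately crude Python sketch. It is not a conditional random field or a maximum entropy model; it is only a frequency lookup that picks the most common transliteration seen for each word. The corpus is made up for this example, and the function names are assumptions.

```python
from collections import Counter, defaultdict

# A made-up toy corpus of (Hindi word, observed transliteration) pairs.
TOY_CORPUS = [
    ("करना", "karna"), ("करना", "karna"), ("करना", "karana"),
    ("वाला", "wala"), ("वाला", "vala"), ("वाला", "wala"),
]


def train(pairs):
    """Count how often each transliteration was observed for each word."""
    counts = defaultdict(Counter)
    for hindi, roman in pairs:
        counts[hindi][roman] += 1
    return counts


def best_transliteration(model, word):
    """Return the most frequent transliteration, or None if unseen."""
    if word not in model:
        return None
    return model[word].most_common(1)[0][0]


model = train(TOY_CORPUS)
print(best_transliteration(model, "करना"))  # karna
print(best_transliteration(model, "वह"))    # None
```

A lookup like this can say nothing about a word it has never seen. Real statistical models get around that by learning from features of the letters rather than from whole words, and that is where the large corpus and the effort come in.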

Parsing Page Content

The second aspect of automatic transliteration is where to add the transliterations. Here are some options:

  • If the user hovers the mouse over a word, show its transliteration
  • Try to append transliterations of sentences next to the sentences
  • Add an option to transliterate the entire page, replacing Devanagari text

I don’t really like any of these options.

Many users probably won’t realize that a mouseover feature exists, and it doesn’t allow a person to read more than one word at a time.

It is very difficult to delimit sentences that are interspersed with arbitrary HTML content. I would run the risk of corrupting pages if I modify them to include transliterations.
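A small Python sketch illustrates the difficulty, using the standard library’s html.parser module. The class and function names and the sample markup are assumptions made for this example. A single sentence with one bold word arrives as three separate pieces of text.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collect the non-empty text nodes of an HTML fragment, in order."""

    def __init__(self):
        super().__init__()
        self.nodes = []

    def handle_data(self, data):
        if data.strip():
            self.nodes.append(data)


def text_nodes(html):
    parser = TextNodeCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


# One sentence, but the <b> tag splits it into three text nodes.
print(text_nodes("<p>यह <b>बात</b> है.</p>"))
```

To append a transliteration after the sentence, the code would first have to stitch these pieces back together and then work out where in the markup the new text can safely go.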

I could transliterate the entire page, but then the user won’t be able to read Devanagari text and transliterations in parallel.

2. Automatic Transliteration Defeats its Own Purpose

What exactly is the goal of an automatic transliteration? Well, I assume that the goal is to enable people who can’t read Devanagari to read and pronounce Hindi text. However, no matter how good the transliteration scheme is, the reader will have to become familiar with Hindi before it makes any sense. How would someone know that “h” represents aspiration and “d” is a dental consonant in “dh”, unless they first learn about Hindi phonetics anyway? If the reader is going to learn about Hindi consonants and vowels, why not just learn Devanagari?

Now, learning how to read transliterated Hindi is a very good thing to do, since most Hindi speakers use it when typing (and even writing) Hindi. But it is only really legible if you are already familiar with the language.

3. Devanagari is a Nearly Ideal Transcription of the Sounds of Hindi

The Devanagari script represents the sounds of the Hindi language almost perfectly. Each letter represents a distinct sound, with few exceptions. People who are learning Hindi actually have a huge advantage. They can “sound out” almost any word that they read. Contrast this with learning how to pronounce English words!

4. I Want to Encourage People to Learn Devanagari

By only writing in Devanagari, I encourage people to learn it. I have provided some good resources for learning Devanagari on this site. I encourage people to learn it first.

Now, I realize that different people have different goals. A person might want to learn a few phrases without investing the effort to learn Devanagari. However, I cannot accommodate everyone on my site.

5. There Are Tools Like “Google Translate”

Google Translate will produce transliterations and back-transliterations. You can use it to hear the pronunciation of words too. People can use this tool while learning Hindi.

6. Automatic Transliterations Won’t Be Natural

Automatic transliterations (except those produced by a statistical algorithm) are not “natural”, i.e. Hindi speakers don’t actually write that way.

In summary, I have seriously entertained the idea of adding transliterations, but I ultimately decided that I will not add them.