Tuesday, May 19, 2009

Panini to the rescue for a computational grammarian

Research team turns to the “world’s first computational grammarian”.
http://www.thehindubusinessline.com/ew/2009/05/18/images/2009051850090301.jpg
K.V. Kurmanath
Panini, the legendary Sanskrit grammarian of the 5th century BC, is the world’s first computational grammarian! Panini’s work, Ashtadhyayi (the Eight-Chaptered book), is considered to be the most comprehensive scientific grammar ever written for any language.
According to Prof Rajeev Sangal, Director of IIIT (Hyderabad) and an expert on language computation, Panini’s epic treatise on grammar came to the rescue of language experts in making English unambiguous. For machine translation, English is the more difficult language because of its high degree of ambiguity.
Some words have several meanings, which makes the analysis needed for translation a difficult process. Disambiguating them is quite a task, and this is where Panini’s principles might be of use.
Ashtadhyayi, one of the earliest works on descriptive linguistics, consists of 3,959 sutras (or principles). These highly systematised and technical principles, some say, marked the rise of classical Sanskrit.
Sampark, the multi-institute effort launched to produce a translation engine, enabling users to translate texts from English to various languages, will use some of the technical aspects enunciated by Panini. “We looked at alternatives before choosing Panini,” Prof Sangal says.
Incidentally, Prof Sangal co-authored a book, Natural Language Processing – A Panini Perspective, a few years ago.
Beyond the technical side, Panini would also be of great help on the language side to the researchers building the translation engine.
A good number of words in almost all the Indian languages originate from Sanskrit. “That is great because Indian languages are related to each other,” Prof Sangal points out.
kurmanath@thehindu.co.in
http://www.thehindubusinessline.com/ew/2009/05/18/stories/2009051850090300.htm
Break the language barrier
Word for word: More on the Sampark initiative to enrich translation.
Look at the sentence — The Chair chairs the meeting. How will a machine understand this?
http://www.thehindubusinessline.com/bline/ew/2009/05/18/images/2009051850070301.jpg
K.V. Kurmanath
Telugus, Kannadigas and Malayalis can read Subrahmanya Bharati, the legendary Tamil poet, and relish the sweetness in his poetry. Similarly, Premchand, Tagore, M T Vasudevan Nair and U R Ananthamurthy can be read and understood by readers of other languages.
All this will soon be a reality, thanks to a project initiated by IIIT (Hyderabad) and eight other universities and institutes. To be precise, the beta translation solutions of a few languages will go live next month (June 2009).
The project, whose public Internet interface will be known as Sampark, will let users translate texts among various Indian languages. All one needs to do is copy-paste the text in an appointed box and press ‘enter’, and get the translated version in another box beside it. Not just text, you can translate the whole of a Web page. Copy the URL (a site’s Web address) and paste it in the relevant box in Sampark’s Web site. “You will get the translated page, with photos and other images intact,” says Prof Rajeev Sangal, Director of IIIT (Hyderabad), who is leading the team.
The nine institutions have roped in over 120 computer engineers, language experts and translators to take up the ‘machine translation’ programme, which is aimed at breaking the language barrier.
The project is broadly divided into two areas: translation of the four Southern languages into Hindi (and vice versa), and translation of Bangla, Punjabi, Marathi and Urdu into Hindi (and back). Simultaneously, the consortium is working on direct translations between Telugu and Tamil, and between Malayalam and Tamil. To begin with, the consortium has put beta versions of two ‘systems’, Punjabi-Hindi and Urdu-Hindi, live. “By the end of June 2009, we will be adding Tamil-Hindi, Marathi-Hindi and Telugu-Hindi to the project,” Prof Sangal says.
How it works
Broadly, the machine translation happens in three phases: source-side analysis, transfer, and target-side generation. The two important factors in translation are grammar and dictionary. “Languages have many exceptions and idiosyncrasies. These will be addressed effectively,” Prof Sangal says.
On the source side (the text you want to translate), the machine analyses the text sentence by sentence and keeps a representation of it. The analysis includes morphological analysis, that is, how words are formed. It checks whether the text carries any local phrases. It identifies nouns and other parts of speech before going on to sentence analysis.
In the second phase (transfer phase), the machine does lexical and grammar transfer. “The grammars of source and target languages may not be similar. This phase would see change of grammatical structure. The later phase would involve target language generation.”
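The three phases described above can be pictured with a toy sketch. Everything in it is an assumption made for illustration: the tiny lexicon, the romanised target words, the part-of-speech labels and the single verb-final reordering rule are invented here and are not taken from Sampark.

```python
# Toy analyse -> transfer -> generate pipeline.
# All data and rules below are hypothetical, chosen only to show the flow.

# Source-side lexicon: surface word -> (root, part of speech)
SOURCE_LEXICON = {
    "ram": ("ram", "NOUN"),
    "eats": ("eat", "VERB"),
    "mango": ("mango", "NOUN"),
}

# Bilingual dictionary used in the transfer phase: source root -> target root
TRANSFER_DICT = {"ram": "raam", "eat": "khaa", "mango": "aam"}

# Target-side generation table: target root -> inflected surface form
TARGET_FORMS = {"khaa": "khaataa hai"}


def analyse(sentence):
    """Source side: split the sentence and attach a root and a tag to each word."""
    tokens = []
    for word in sentence.lower().split():
        root, pos = SOURCE_LEXICON.get(word, (word, "UNK"))
        tokens.append({"root": root, "pos": pos})
    return tokens


def transfer(tokens):
    """Transfer: look up target roots, then move verbs to the end of the sentence."""
    mapped = [
        {"root": TRANSFER_DICT.get(t["root"], t["root"]), "pos": t["pos"]}
        for t in tokens
    ]
    non_verbs = [t for t in mapped if t["pos"] != "VERB"]
    verbs = [t for t in mapped if t["pos"] == "VERB"]
    return non_verbs + verbs


def generate(tokens):
    """Target side: produce surface forms and join them into a sentence."""
    return " ".join(TARGET_FORMS.get(t["root"], t["root"]) for t in tokens)


def translate(sentence):
    return generate(transfer(analyse(sentence)))


print(translate("Ram eats mango"))
```

Keeping the phases as separate functions mirrors the article's point that grammar changes happen in the transfer step, while word formation is handled at the two ends.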
Common architecture
The step-by-step process is done on a common architecture. This allows for addition of a new language to the project quite easily. “If you want to add Kashmiri, you need to develop an analyser, generator and add a Kashmiri-Hindi dictionary. These, in fact, are parallel dictionaries,” Prof Sangal says.
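A rough sketch of what such a plug-in design could look like follows. The class and method names (`LanguageModule`, `TranslationHub`, `add_language`, `add_dictionary`) are hypothetical, the dictionary entries are placeholders rather than real vocabulary, and storing each dictionary in both directions is only one possible reading of "parallel dictionaries".

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class LanguageModule:
    """Per-language pieces: an analyser and a generator (hypothetical interface)."""
    analyse: Callable[[str], List[str]]
    generate: Callable[[List[str]], str]


class TranslationHub:
    """Shared pipeline; languages and dictionaries are registered at run time."""

    def __init__(self) -> None:
        self.languages: Dict[str, LanguageModule] = {}
        self.dictionaries: Dict[Tuple[str, str], Dict[str, str]] = {}

    def add_language(self, code: str, module: LanguageModule) -> None:
        self.languages[code] = module

    def add_dictionary(self, source: str, target: str, entries: Dict[str, str]) -> None:
        # Assumption: one set of entries serves both translation directions.
        self.dictionaries[(source, target)] = dict(entries)
        self.dictionaries[(target, source)] = {v: k for k, v in entries.items()}

    def translate(self, text: str, source: str, target: str) -> str:
        roots = self.languages[source].analyse(text)
        table = self.dictionaries[(source, target)]
        mapped = [table.get(root, root) for root in roots]
        return self.languages[target].generate(mapped)


def whitespace_module() -> LanguageModule:
    """Stand-in analyser and generator that only split and join on spaces."""
    return LanguageModule(
        analyse=lambda text: text.split(),
        generate=lambda roots: " ".join(roots),
    )


hub = TranslationHub()
hub.add_language("hi", whitespace_module())

# Adding a new language: one module plus one dictionary, nothing else changes.
hub.add_language("ks", whitespace_module())
hub.add_dictionary("ks", "hi", {"ks_word_1": "hi_word_1", "ks_word_2": "hi_word_2"})

print(hub.translate("ks_word_2 ks_word_1", "ks", "hi"))
```

The shared `translate` method never mentions a specific language, which is what makes adding one a matter of registration rather than rewriting the pipeline.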
“The project, unlike earlier projects, hinges on dictionaries that give meanings based on concepts rather than just meanings,” Prof Uma Maheshwar Rao, who is working on the Telugu-Hindi aspect of the project, says.
Formed by the Union Ministry of Information Technology in 2006, the consortium comprises IITs (Kharagpur and Bombay), Anna University, C-DAC, University of Hyderabad, Tamil University, Jadavpur University, IISc (Bangalore) and IIIT (Allahabad).
Prof Rao, who works at the Centre of Applied Linguistics and Translation Studies at University of Hyderabad, says the Sampark project is more advanced than earlier attempts that sought to offer translation solutions.
Earlier efforts failed to interpret words in context. Citing the example of the word ‘bank’, he points out that they could not tell whether it meant a river bank or a bank that deals with money.
“In the present project, we cross-link words with all the synonyms in the other language. This will help resolve the ambiguity problem, the knottiest one in the translation process,” he explains.
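The ‘bank’ problem can be illustrated with a very small context-overlap heuristic. This is a swapped-in textbook-style technique, not the method the project uses, which the article does not spell out. The sense inventory, cue words and romanised Hindi equivalents below are assumptions made for the example.

```python
# Hypothetical concept dictionary: each sense carries context cues and a
# target-language word. All entries are illustrative only.
SENSES = {
    "bank": [
        {
            "concept": "river_edge",
            "cues": {"river", "water", "boat", "shore"},
            "hindi": "kinaara",
        },
        {
            "concept": "financial_institution",
            "cues": {"money", "loan", "account", "deposit"},
            "hindi": "baink",
        },
    ]
}


def pick_sense(word, sentence):
    """Return the sense whose cue words overlap most with the sentence.

    Ties (including no overlap at all) fall back to the first listed sense.
    """
    context = set(sentence.lower().replace(".", " ").split())
    return max(SENSES[word], key=lambda sense: len(sense["cues"] & context))


for text in ("He sat on the bank of the river.",
             "She opened an account at the bank."):
    sense = pick_sense("bank", text)
    print(text, "->", sense["concept"], "/", sense["hindi"])
```

Linking each concept to its own target-language word is what lets the choice of sense carry straight through to the translation.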
The immediate task of the consortium is to add more servers and more engineering to make the machine faster.
“We are going to add three languages to the system every two months till November,” Prof Sangal says.
He, however, admits that it is not a complete translator. But the beta versions will definitely give a flavour of the meaning in the source language. You can see improvements constantly, he adds.
Machines learn!
Prof Sangal says the machine can learn based on the data you give it. Look at the sentence — The Chair chairs the meeting. How will a machine understand this sentence? The one developed by the consortium, thanks to the conceptual dictionary, would look at the context and tell apart the meaning of the two chairs in the sentence. “Earlier, we used to give rules to the machine to follow. Now, we have algorithms to let the machines learn from this. We have combined artificial intelligence approach with the linguistic process,” he explains.
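As a minimal sketch of learning from data rather than from hand-written rules, the toy tagger below counts which tag each word took in a few labelled sentences and which tag tended to follow which, then uses those counts to tell the two ‘chairs’ apart. The training sentences, the tag names and the greedy scoring rule are all assumptions for illustration; the consortium's actual algorithms are not described in the article.

```python
from collections import Counter, defaultdict

# Hypothetical hand-labelled training data: (word, part-of-speech tag) pairs.
TRAINING = [
    [("the", "DET"), ("chair", "NOUN"), ("broke", "VERB")],
    [("she", "PRON"), ("chairs", "VERB"), ("the", "DET"), ("panel", "NOUN")],
    [("the", "DET"), ("chairs", "NOUN"), ("arrived", "VERB")],
    [("he", "PRON"), ("will", "AUX"), ("chair", "VERB"),
     ("the", "DET"), ("meeting", "NOUN")],
]


def train(sentences):
    """Count word->tag and previous-tag->tag frequencies."""
    emission = defaultdict(Counter)
    transition = defaultdict(Counter)
    for sentence in sentences:
        prev = "START"
        for word, tag in sentence:
            emission[word][tag] += 1
            transition[prev][tag] += 1
            prev = tag
    return emission, transition


def tag(sentence, emission, transition):
    """Greedy left-to-right tagging using the learned counts."""
    prev = "START"
    result = []
    for word in sentence.lower().rstrip(".").split():
        candidates = emission.get(word)
        if not candidates:
            best = "UNK"  # word never seen in training
        else:
            # Score = how often the word had this tag, weighted by how often
            # this tag followed the previous tag (+1 so unseen pairs still count).
            best = max(
                candidates,
                key=lambda t: candidates[t] * (transition[prev][t] + 1),
            )
        result.append((word, best))
        prev = best
    return result


emission, transition = train(TRAINING)
print(tag("The Chair chairs the meeting.", emission, transition))
```

Here ‘chair’ follows a determiner and is tagged as a noun, while ‘chairs’ follows a noun and is tagged as a verb, although both words appear with both tags in the training data.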
More to come
Busy finalising modules, the team members continue to set their eyes on long-term goals. “We will continue the long-term research independently and collaboratively. The next stage is to build more robust sentence analysers. They will be able to do translations more correctly. The quality of the output will go up,” he says. Prof Sangal, who has been working on machine translation for the last 25 years, says it is team work that helped the group to give a shape to the machine. “We discussed several issues physically and through mailing groups. We have set up sub groups to address specific issues.”
English to Indian languages
Simultaneously, a different consortium, in which IIIT-H is also a member, is working on translations from English to several Indian languages and back. C-DAC (Pune) is leading the consortium. The researchers take a different approach.
Contrary to popular belief, English is a difficult language for the machine to understand. “Unlike Indian languages, there is a high degree of ambiguity. When a machine analyses, it has to do disambiguation, which is a difficult process,” Prof Sangal says. The research team is almost ready with the English-Hindi version, which is in test mode. At a later stage, these two different projects could technically work in tandem and offer users a better translation experience.
kurmanath@thehindu.co.in
http://www.businessline.in/cgi-bin/print.pl?file=2009051850070300.htm&date=2009/05/18/&prd=ew&
