Pertanika Journal

Factored Statistical Machine Translation System for English to Tamil Language

Anand Kumar, M., Dhanalakshmi, V., Soman, K. P. and Rajendran, S.

Pertanika Journal of Social Science and Humanities, Volume 22, Issue 4, December 2014

Keywords: Statistical machine translation, preprocessing, English-Tamil machine translation, linguistic tools, morphologically rich language

Abstract

This paper proposes a morphology-based factored Statistical Machine Translation (SMT) system for translating English sentences into Tamil. Automatic translation from English into morphologically rich languages like Tamil is a challenging task. Translating into such languages requires extensive morphological pre-processing before SMT training so that the source language becomes structurally similar to the target language. English and Tamil have disparate morphological and syntactic structures. Because Tamil is so morphologically rich, simple lexical mapping alone cannot retrieve and map all the morpho-syntactic information in the English sentences. The main objective of this work is to develop an English-to-Tamil machine translation system using a novel pre-processing methodology, which restructures the English sentences to match Tamil. The pre-processed sentences are then used to train the factored SMT models. Finally, a Tamil morphological generator produces new surface word forms from the output factors of the SMT system. Experiments are conducted with nine different types of models, which are trained, tuned and tested using general-domain corpora and the developed linguistic tools. These models are different combinations of the developed pre-processing tools with baseline and factored models, and their accuracies are evaluated using the well-known evaluation metrics BLEU and METEOR. In addition, the accuracies are compared with those of the existing online "Google Translate" machine translation system. Results show that the proposed method significantly outperforms the other models and the existing system.
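
As a minimal sketch of what a factored representation looks like, the snippet below joins several annotation layers of each token with a pipe separator, the convention used for factored models in the Moses toolkit. The factor set (surface form, lemma, part of speech, morphological tag), the class name `FactoredToken` and the sample annotations are assumptions made for illustration; the abstract does not state which factors or which toolkit the authors used.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class FactoredToken:
    """One token with hypothetical annotation layers (not the authors' factor set)."""
    surface: str
    lemma: str
    pos: str
    morph: str

    def to_factored(self, sep: str = "|") -> str:
        # Join all factors of a single token with the separator.
        return sep.join((self.surface, self.lemma, self.pos, self.morph))


def format_sentence(tokens: Iterable[FactoredToken]) -> str:
    # Tokens are space-separated; factors inside a token are pipe-separated.
    return " ".join(token.to_factored() for token in tokens)


# Illustrative annotations only; a real system would obtain these from
# linguistic tools such as a tagger and a morphological analyser.
sentence = [
    FactoredToken("she", "she", "PRP", "3sg"),
    FactoredToken("sings", "sing", "VBZ", "pres.3sg"),
]
factored_line = format_sentence(sentence)
print(factored_line)
```

Keeping the lemma and morphological tag as separate factors is what would allow a morphological generator to rebuild a surface word form on the target side, as the abstract describes for Tamil.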

ISSN 0128-7702

e-ISSN 2231-8534

Article ID

JSSH-0891-2013