Statistical Machine Translation from Hindi to English using MOSES

Question

I need to build a Hindi-to-English translation system using MOSES. I have a parallel corpus of about 10000 Hindi sentences and their corresponding English translations. I followed the method described on the Baseline system creation page. But right at the first stage, when I wanted to tokenise my Hindi corpus and tried to execute

~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l hi < ~/corpus/training/hi-en.hi > ~/corpus/hi-en.tok.hi

, the tokeniser gave me the following output:

Tokenizer Version 1.1
Language: hi
Number of threads: 1
WARNING: No known abbreviations for language 'hi', attempting fall-back to English version...

I even tried 'hin', but it still didn't recognise the language. Can anyone tell me the correct way to build the translation system?

Answer 1

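Moses does not ship tokenization support for Hindi. tokenizer.perl relies on the nonbreaking_prefix.* files (see https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl#L516 ), and there is no such file for Hindi, which is why it prints the warning and falls back to the English one.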

The languages available with nonbreaking prefixes from Moses are:

ca: Catalan
cs: Czech
de: German
el: Greek
en: English
es: Spanish
fi: Finnish
fr: French
hu: Hungarian
is: Icelandic
it: Italian
lv: Latvian
nl: Dutch
pl: Polish
pt: Portuguese
ro: Romanian
ru: Russian
sk: Slovak
sl: Slovene
sv: Swedish
ta: Tamil

from https://github.com/moses-smt/mosesdecoder/tree/master/scripts/share/nonbreaking_prefixes
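
To make the warning easier to read, here is a rough Python sketch of the kind of lookup that produces it. It is not the actual Perl code: the function name resolve_prefix_file is hypothetical, and the only things taken from Moses are the nonbreaking_prefix.<lang> file naming and the fall-back to English announced in the warning.

```python
from pathlib import Path


def resolve_prefix_file(prefix_dir, lang, fallback="en"):
    """Pick the nonbreaking-prefix file for `lang`.

    Returns (path, used_fallback). Hypothetical helper that only
    imitates the behaviour described by the tokenizer's warning.
    """
    prefix_dir = Path(prefix_dir)
    wanted = prefix_dir / f"nonbreaking_prefix.{lang}"
    if wanted.is_file():
        return wanted, False
    # No file for this language code ('hi', 'hin', ...): use English.
    return prefix_dir / f"nonbreaking_prefix.{fallback}", True
```

Under this reading, 'hi' and 'hin' behave the same way because neither has a prefix file, while a listed code such as 'ta' would be found directly.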

However, all hope is not lost. You can tokenize your text with another tokenizer before training the machine translation model with Moses. Try searching for "Hindi tokenizers"; there are plenty of them around.
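
If you just want something to get started, a very small pre-tokenizer could look like the sketch below. It is only an assumption of what "good enough" might be for a first baseline, not a proper Hindi tokenizer: it splits off the danda (U+0964), the double danda (U+0965) and common ASCII punctuation, and leaves everything else to whitespace. The names tokenize_hi and tokenize_stream are made up for this example.

```python
import re

# Characters to split off as separate tokens: danda (U+0964), double
# danda (U+0965) and common ASCII punctuation. '.' and ',' are left
# alone when a digit follows, so "3.5" and "1,000" stay in one piece.
# \w is deliberately avoided: in Python it does not match Devanagari
# vowel signs, so it would break words apart.
PUNCT = re.compile(r'([\u0964\u0965!?;:"()\[\]]|[.,](?!\d))')


def tokenize_hi(line):
    """Return `line` with punctuation separated by single spaces."""
    spaced = PUNCT.sub(r" \1 ", line)
    # Collapse runs of whitespace and trim the ends.
    return " ".join(spaced.split())


def tokenize_stream(src, dst):
    """Tokenize a text stream line by line, one sentence per line."""
    for line in src:
        dst.write(tokenize_hi(line) + "\n")
```

Calling tokenize_stream(sys.stdin, sys.stdout) from a small script would let you keep the same input and output redirection as in your tokenizer.perl command. The output is space-separated, one sentence per line, like the tokenizer.perl output.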

solution1
4 ACCEPTED 2014-12-28 22:21:46
