
Statistical Machine Translation from Hindi to English using MOSES

I need to build a Hindi-to-English translation system using MOSES. I have a parallel corpus of about 10,000 Hindi sentences with their English translations, and I followed the method described on the Baseline System page. However, at the very first stage, when I tried to tokenise my Hindi corpus by running

~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l hi < ~/corpus/training/hi-en.hi > ~/corpus/hi-en.tok.hi

the tokeniser gave me the following output:

Tokenizer Version 1.1
Language: hi
Number of threads: 1
WARNING: No known abbreviations for language 'hi', attempting fall-back to English version...

I even tried 'hin', but the language still wasn't recognised. Can anyone tell me the correct way to build the translation system?

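Moses does not ship Hindi-specific tokenization support. tokenizer.perl relies on the nonbreaking_prefix.* files (see https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl#L516 ), and there is no such file for 'hi', which is why you get the warning and the fall-back to English.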

The languages available with nonbreaking prefixes from Moses are:

  • ca: Catalan
  • cs: Czech
  • de: German
  • el: Greek
  • en: English
  • es: Spanish
  • fi: Finnish
  • fr: French
  • hu: Hungarian
  • is: Icelandic
  • it: Italian
  • lv: Latvian
  • nl: Dutch
  • pl: Polish
  • pt: Portuguese
  • ro: Romanian
  • ru: Russian
  • sk: Slovak
  • sl: Slovene
  • sv: Swedish
  • ta: Tamil

from https://github.com/moses-smt/mosesdecoder/tree/master/scripts/share/nonbreaking_prefixes
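To check which language codes a local Moses checkout actually covers, a small sketch along these lines may help. It assumes the prefix files live under scripts/share/nonbreaking_prefixes and are named nonbreaking_prefix.<lang>, as in the repository linked above; the function name has_prefix_file and the ~/mosesdecoder location are only illustrative.

```python
from pathlib import Path

# Assumed location of the prefix files inside a Moses checkout,
# taken from the repository URL quoted above.
PREFIX_DIR = Path("scripts/share/nonbreaking_prefixes")


def has_prefix_file(moses_root, lang):
    """Return True if a nonbreaking_prefix.<lang> file exists in the checkout.

    has_prefix_file is a hypothetical helper, not part of Moses.
    """
    root = Path(moses_root).expanduser()
    return (root / PREFIX_DIR / f"nonbreaking_prefix.{lang}").is_file()


def available_languages(moses_root):
    """List the language codes that have a prefix file in the checkout."""
    folder = Path(moses_root).expanduser() / PREFIX_DIR
    return sorted(p.name.split(".", 1)[1] for p in folder.glob("nonbreaking_prefix.*"))
```

For example, has_prefix_file("~/mosesdecoder", "hi") should tell you whether your checkout has a Hindi file; if it returns False, tokenizer.perl will print the warning shown in the question.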


However, all hope is not lost. You can tokenize your text with another tokenizer before training the machine translation model with Moses. Try searching for "Hindi tokenizers"; there are plenty of them around.
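If you only need something to get started, here is a minimal sketch of a rule-based tokenizer. It is not a Moses component or any particular library's method: it simply splits on whitespace and separates the danda (।), the double danda (॥) and common ASCII punctuation into their own tokens. The names tokenize_hindi and tokenize_file are made up for this example, and the punctuation set is an assumption you will probably want to extend.

```python
import re

# Punctuation to split off as separate tokens: the Devanagari danda and
# double danda plus common ASCII marks. This set is an assumption.
_PUNCT = re.compile(r'([।॥,.!?;:()\[\]{}"\'])')


def tokenize_hindi(line):
    """Split a line into tokens, isolating punctuation from words."""
    spaced = _PUNCT.sub(r" \1 ", line)
    return spaced.split()


def tokenize_file(src_path, dst_path):
    """Write one space-joined tokenized line per input line."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(" ".join(tokenize_hindi(line)) + "\n")
```

Calling tokenize_file on hi-en.hi would produce a file you can use in place of the tokenizer.perl output for the Hindi side. Note that this sketch also splits decimal numbers such as 3.5 and abbreviations that contain a period, so a dedicated Hindi tokenizer is still the better choice for a real system.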
