
Why such a bad performance for Moses using Europarl?

I have started playing around with Moses and tried to build what I believe would be a fairly standard baseline system. I basically followed the steps described on the website, but instead of using news-commentary I used Europarl v7 for training, with the WMT 2006 development set and the original Europarl common test set. My idea was to do something similar to Le Nagard & Koehn (2010), who obtained a BLEU score of .68 in their baseline English-to-French system.

To summarise, my workflow was more or less this:

  1. tokenizer.perl on everything
  2. lowercase.perl (instead of truecase)
  3. clean-corpus-n.perl
  4. Train an IRSTLM model using only the French data from Europarl v7
  5. train-model.perl exactly as described
  6. mert-moses.pl using the WMT 2006 dev set
  7. Testing and measuring performance as described
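The steps above correspond roughly to the following commands. This is only a sketch, not my exact invocations: the `$MOSES` and `$IRSTLM` variables, the file names, and the LM settings are illustrative, and exact flags vary between versions of the tools.

```shell
# Illustrative only: $MOSES / $IRSTLM point to the respective installs,
# corpus file names are hypothetical, and flags vary by version.

# 1-2. Tokenize and lowercase (shown for French; repeat with -l en for English)
$MOSES/scripts/tokenizer/tokenizer.perl -l fr < corpus.fr > corpus.tok.fr
$MOSES/scripts/tokenizer/lowercase.perl < corpus.tok.fr > corpus.lc.fr

# 3. Drop empty and overly long sentence pairs (keep 1-80 tokens)
$MOSES/scripts/training/clean-corpus-n.perl corpus.lc en fr corpus.clean 1 80

# 4. Train a 3-gram IRSTLM language model on the French side only
$IRSTLM/bin/add-start-end.sh < corpus.clean.fr > lm-input.fr
$IRSTLM/bin/build-lm.sh -i lm-input.fr -n 3 -o lm.fr.iarpa.gz -k 2
$IRSTLM/bin/compile-lm lm.fr.iarpa.gz --text=yes lm.fr.arpa

# 5. Train the translation model (the LM type code depends on the Moses build)
$MOSES/scripts/training/train-model.perl -root-dir train \
  -corpus corpus.clean -f en -e fr -alignment grow-diag-final-and \
  -reordering msd-bidirectional-fe -lm 0:3:$PWD/lm.fr.arpa:8

# 6. Tune feature weights on the WMT 2006 dev set
$MOSES/scripts/training/mert-moses.pl dev2006.en dev2006.fr \
  $MOSES/bin/moses train/model/moses.ini --mertdir $MOSES/bin
```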

And the resulting BLEU score is .26... This leads me to two questions:
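For reference, the .26 is on BLEU's 0-to-1 scale (often reported multiplied by 100): the geometric mean of modified n-gram precisions times a brevity penalty. A minimal single-reference sketch, purely to show the scale and not a substitute for Moses' multi-bleu.perl:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Single-reference BLEU: geometric mean of modified n-gram
    precisions (n = 1..max_n) times the brevity penalty.
    A tiny floor replaces zero precisions instead of proper smoothing."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(overlap / total if overlap else 1e-9)
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty: punish candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(log_avg)
```

An identical hypothesis and reference score 1.0; any n-gram mismatch pulls the score below 1.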

  • Is this a typical BLEU score for this kind of baseline system? I realise Europarl is a pretty small corpus to train a monolingual language model on, even though this is how they do things on the Moses website.
  • Are there any typical pitfalls that someone just starting with SMT and/or Moses may have fallen into? Or do researchers like Le Nagard & Koehn build their baseline systems differently from what is described on the Moses website, for instance using some larger, undisclosed corpus to train the language model?

Just to set things straight first: the .68 you are referring to has nothing to do with BLEU.

My idea was to do something similar to Le Nagard & Koehn (2010), who obtained a BLEU score of .68 in their baseline English-to-French system.

The article you refer to only states that 68% of the pronouns (using co-reference resolution) were translated correctly. Nowhere does it mention that a .68 BLEU score was obtained. As a matter of fact, no BLEU scores were given, probably because the qualitative improvement the paper proposes cannot be measured with statistical significance (which happens a lot if you only improve a small number of words). For this reason, the paper uses a manual evaluation of the pronouns only:

A better evaluation metric is the number of correctly translated pronouns. This requires manual inspection of the translation results.

This is where the .68 comes into play.

Now to answer your questions with respect to the .26 you got:

Is this a typical BLEU score for this kind of baseline system? I realise Europarl is a pretty small corpus to train a monolingual language model on, even though this is how they do things on the Moses website.

Yes, it is. You can find the performance of WMT language pairs at http://matrix.statmt.org/

Are there any typical pitfalls that someone just starting with SMT and/or Moses may have fallen into? Or do researchers like Le Nagard & Koehn build their baseline systems differently from what is described on the Moses website, for instance using some larger, undisclosed corpus to train the language model?

I assume that you trained your system correctly. With respect to the "undisclosed corpus" question: members of the academic community normally state, for each experiment, which data sets were used for training, testing and tuning, at least in peer-reviewed publications. The only exception is the WMT task (see for example http://www.statmt.org/wmt14/translation-task.html), where privately owned corpora may be used if the system participates in the unconstrained track. But even then, people will mention that they used additional data.
