Human evaluation was carried out on primary runs submitted by participants to two of the official MT TED tasks, namely English-German (EnDe) and Vietnamese-English (ViEn).
The human evaluation (HE) dataset created for each MT task is a subset of the official test set (tst2015). Both the EnDe and ViEn tst2015 datasets are composed of 12 TED Talks, and the first half of each talk was selected (56% of the segments for EnDe and 45% for ViEn). The resulting HE sets contain 600 segments for EnDe and 500 segments for ViEn, each corresponding to around 10,000 words.
Human evaluation was based on post-editing, i.e., the manual correction of the MT system output, which was carried out by professional translators.
Five primary runs were evaluated for each of the two tasks, and the resulting evaluation data consist of five new reference translations for each sentence in the two HE sets.
For further information see:
M. Cettolo, J. Niehues, S. Stüker, L. Bentivogli, R. Cattoni, M. Federico.
The IWSLT 2015 Evaluation Campaign.
In Proceedings of the 12th International Workshop on Spoken Language Translation (IWSLT), December 3-4, 2015, Da Nang, Vietnam.
A detailed analysis of the EnDe human evaluation data was carried out with the aim of understanding in what respects neural MT provides better translation quality than phrase-based MT. The results of this analysis are presented in:
L. Bentivogli, A. Bisazza, M. Cettolo, M. Federico.
Neural versus Phrase-Based Machine Translation Quality: a Case Study.
In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), November 1-5, 2016, Austin, Texas, USA.