Human evaluation was carried out on primary runs submitted by participants to two of the official MT TED tasks, namely English-German (EnDe) and English-French (EnFr).
The human evaluation (HE) dataset created for each MT task was a subset of the official test set (tst2015). Both the EnDe and EnFr tst2015 datasets are composed of 12 TED Talks, and the first 56% of each talk was selected. The resulting HE sets are identical for the two tasks and are composed of 600 segments, corresponding to around 10,000 words.
Human evaluation was based on post-editing, i.e. the manual correction of the MT system output, carried out by professional translators.
Five runs were evaluated for each of the two tasks, and the resulting evaluation data consist of five new reference translations for each sentence in the two HE sets.
For further information see:
M. Cettolo, J. Niehues, S. Stüker, L. Bentivogli, R. Cattoni, M. Federico.
The IWSLT 2016 Evaluation Campaign.
In Proceedings of the International Workshop on Spoken Language Translation (IWSLT-2016), Seattle (US-WA), 8-9 December 2016.
A detailed analysis of both EnDe and EnFr human evaluation data was carried out with the aim of understanding in what respects Neural MT provides better translation quality than Phrase-Based MT. The results of this analysis are presented in:
L. Bentivogli, A. Bisazza, M. Cettolo, M. Federico.
Neural versus Phrase-Based MT Quality: An In-Depth Analysis on English-German and English-French.
In Computer Speech & Language, 2018.