Human evaluation was carried out on primary runs submitted by participants to two of the official MT TED tasks, namely English-German (EnDe) and English-French (EnFr).
The human evaluation (HE) dataset created for each MT task is a subset of the corresponding 2013 progress test set (tst2013). Both the EnDe and EnFr tst2013 datasets are composed of 16 TED Talks, and approximately the first 60% of each talk was selected. The resulting HE sets contain 628 segments for EnDe and 622 segments for EnFr, each corresponding to around 11,000 words.
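The selection procedure above (keeping roughly the initial 60% of each talk's segments) can be sketched as follows. This is a minimal illustration, not the actual IWSLT tooling; the talk IDs and segments are invented for the example.

```python
# Hypothetical sketch of the HE-set construction described above:
# for each talk, keep approximately the initial 60% of its segments.
# Talk IDs and segment contents are illustrative, not real tst2013 data.

def select_he_subset(talks, fraction=0.6):
    """Return the initial `fraction` of segments from each talk."""
    subset = {}
    for talk_id, segments in talks.items():
        cutoff = round(len(segments) * fraction)
        subset[talk_id] = segments[:cutoff]
    return subset

talks = {
    "talk_1": [f"seg_{i}" for i in range(10)],
    "talk_2": [f"seg_{i}" for i in range(5)],
}
he_set = select_he_subset(talks)
print(sum(len(segs) for segs in he_set.values()))  # 6 + 3 = 9 segments kept
```

In the real HE sets, applying this cut to the 16 talks of each tst2013 set yields the 628 (EnDe) and 622 (EnFr) segments reported above.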
Human evaluation was based on post-editing, i.e. the manual correction of the MT system output, carried out by professional translators.
Five primary runs were evaluated for each of the two tasks, and the resulting evaluation data consist of five new reference translations for each sentence in the two HE sets.
For further information see:
M. Cettolo, J. Niehues, S. Stüker, L. Bentivogli, M. Federico.
Report on the 11th IWSLT Evaluation Campaign, IWSLT 2014.
In Proceedings of the 11th International Workshop on Spoken Language Translation (IWSLT), Lake Tahoe, US, 2014.