Human Evaluation data
The complete release of the IWSLT 2012 human evaluation data is available here.
Human evaluation was carried out on all primary runs submitted by participants to the following tasks:
OLYMPICS task (Chinese-English)
SLT track (English-French)
MT official track (English-French and Arabic-English)
For each task, systems were evaluated on an evaluation set composed of 400 sentences randomly taken from the test set used for automatic evaluation.
The IWSLT 2012 human evaluation focused on System Ranking, which aims to produce a complete ordering of the systems participating in a given task. The ranking evaluation was carried out with the following characteristics:
the paired-comparison method was used: judges were shown two MT outputs for the same input sentence, together with a reference translation, and had to decide which of the two translation hypotheses was better, taking into account both the content and the fluency of the translation. Judges could also assign a tie when the two translations were equally good or bad
ranking data were collected through crowdsourcing: all the pairwise comparisons to be evaluated were posted to Amazon’s Mechanical Turk through the CrowdFlower interface
For the new OLYMPICS task a round-robin tournament structure was adopted, whereas for the TED task we tested the Double Seeded Knockout with Consolation (DSKOC) tournament structure
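As a rough illustration of how a complete ordering can be derived from pairwise judgments in a round-robin setting, the sketch below scores hypothetical comparisons (one point per win, half a point per tie) and sorts the systems. The system names and judgments are invented for illustration and are not taken from the released data; the actual IWSLT tournament procedures are more elaborate.

```python
from collections import Counter

# Hypothetical pairwise judgments: (system_a, system_b, verdict),
# where the verdict is "A", "B", or "tie". Illustrative data only.
judgments = [
    ("sys1", "sys2", "A"),
    ("sys1", "sys3", "A"),
    ("sys2", "sys3", "tie"),
    ("sys1", "sys2", "B"),
    ("sys2", "sys3", "A"),
    ("sys1", "sys3", "A"),
]

def rank_systems(judgments):
    """Score each system (win = 1, tie = 0.5) and sort by total score."""
    scores = Counter()
    for a, b, verdict in judgments:
        # Ensure both systems appear in the tally even if they never score.
        scores[a] += 0.0
        scores[b] += 0.0
        if verdict == "A":
            scores[a] += 1.0
        elif verdict == "B":
            scores[b] += 1.0
        else:  # tie: half a point each
            scores[a] += 0.5
            scores[b] += 0.5
    return sorted(scores, key=scores.get, reverse=True)

print(rank_systems(judgments))  # → ['sys1', 'sys2', 'sys3']
```

A round-robin tournament simply collects such judgments for every pair of systems, whereas a knockout structure only compares the winners of earlier rounds.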
For further information see:
Marcello Federico, Mauro Cettolo, Luisa Bentivogli, Michael Paul, Sebastian Stüker. "Overview of the IWSLT 2012 Evaluation Campaign."
In Proceedings of the International Workshop on Spoken Language Translation (IWSLT), Hong Kong, 6-7 December 2012.