IWSLT 2011

Human Evaluation data

The complete release of the IWSLT 2011 human evaluation data is available here.

The human evaluation was carried out on all primary runs submitted by participants to the following tasks:

  • SLT-EF

  • MT-EF

  • MT-AE

  • MT-CE

For all MT tasks, individual systems were evaluated jointly with the system combination (SC) runs and the additional online system runs prepared by the organizers.

For each task, systems were evaluated on an evaluation set composed of 400 sentences randomly taken from the test set used for automatic evaluation.

The IWSLT 2011 human evaluation focused on System Ranking, which aims at producing a complete ordering of the systems participating in a given task. In IWSLT 2011, the ranking evaluation was carried out with the following characteristics:

  • the paired-comparison method was used: judges were given two MT outputs for the same input sentence, together with a reference translation, and had to decide which of the two translation hypotheses was better, taking into account both the content and the fluency of the translation. Judges could also assign a tie when both translations were equally good or equally bad.

  • full coverage of paired comparisons between systems was achieved by adopting a round-robin tournament structure, which is the most complete way to determine system ranking

  • the evaluation was carried out using crowdsourcing: all the pairwise comparisons to be evaluated were posted to Amazon’s Mechanical Turk through the CrowdFlower interface.
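The round-robin structure described above can be illustrated with a minimal sketch. The system names, the judgment tuples, and the scoring rule (share of comparisons won, with ties counted as no win for either side) are assumptions made for illustration only; they are not the format of the released data, and the scoring rule is not necessarily the one used to produce the official IWSLT 2011 rankings.

```python
from collections import Counter
from itertools import combinations


def round_robin_pairs(systems):
    """Return every unordered pair of systems (full round-robin coverage)."""
    return list(combinations(systems, 2))


def rank_systems(systems, judgments):
    """Order systems by the share of their comparisons that they won.

    judgments: iterable of (system_a, system_b, winner) tuples, where
    winner is system_a, system_b, or None for a tie.
    This scoring rule is an illustrative assumption.
    """
    wins = Counter()
    total = Counter()
    for a, b, winner in judgments:
        total[a] += 1
        total[b] += 1
        if winner is not None:  # a tie gives no win to either system
            wins[winner] += 1

    def score(system):
        return wins[system] / total[system] if total[system] else 0.0

    return sorted(systems, key=score, reverse=True)


# Hypothetical systems and judgments, not taken from the release.
systems = ["sysA", "sysB", "sysC"]
pairs = round_robin_pairs(systems)
judgments = [
    ("sysA", "sysB", "sysA"),
    ("sysA", "sysC", "sysA"),
    ("sysB", "sysC", None),    # tie
    ("sysB", "sysC", "sysB"),
]
ranking = rank_systems(systems, judgments)
print(pairs)
print(ranking)
```

With n systems, a round-robin design yields n(n-1)/2 system pairs, each of which is judged on the sentences of the evaluation set.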

For further information, see the following papers:

Marcello Federico, Luisa Bentivogli, Michael Paul, Sebastian Stüker. 2011. Overview of the IWSLT 2011 evaluation campaign. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT), San Francisco, CA, 8-9 December 2011.

Marcello Federico, Sebastian Stüker, Luisa Bentivogli, Michael Paul, Mauro Cettolo, Teresa Herrmann, Jan Niehues, Giovanni Moretti. 2012. The IWSLT 2011 Evaluation Campaign on Automatic Talk Translation. In Proceedings of LREC 2012, Istanbul, Turkey, 23-25 May 2012.