IWSLT 2012

Human Evaluation Data

The complete release of the IWSLT 2012 human evaluation data is available here.

Human evaluation was carried out on all primary runs submitted by participants to the following tasks:

  • OLYMPICS task (Chinese-English)

  • TED task:

    • SLT track (English-French)

    • MT official track (English-French and Arabic-English)

For each task, systems were evaluated on an evaluation set of 400 sentences randomly sampled from the test set used for automatic evaluation.
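
As an illustration only (not the official selection tooling), building such an evaluation set can be sketched in a few lines of Python; the file name, sample size, and random seed below are assumptions made for the example:

    import random

    def sample_evaluation_set(test_set_path, size=400, seed=2012):
        # Read the test set used for automatic evaluation, one sentence per line.
        with open(test_set_path, encoding="utf-8") as f:
            sentences = [line.rstrip("\n") for line in f if line.strip()]
        # Sample without replacement so each source sentence appears at most once.
        random.seed(seed)
        return random.sample(sentences, size)

    # Hypothetical file name for a TED English-French test set.
    eval_set = sample_evaluation_set("ted.test.en-fr.en", size=400)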

The IWSLT 2012 human evaluation focused on System Ranking, which aims to produce a complete ordering of the systems participating in a given task. The ranking evaluation was carried out with the following characteristics:

  • the paired-comparison method was used: judges were shown two MT outputs of the same input sentence together with a reference translation and had to decide which of the two translation hypotheses was better, taking into account both the content and the fluency of the translation. Judges were also allowed to assign a tie in case both translations were equally good or bad

  • ranking data were collected through crowdsourcing: all the pairwise comparisons to be evaluated were posted to Amazon’s Mechanical Turk through the CrowdFlower interface

  • for the new OLYMPICS task a round-robin tournament structure was adopted, whereas for the TED task the Double Seeded Knockout with Consolation (DSKOC) tournament structure was tested (a sketch of these pairing schemes follows this list)
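
As a rough illustration of these two designs (not the campaign's actual scheduling or scoring code), the sketch below generates the round-robin pairings used for the OLYMPICS task and aggregates pairwise verdicts into a ranking. The win/tie scoring scheme and the system names are assumptions made for the example, and the DSKOC bracket logic used for the TED task is not reproduced here:

    from itertools import combinations
    from collections import Counter

    def round_robin_pairs(systems):
        # Round-robin: every system is compared against every other system.
        return list(combinations(systems, 2))

    def rank_by_wins(judgments):
        # judgments: list of ((system_a, system_b), verdict) tuples,
        # where verdict is "A", "B", or "tie".
        # Assumed scoring: a win counts 1 point, a tie 0.5 for each system.
        scores = Counter()
        for (sys_a, sys_b), verdict in judgments:
            if verdict == "A":
                scores[sys_a] += 1.0
            elif verdict == "B":
                scores[sys_b] += 1.0
            else:
                scores[sys_a] += 0.5
                scores[sys_b] += 0.5
        return [system for system, _ in scores.most_common()]

    systems = ["sysA", "sysB", "sysC", "sysD"]      # placeholder system names
    pairs = round_robin_pairs(systems)              # 6 pairings for 4 systems
    example_judgments = [(pair, "tie") for pair in pairs]
    ranking = rank_by_wins(example_judgments)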

For further information see:

Marcello Federico, Mauro Cettolo, Luisa Bentivogli, Michael Paul, Sebastian Stüker. Overview of the IWSLT 2012 evaluation campaign. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT), Hong Kong, 6-7 December 2012.