Human Evaluation data
Human evaluation focused on Multilingual translation and was specifically carried out on the four language directions for which the Zero-Shot translation task was proposed, namely Dutch-German (nl-de), German-Dutch (de-nl), Romanian-Italian (ro-it) and Italian-Romanian (it-ro).
The human evaluation (HE) dataset created for each language direction was a subset of the corresponding 2017 test set (tst2017). All four tst2017 sets (nl-de, de-nl, ro-it, it-ro) are composed of the same 10 TED Talks, and roughly the first half of each talk was included in the HE set. The resulting HE sets are identical and include 603 segments, corresponding to around 10,000 words for each source text.
Human evaluation included two different assessment methodologies, namely direct assessment (DA) of absolute translation quality and the traditional IWSLT evaluation based on post-editing (PE), where the MT outputs are post-edited (i.e. manually corrected) by professional translators and then evaluated according to TER-based metrics. DA and PE data collection followed different criteria:
All systems submitted to the nl-de, de-nl, it-ro and ro-it tasks were officially evaluated and ranked according to DA
Two DA tasks were carried out: one where MT quality was assessed according to the source sentence (source-based DA) and one where MT quality was assessed according to the reference translation (reference-based DA)
Assessments for each MT system were collected for a subset of around 300 segments out of the 603 composing the HE dataset.
Direct Assessment data collection was funded and carried out by Microsoft Cloud+AI, Redmond, WA, USA.
For PE, only the nl-de and ro-it tasks were addressed
Only a subset of the submitted systems was post-edited (9 out of 12 systems)
Post-editing was carried out for all the outputs of the selected 9 systems on all the 603 HE segments.
The collection of post-edits was funded by the CRACKER project (EU’s Horizon 2020 research and innovation programme, grant agreement no. 645357)
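The PE-based evaluation described above scores each MT output against its own post-edit with TER-based metrics. As a rough illustration of the idea, the sketch below computes a simplified, shift-free edit rate in Python: the word-level edit distance between the MT output and its post-edit, divided by the length of the post-edit. This is an assumption-laden approximation, not the scoring procedure used in the campaign: real TER also allows block shifts at unit cost, so this version can overestimate the true score, and it applies no tokenization or case normalization. The function names `word_edit_distance` and `simplified_hter` are hypothetical.

```python
def word_edit_distance(hyp, ref):
    """Minimum number of word insertions, deletions and substitutions
    needed to turn the token list `hyp` into the token list `ref`."""
    # prev[j] holds the distance between the hyp prefix seen so far
    # and the first j tokens of ref.
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            substitution = prev[j - 1] + (0 if h == r else 1)
            deletion = prev[j] + 1      # drop a hyp word
            insertion = curr[j - 1] + 1  # add a ref word
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def simplified_hter(mt_output, post_edit):
    """Shift-free approximation of a TER-style score of an MT output
    against its post-edit (assumption: whitespace tokenization)."""
    hyp = mt_output.split()
    ref = post_edit.split()
    if not ref:
        raise ValueError("post-edit must contain at least one word")
    return word_edit_distance(hyp, ref) / len(ref)


if __name__ == "__main__":
    # One missing word against a six-word post-edit.
    print(simplified_hter("the cat sat on mat", "the cat sat on the mat"))
```

A lower value means the translator had to change less of the MT output. For scores comparable with published results, an established TER implementation should be used instead of this sketch.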
For further information see:
M. Cettolo, M. Federico, L. Bentivogli, J. Niehues, S. Stüker, K. Sudoh, K. Yoshino, C. Federmann.
In Proceedings of the International Workshop on Spoken Language Translation (IWSLT-2017), Tokyo, Japan.
An investigation of human evaluation based on Post-editing and its relation with Direct Assessment has been carried out using a subset of IWSLT 2017 data in:
Luisa Bentivogli, Mauro Cettolo, Marcello Federico, Christian Federmann.
"Machine Translation Human Evaluation: an investigation of evaluation based on Post-Editing and its relation with Direct Assessment".
Proceedings of the 15th International Workshop on Spoken Language Translation (IWSLT 2018), Bruges, Belgium, 2018.
The IWSLT 2017 special release of the data used in this paper can be found here.