Web Inventory of Transcribed and Translated Talks
WIT3, an acronym for Web Inventory of Transcribed and Translated Talks, is a ready-to-use version of the multilingual transcriptions of TED talks, prepared for research purposes.
Since 2007, the TED Conference has been posting on its website video recordings of all its talks, together with English subtitles and their translations in more than one hundred languages. To make this collection of talks easier for the research community to use, the original textual contents are redistributed here, together with MT benchmarks and processing tools.
For a detailed description of this corpus, read:
M. Cettolo, C. Girardi, and M. Federico. 2012. WIT3: Web Inventory of Transcribed and Translated Talks. In Proc. of EAMT, pp. 261-268, Trento, Italy.
Please cite the paper if you use this corpus in your work.
▸ Latest version of XML files of the TED Talks (April 2016):
▸ Releases
- 2018-01: special release for the IWSLT 2018 evaluation campaign
- 2017-02: special release for the analyses carried out in the paper
Luisa Bentivogli, Mauro Cettolo, Marcello Federico, Christian Federmann. 2018.
"Machine Translation Human Evaluation: an investigation of evaluation based on Post-Editing and its relation with Direct Assessment".
In Proceedings of the 15th International Workshop on Spoken Language Translation (IWSLT 2018), Bruges, Belgium.
- 2017-01: special release for the IWSLT 2017 evaluation campaign
- 2016-02: special release for human evaluation analysis of MT runs from IWSLT 2016 evaluation campaign
- 2016-01: special release for the IWSLT 2016 evaluation campaign
- 2015-01: special release for the IWSLT 2015 evaluation campaign
- 2014-01: special release for the IWSLT 2014 evaluation campaign
- 2013-01: special release for the IWSLT 2013 evaluation campaign
- 2012-03: special release for the TED task of the IWSLT 2012 evaluation campaign
- 2012-02 (Update: May 2012)
- 2012-01 (Experiments of the above mentioned EAMT 2012 paper)
- 2011-01 (IWSLT 2011 evaluation campaign)
▸ Note on Transcripts/Translations
TED transcripts and translations were generated following these guidelines: How to Tackle a Transcript and How to Tackle a Translation.
WIT3 redistributes the original TED texts in their original format, so the information in the TED guidelines mostly applies to WIT3 texts as well. An important difference concerns the texts of the development and evaluation sets, from which metadata, for example annotations of sound information, are removed.
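As an illustration of what removing sound information from a transcript line can look like, here is a minimal Python sketch. It is not the procedure used to build the WIT3 development and evaluation sets. It assumes that sound annotations appear as short parenthesized tags such as "(Applause)" or "(Laughter)"; the tag list and the function name `strip_sound_tags` are hypothetical and chosen only for this example.

```python
import re

# Assumption: sound annotations are parenthesized tags like "(Applause)".
# This tag list is illustrative only; it is not taken from WIT3 or TED.
SOUND_TAGS = ("Applause", "Laughter", "Music")

# Match an optional run of leading whitespace followed by one listed tag.
_TAG_RE = re.compile(r"\s*\((?:" + "|".join(SOUND_TAGS) + r")\)", re.IGNORECASE)


def strip_sound_tags(line: str) -> str:
    """Remove the listed sound tags and tidy any leftover whitespace."""
    cleaned = _TAG_RE.sub("", line)
    return re.sub(r"\s{2,}", " ", cleaned).strip()


if __name__ == "__main__":
    print(strip_sound_tags("Thank you. (Applause)"))
```

Only the listed tags are removed, so ordinary parenthetical remarks in a talk are left untouched. A real pipeline would need a tag inventory derived from the data itself.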
▸ Terms of Use
TED makes its collection of video recordings and transcripts of talks available under the Creative Commons BY-NC-ND license (see here). WIT3 acknowledges the authorship of TED talks (BY condition) and does not redistribute transcripts for commercial purposes (NC). As regards the integrity of the work (ND), WIT3 only changes the format of the container, while preserving the original contents. WIT3 aims to support research on human language processing as well as the diffusion of TED Talks!
▸ Acknowledgments
The work was partially supported by the EU-BRIDGE and CRACKER projects, funded by the European Commission.
▸ Related resources
The NAIST-NTT TED Talk Treebank
▸ Contact person
Mauro Cettolo (cettolo at fbk.eu)
▸ Author
Christian Girardi
This page was last modified on: 30/10/2018 10:54AM