Web Inventory of Transcribed and Translated Talks
WIT3, an acronym for Web Inventory of Transcribed and Translated Talks, is a ready-to-use version, for research purposes, of the multilingual transcriptions of TED talks.
Since 2007, the TED Conference has been posting on its website all video recordings of its talks, together with English subtitles and their translations into more than 80 languages. To make this collection of talks more readily usable by the research community, the original textual contents are redistributed here, together with MT benchmarks and processing tools.
For a detailed description of this corpus, read:
M. Cettolo, C. Girardi, and M. Federico. 2012. WIT3: Web Inventory of Transcribed and Translated Talks.
In Proc. of EAMT, pp. 261-268, Trento, Italy.
Please cite the paper if you use this corpus in your work.
▸ Latest version of XML files of the TED Talks (April 2016):
- 2016-01: special release for the IWSLT 2016 evaluation campaign
- 2015-01: special release for the IWSLT 2015 evaluation campaign
- 2014-01: special release for the IWSLT 2014 evaluation campaign
- 2013-01: special release for the IWSLT 2013 evaluation campaign
- 2012-03: special release for the TED task of the IWSLT 2012 evaluation campaign
- 2012-02 (Update: May 2012)
- 2012-01 (Experiments of the above mentioned EAMT 2012 paper)
- 2011-01 (IWSLT 2011 evaluation campaign)
▸ Note on Transcripts/Translations
TED transcripts and translations were generated following these guidelines:
How to Tackle a Transcript
How to Tackle a Translation
WIT3 redistributes the original TED texts in their original format; the information in the TED guidelines therefore mostly applies to WIT3 texts as well. An important difference concerns the texts of the development and evaluation sets, from which metadata, such as sound information, are removed.
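As an illustration of the kind of cleanup mentioned above, the following is a minimal sketch of how sound information could be stripped from a transcript line. It assumes that sound information appears as short parenthesized tags such as "(Laughter)" or "(Applause)"; the tag list and the function name `strip_sound_tags` are hypothetical and are not taken from the WIT3 processing tools.

```python
import re

# Assumption: sound information appears as parenthesized tags like "(Applause)".
# This tag list is illustrative only; it is not the list used by WIT3.
SOUND_TAGS = ("Laughter", "Applause", "Music", "Cheering")
_SOUND_RE = re.compile(r"\s*\((?:%s)\)" % "|".join(SOUND_TAGS), re.IGNORECASE)


def strip_sound_tags(line: str) -> str:
    """Remove known parenthesized sound tags and tidy leftover whitespace."""
    cleaned = _SOUND_RE.sub("", line)
    # Collapse any double spaces left behind and trim the ends.
    return re.sub(r"\s{2,}", " ", cleaned).strip()


if __name__ == "__main__":
    print(strip_sound_tags("Thank you. (Applause) It's great to be here."))
```

Only tags in the explicit list are removed, so ordinary parenthetical remarks in a talk are left untouched.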
TED makes its collection of video recordings and transcripts of talks available under the Creative Commons BY-NC-ND license. WIT3 acknowledges the authorship of TED talks (BY condition) and does not redistribute transcripts for commercial purposes (NC). As regards the integrity of the work (ND), WIT3 only changes the format of the container, while preserving the original contents. WIT3 aims to support research on human language processing as well as the diffusion of TED Talks!
The work was partially supported by the EU-BRIDGE projects, funded by the European Commission.
▸ Related resources
The NAIST-NTT TED Talk Treebank
▸ Contact person
Mauro Cettolo (cettolo