Web Inventory of Transcribed and Translated Talks

Table of contents
▿  Latest version of XML files of the TED Talks (April 2016)
▿  Releases
▿  Note on Transcripts/Translations
▿  Terms of Use
▿  Acknowledgments
▿  Related resources
▿  Contact persons

WIT3 (an acronym for Web Inventory of Transcribed and Translated Talks) is a ready-to-use version of the multilingual transcriptions of TED talks, prepared for research purposes.
Since 2007, the TED Conference has been posting video recordings of all its talks on its website, together with English subtitles and their translations into more than 80 languages. To make this collection of talks more usable by the research community, the original textual contents are redistributed here, together with MT benchmarks and processing tools.

For a detailed description of this corpus, read:

M. Cettolo, C. Girardi, and M. Federico. 2012. WIT3: Web Inventory of Transcribed and Translated Talks.
In Proc. of EAMT, pp. 261-268, Trento, Italy.

Please cite this paper if you use the corpus in your work.

▸ Latest version of XML files of the TED Talks (April 2016):

▸ Releases

▸ Note on Transcripts/Translations

TED transcripts and translations were generated following these guidelines: How to Tackle a Transcript and How to Tackle a Translation.
WIT3 redistributes the original TED texts in their original format, so the information in the TED guidelines mostly applies to WIT3 texts as well. One important difference concerns the texts of the development and evaluation sets, from which metadata, such as sound information, are removed.
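
As an illustration of the kind of cleanup described above, the following is a minimal sketch, not the actual WIT3 processing. It assumes that sound information appears in transcripts as short parenthesized tags such as "(Laughter)" or "(Applause)"; the tag list and the function name strip_sound_tags are hypothetical and chosen only for this example.

```python
import re

# Assumption: sound information is marked with short parenthesized tags.
# The list of tags below is hypothetical and would need to be adapted
# to the tags that actually occur in the data.
SOUND_TAG = re.compile(
    r"\s*\((?:Laughter|Applause|Music|Cheering)\)", re.IGNORECASE
)


def strip_sound_tags(line: str) -> str:
    """Remove parenthesized sound tags and collapse leftover whitespace."""
    cleaned = SOUND_TAG.sub("", line)
    return re.sub(r"\s{2,}", " ", cleaned).strip()


if __name__ == "__main__":
    print(strip_sound_tags("Thank you. (Applause) Thank you very much."))
```

Only the listed tags are removed, so ordinary parenthetical remarks in a talk are left untouched; a real pipeline would likely need a broader, data-driven tag inventory.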

▸ Terms of Use

TED makes its collection of video recordings and transcripts of talks available under the Creative Commons BY-NC-ND license. WIT3 acknowledges the authorship of TED talks (BY condition) and does not redistribute transcripts for commercial purposes (NC). As regards the integrity of the work (ND), WIT3 changes only the format of the container and preserves the original contents. WIT3 aims to support research on human language processing as well as the diffusion of TED Talks!

▸ Acknowledgments

The work was partially supported by the EU-BRIDGE project funded by the European Commission (7th Framework Programme).

▸ Related resources

The NAIST-NTT TED Talk Treebank

▸ Contact persons

Mauro Cettolo (cettolo@fbk.eu): data
Christian Girardi (cgirardi@fbk.eu): tools, website