WIT3 (Web Inventory of Transcribed and Translated Talks) is a ready-to-use version, prepared for research purposes, of the multilingual transcriptions of TED talks.
Since 2007, the TED Conference has been posting on its website all video recordings of its talks, together with English subtitles and their translations in more than 80 languages. To make this collection easier for the research community to use, the original textual contents are redistributed here, together with MT benchmarks and processing tools.

For a detailed description of this corpus, read:

M. Cettolo, C. Girardi, and M. Federico. 2012. WIT3: Web Inventory of Transcribed and Translated Talks.
In Proc. of EAMT, pp. 261-268, Trento, Italy.

Please cite this paper if you use this corpus in your work.

TED transcripts and translations were generated following two TED guidelines: "How to Tackle a Transcript" and "How to Tackle a Translation".
WIT3 redistributes the original TED texts in their original format, so the information in the TED guidelines mostly applies to WIT3 texts as well. One important difference concerns the texts of the development and evaluation sets, from which metadata (for example, annotations about sound information) have been removed.

TED makes its collection of video recordings and transcripts of talks available under the Creative Commons BY-NC-ND license. WIT3 acknowledges the authorship of TED talks (BY condition) and does not redistribute transcripts for commercial purposes (NC). As for the integrity of the work (ND), WIT3 changes only the format of the container and preserves the original contents. WIT3 aims to support research on human language processing as well as the diffusion of TED Talks!

The work was partially supported by the EU-BRIDGE project funded by the European Commission (7th Framework Programme).

Mauro Cettolo (cettolo@fbk.eu): data
Christian Girardi (cgirardi@fbk.eu): tools, website