Plain texts for MT
For some language pairs, plain texts for MT experiments can be downloaded from this table. Languages are identified by TED codes, which mostly coincide with ISO 639-1 codes. Numbers are in millions of units (untokenized words). The entry at (row,col) gives the size of the target side of the parallel training data available for the row-to-col language pair. Each entry links to the tar archive of the data for the corresponding language pair; click on it to download. The color of an entry indicates what the archive includes: green if parallel and monolingual training sets and development/evaluation set(s) are provided; blue if only parallel training data is provided.
- in general, the data in the entry (L1,L2) differ from the data in the entry (L2,L1), because the sentence rebuilding and text cleaning operations are asymmetric
- sentences were not rebuilt for language pairs with Chinese or Japanese as the target language; in such cases, the original subtitle segmentation of the TED documents is kept
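
As a rough way to compare a downloaded archive with the size shown in the table, the sketch below counts whitespace-separated (untokenized) words in every file of a tar archive and converts a count to millions. It is only an illustration: the function names `count_words_in_archive` and `to_millions` and the archive name `en-it.tgz` are hypothetical, the files are assumed to be UTF-8 plain text, and the layout of the real archives is not described here.

```python
import tarfile


def count_words_in_archive(archive_path):
    """Return {member name: number of whitespace-separated words}
    for every regular file in a tar archive (compressed or not)."""
    counts = {}
    # "r:*" lets tarfile detect the compression (gzip, bz2, xz or none).
    with tarfile.open(archive_path, mode="r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links, etc.
            stream = tar.extractfile(member)
            n_words = 0
            for raw_line in stream:
                # Assumption: files are UTF-8 text; undecodable bytes are replaced.
                n_words += len(raw_line.decode("utf-8", errors="replace").split())
            counts[member.name] = n_words
    return counts


def to_millions(n_words):
    """Express a word count in millions, the unit used in the table."""
    return n_words / 1_000_000


if __name__ == "__main__":
    # Hypothetical archive name; replace with the file actually downloaded.
    for name, n in sorted(count_words_in_archive("en-it.tgz").items()):
        print(f"{name}\t{to_millions(n):.3f}M words")
```

Whitespace splitting is not a meaningful word count for text written without spaces, such as Chinese or Japanese, so for those target languages the figures from this sketch should not be expected to match the table.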