Plain texts for MT

For some language pairs, plain texts for MT experiments can be downloaded from this table. Languages are identified by TED codes, which mostly coincide with ISO 639-1 codes. Numbers are in millions of units (untokenized words). The (row,col) entry gives the size of the parallel training data available for that language pair, measured on the row-language side. Each entry links to the tar archive of the data for the corresponding language pair; click on it to download. The color of the entry indicates what the archive includes: green if parallel and monolingual training sets and development/evaluation set(s) are provided; blue if only parallel training data is included.
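
Once an archive has been downloaded, it can be unpacked and its size checked with a few lines of Python. The following is only a minimal sketch: the function names unpack_archive and count_words are hypothetical, the internal layout of the archives is not assumed, and counting whitespace-separated strings is just one plausible reading of "untokenized words", so the result may not match the table exactly.

```python
import tarfile
from pathlib import Path


def unpack_archive(archive_path, dest_dir):
    """Extract the regular files of a tar archive into dest_dir.

    Returns the sorted list of extracted file paths.
    """
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    # "r:*" lets tarfile detect the compression (gzip, bzip2, xz, none).
    with tarfile.open(archive_path, "r:*") as tar:
        for member in tar.getmembers():
            target = (dest / member.name).resolve()
            # Skip members that would be written outside dest_dir.
            if dest not in target.parents:
                continue
            if member.isfile():
                tar.extract(member, path=dest)
                extracted.append(target)
    return sorted(extracted)


def count_words(text_path, encoding="utf-8"):
    """Count whitespace-separated strings in a text file.

    Assumption: this approximates the "untokenized words" unit.
    """
    total = 0
    with open(text_path, encoding=encoding) as handle:
        for line in handle:
            total += len(line.split())
    return total
```

Dividing the value returned by count_words by one million gives a figure in the same unit as the table entries, under the assumption stated above.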