Robuta

https://aclanthology.org/L16-1146/
Ivan Habernal, Omnia Zayed, Iryna Gurevych. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16). 2016.
multilingual web, free license, acl anthology, size, corpus
https://github.com/lasigeBioTM/blah7
Annotating a multilingual COVID-19-related corpus.
github, annotating, multilingual, covid, related
https://huggingface.co/papers/2101.00390
large scale, speech corpus, paper, multilingual
https://huggingface.co/comma-project
Org profile for Corpus of Multilingual Medieval Archives on Hugging Face.
comma, project, corpus, multilingual, medieval
https://openreview.net/forum?id=UoEw6KigkUn&ref=unenlightenedgeneralists.com
1.6TB multilingual dataset created collaboratively within BigScience to train language models
roots, corpus, composite, multilingual, dataset
https://www.academia.edu/figures/44771479/table-20-wiki-vs-web-average-word-length
Table 20: B.1: Wiki vs Web — average word length. From
multilingual corpus, table, large
https://openreview.net/forum?id=wwCM9nS6j5&referrer=%5Bthe%20profile%20of%20Anne%20Wu%5D(%2Fprofile%3Fid%3D~Anne_Wu1)
text translation, diverse, multilingual, speech, corpus
https://zenodo.org/records/12543428
The corpus is described in: A. Stan, O. Watts, Y. Mamiya, M. Giurgiu, R. A. J. Clark, J. Yamagishi, S. King, TUNDRA: A Multilingual Corpus of Found Data for...
multilingual corpus, tundra, found, data, tts