https://github.com/lasigeBioTM/blah7
Annotating a multilingual COVID-19-related corpus.
githubannotatingmultilingualcovidrelated
https://openreview.net/forum?id=UoEw6KigkUn&ref=unenlightenedgeneralists.com
A 1.6 TB multilingual dataset created collaboratively within BigScience to train language models.
rootscorpuscompositemultilingualdataset
https://www.academia.edu/figures/44771479/table-20-wiki-vs-web-average-word-length
Table 20: B.1: Wiki vs Web — average word length. From
multilingual corpustablelarge
https://zenodo.org/records/12543428
The corpus is described in: A. Stan, O. Watts, Y. Mamiya, M. Giurgiu, R. A. J. Clark, J. Yamagishi, S. King, TUNDRA: A Multilingual Corpus of Found Data for...
multilingual corpustundrafounddatatts