#AI #LLM output is gunking up the web, especially for lesser-represented languages. Spammers are creating garbage English-language content using LLMs, then machine-translating it into *multiple languages* at once, presumably to generate clickbait ad revenue in all of them.

In English, such gunk accounts for some 9% of total sampled web content. But in languages with less representation on the Internet, the figures could be much higher. In Malay, it’s something like 26%, and in Swahili it’s nearly HALF of everything found on the web.

Paper [pdf]: “A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism”

https://arxiv.org/pdf/2401.05749.pdf
