Over the last few months a new threat from scrapers has emerged: AI companies developing LLMs (Large Language Models) now run very aggressive bots capable of taking down entire websites.
They point thousands of different bots at a single website at the same time just to scrape its content. The result is effectively a DDoS, because sysadmins did not size their infrastructure for that volume of traffic. At the moment the most targeted websites are self-hosted "Git forges": Gitea, GitLab, Forgejo and even the public Codeberg instance. This happened to me as well, which is why this video exists.
In this video you'll see all of this in more detail: the background story and possible solutions to the problem. Before setting up a public-facing Git forge instance, consider using an anti-bot system such as the ones described.
Links
- My comments:
  - https://codeberg.org/forgejo/discussions/issues/297#issuecomment-3102148
  - https://codeberg.org/forgejo/discussions/issues/297#issuecomment-3102762
  - https://codeberg.org/forgejo/discussions/issues/297#issuecomment-3102772
  - https://codeberg.org/forgejo/discussions/issues/297#issuecomment-3103339
  - https://codeberg.org/forgejo/discussions/issues/297#issuecomment-3799817
- FOSS infrastructure is under attack by AI companies: https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies/
- Contact me for your project (form): https://cloud.franco.net.eu.org/apps/forms/s/tcm6igjFcdAANgALJxwqLbnD
- Fiverr profile: https://fiverr.com/franco_masotti
CHAPTERS
0:00 Intro
1:27 My comments intro
5:58 Anubis as mitigation
6:31 Remember to whitelist Git and Git LFS user agents
6:56 This is a global abuse trend
7:53 Outro
#llm #scraping #bot #forgejo #gitea