Opinion Dominion: LLMs seem remarkably fragile

I haven't noticed any discussion of this, surprisingly:

Size doesn't matter: Just a small number of malicious files can corrupt LLMs of any size
Large language models (LLMs), which power sophisticated AI chatbots, are more vulnerable than previously thought. According to research by Anthropic, the UK AI Security Institute and the Alan Turing Institute, it only takes 250 malicious documents to compromise even the largest models.
The vast majority of data used to train LLMs is scraped from the public internet. While this helps them to build knowledge and generate natural responses, it also puts them at risk from data poisoning attacks. It had been thought that as models grew, the risk was minimized because the percentage of poisoned data had to remain the same. In other words, it would need massive amounts of data to corrupt the largest models. But in this study, which is published on the arXiv preprint server, researchers showed that an attacker only needs a small number of poisoned documents to potentially wreak havoc.
To assess the ease of compromising large AI models, the researchers built several LLMs from scratch, ranging from small systems (600 million parameters) to very large (13 billion parameters). Each model was trained on vast amounts of clean public data, but the team inserted a fixed number of malicious files (100 to 500) into each one.

Next, the team tried to foil these attacks by changing how the bad files were organized or when they were introduced in the training. Then they repeated the attacks during each model's last training step, the fine-tuning phase.

What they found was that for an attack to be successful, size doesn't matter at all. As few as 250 malicious documents were enough to install a secret backdoor (a hidden trigger that makes the AI perform a harmful action) in every single model tested. This was even true on the largest models that had been trained on 20 times more clean data than the smallest ones. Adding huge amounts of clean data did not dilute the malware or stop an attack.

That seems so...odd. But maybe it helps explain why Musk has had so much trouble with his attempted fine tuning of Grok to only give answers he likes, and not start praising Hitler!

Opinion Dominion

Monday, October 13, 2025

LLMs seem remarkably fragile

No comments:

Post a Comment