Hello and welcome to Eye on AI…In this edition: A new Anthropic study reveals that even the biggest AI models can be ‘poisoned’ with just a few hundred documents…OpenAI’s deal with Broadcom….Sora 2 and the AI slop issue…and corporate America spends big on AI.
Hi, Beatrice Nolan here. I’m filling in for Jeremy, who is on assignment this week. A recent study from Anthropic, in collaboration with the UK AI Security Institute and the Alan Turing Institute, caught my eye earlier this week. The study focused on the “poisoning” of AI models, and it undermined some conventional wisdom within the AI sector.
The research found that the introduction of just 250 bad documents, a tiny proportion when compared to the billions of texts a model learns from, can secretly produce a “backdoor” vulnerability in large language models (LLMs). This means that even a very small number of malicious files inserted into training data can teach a model to behave in unexpected or harmful ways when triggered by a specific phrase or pattern.
This idea itself isn’t new; researchers have cited data poisoning as a potential vulnerability in machine learning for years, particularly in smaller models or academic settings. What was surprising was that the researchers found that model size didn’t matter.
Small models along with the largest models on the market were both effected by the same small amount of bad files, even though the bigger models are trained on far more total data. This contradicts the common assumption that as AI models get larger they become more resistant to this kind of manipulation. Researchers had previously assumed attackers would need to corrupt a specific percentage of the data, which, for larger models would be millions of documents. But the study showed even a tiny handful of malicious documents can “infect” a model, no matter how large it is.
The researchers stress that this test used a harmless example (making the model spit out gibberish text) that is unlikely to pose significant risks in frontier models. But the findings imply data-poisoning attacks could be much easier, and become much more prolific, than people originally assumed.
Safety training can be quietly unwound
What does all of this mean in real-world terms? Vasilios Mavroudis, one of the authors of the study and a principal research scientist at the Alan Turing Institute, told me he was worried about a few ways this could be scaled by bad actors.
“How this translates in practice is two examples. One is you could have a model that when, for example, it detects a specific sequence of words, it foregoes its safety training and then starts helping the user carry out malicious tasks,” Mavroudis said. Another risk that worries him was the potential for models to be engineered to refuse requests from or be less helpful to certain groups of the population, just by detecting specific patterns in the request or keywords.
“This would be an agenda by someone who wants to marginalize or target specific groups,” he said. “Maybe they speak a specific language or have interests or questions that reveal certain things about the culture…and then, based on that, the model could be triggered, essentially to completely refuse to help or to become less helpful.”
“It’s fairly easy to detect a model not being responsive at all. But if the model is just handicapped, then it becomes harder to detect,” he added.
Rethinking data ‘supply chains’
The paper suggests that this kind of data poisoning could be scalable, and it acts as a warning that stronger defenses, as well as more research into how to prevent and detect poisoning, are needed.
Mavroudis suggests one way to tackle this is for companies to treat data pipelines the way manufacturers treat supply chains: verifying sources more carefully, filtering more aggressively, and strengthening post-training testing for problematic behaviors.
“We have some preliminary evidence that suggests if you continue training on curated, clean data…this helps decay the factors that may have been introduced as part of the process up until that point,” he said. “Defenders should stop assuming the data set size is enough to protect them on its own.”
It’s a good reminder for the AI industry, which is notoriously preoccupied with scale, that bigger doesn’t always mean safer. Simply scaling models can’t replace the need for clean, traceable data. Sometimes, it turns out, all it takes is a few bad inputs to spoil the entire output.
With that, here’s more AI news.
Beatrice Nolan
FORTUNE ON AI
A 3-person policy nonprofit that worked on California’s AI safety law is publicly accusing OpenAI of intimidation tactics — Sharon Goldman
Browser wars, a hallmark of the late 1990s tech world, are back with a vengeance—thanks to AI — Beatrice Nolan and Jeremy Kahn
Former Apple CEO says ‘AI has not been a particular strength’ for the tech giant and warns it has its first major competitor in decades — Sasha Rogelberg
EYE ON AI NEWS
EYE ON AI RESEARCH
AI CALENDAR
Oct. 21-22: TedAI San Francisco.
Nov. 10-13: Web Summit, Lisbon.
Nov. 26-27: World AI Congress, London.
Dec. 2-7: NeurIPS, San Diego.
Dec. 8-9: Fortune Brainstorm AI San Francisco. Apply to attend here.
BRAIN FOOD
Sora 2 and the AI slop issue. OpenAI’s newest iteration of its video-generation software has caused quite a stir since it launched earlier this month. The technology has horrified the children of deceased actors, caused a copyright row, and sparked headlines including: “Is art dead?”
The death of art seems less like the issue than the inescapable spread of AI “slop.” AI-generated videos are already cramming people’s social media, which raises a bunch of potential safety and misinformation issues, but also risks undermining the internet as we know it. If low-quality, mass-produced slop floods the web, it risks pushing out authentic human content and siphoning engagement away from the content that many creators rely on to make a living.
OpenAI has tried to watermark Sora 2’s content to help viewers tell AI-generated clips from real footage, automatically adding a small cartoon cloud watermark to every video it produces. However, a report from 404 Media found that the watermark is easy to remove and that multiple websites already offer tools to strip it out. The outlet tested three of the sites and found that each could erase the watermark within seconds. You can read more on that from 404 Media here.
Great Job Beatrice Nolan & the Team @ Fortune | FORTUNE Source link for sharing this story.