MIT Study Reveals Vision-Language Models Struggle with Simple Words Like “No”

May 16, 2025

It turns out that some of the most powerful AI systems in use today can be thrown off by a tiny word: “no.”

A new study from MIT researchers has shown that vision-language models (VLMs), the kind of model that powers tools used in medical imaging, manufacturing, and media search, consistently fail to interpret negation words like “no,” “not,” and “doesn’t.” The implications could be serious.

Negation Isn’t Just Semantics

Imagine asking a model to retrieve X-rays of patients with swelling but no enlarged heart. That one word, “no,” could radically change the diagnosis. But current VLMs might ignore it entirely and surface images of patients with both findings, which could mislead clinicians.

In benchmark tests created by the MIT team, VLMs performed at or below random chance when handling captions with negation. And when asked to choose between nearly identical captions — where the only difference was the presence of a “not” or an excluded object — models regularly chose the wrong one.

The researchers identified an “affirmation bias”: VLMs skip over negation and instead latch onto positive objects in the image. It’s a shortcut baked into how these models are trained.
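
To make the affirmation bias concrete, here is a minimal Python sketch of a multiple-choice caption test. It is not the MIT benchmark and it does not call a real VLM: each image is stood in for by a set of object labels, and the two scoring functions (`affirmation_biased_score`, `negation_aware_score`), the `VOCAB` set, and the three test items are all hypothetical, invented for illustration.

```python
import re

# Hypothetical vocabulary and negation cues, for illustration only.
VOCAB = {"dog", "fence", "cat", "sofa", "car", "tree", "bus"}
NEGATION_WORDS = {"no", "not", "without"}

def tokens(text):
    """Lowercase a caption and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def affirmation_biased_score(image_objects, caption):
    """Toy scorer that counts mentioned objects present in the image and ignores negation."""
    return sum(1 for word in tokens(caption) if word in image_objects)

def negation_aware_score(image_objects, caption):
    """Toy scorer that expects an object following a negation word to be absent."""
    score, negated = 0, False
    for word in tokens(caption):
        if word in NEGATION_WORDS:
            negated = True
        elif word in VOCAB:
            present = word in image_objects
            # Reward "mentioned and present" or "negated and absent".
            score += 1 if present != negated else -1
            negated = False
    return score

def multiple_choice_accuracy(score_fn, items):
    """Pick the highest-scoring caption per item and return the fraction answered correctly."""
    correct = 0
    for image_objects, options, answer_index in items:
        scores = [score_fn(image_objects, option) for option in options]
        predicted = scores.index(max(scores))
        correct += predicted == answer_index
    return correct / len(items)

# Each item: (objects in the image, candidate captions, index of the correct caption).
ITEMS = [
    ({"dog", "fence"}, ["a dog with no fence", "a dog with no cat"], 1),
    ({"cat", "sofa"}, ["a cat and no dog", "a cat and no sofa"], 0),
    ({"car", "tree"}, ["a car but not a tree", "a car but not a bus"], 1),
]

print(multiple_choice_accuracy(affirmation_biased_score, ITEMS))  # 0.0
print(multiple_choice_accuracy(negation_aware_score, ITEMS))      # 1.0
```

With two options per item, random guessing would score 50%. The biased scorer gets none of the three items right, because the negated object still counts as a match whenever it appears in the image.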

Why Are They Failing?

Most VLMs are trained on image-caption pairs that describe what is present in the image. But as Dr. Marzyeh Ghassemi points out, no one writes captions like: “A dog jumping a fence — without helicopters.” So VLMs never learn what absence looks like.

To tackle this, the researchers generated a new dataset with synthetic negation captions, prompting a language model to describe what is not in each image. Fine-tuning VLMs with this dataset improved performance across tasks, including:

+10% in image retrieval with negated captions
+30% in accuracy on multiple-choice caption questions
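
As a rough illustration of what synthetic negation captions can look like, the Python sketch below builds them from fixed templates. This is a stand-in, not the researchers' pipeline: the study prompted a language model, while this example uses string templates. The function name `make_negation_examples`, the object vocabulary, and the caption are hypothetical.

```python
import random

def make_negation_examples(caption, present_objects, vocabulary, n_absent=2, seed=0):
    """Return (caption, matches_image) pairs built from simple negation templates.

    Assumption: the objects present in the image are already known, for example
    from existing annotations. Template generation replaces the language model
    used in the study.
    """
    rng = random.Random(seed)
    absent = sorted(set(vocabulary) - set(present_objects))
    chosen = rng.sample(absent, k=min(n_absent, len(absent)))
    # True statements: the original caption plus an object that really is absent.
    examples = [(f"{caption}, with no {obj}", True) for obj in chosen]
    # False statements: deny an object that is actually in the image.
    examples += [(f"a photo with no {obj}", False) for obj in sorted(present_objects)]
    return examples

examples = make_negation_examples(
    caption="a dog jumping a fence",
    present_objects={"dog", "fence"},
    vocabulary=["dog", "fence", "helicopter", "cat", "bus"],
)
for text, matches_image in examples:
    print(matches_image, "|", text)
```

Pairing true and false negated captions for the same image gives a fine-tuning or evaluation set in which the negation word is the only signal that separates a match from a mismatch.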

Still, the researchers stress this is just data augmentation — not a fix to the models themselves. The takeaway? Users should test VLMs with negative examples before trusting them in high-stakes contexts.
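
One way to act on that advice is a small pre-deployment check that measures how often a retrieval system returns images containing the very thing the query excluded. The Python sketch below is an assumption-laden example: `negation_violation_rate`, `keyword_retriever`, and the labeled images are hypothetical, and the toy retriever simply matches keywords. In practice, `retrieve` would wrap the actual VLM under test.

```python
def negation_violation_rate(retrieve, query, excluded_object, image_labels, k=5):
    """Fraction of the top-k retrieved images that contain the object the query excluded."""
    top_k = retrieve(query, k)
    if not top_k:
        return 0.0
    violations = sum(1 for image_id in top_k if excluded_object in image_labels[image_id])
    return violations / len(top_k)

def keyword_retriever(image_labels):
    """Build a toy retriever that ranks images by how many of their labels appear in the query.

    It ignores negation on purpose, to mimic the failure described in the article.
    """
    def retrieve(query, k):
        query_text = query.lower()
        def score(image_id):
            return sum(1 for label in image_labels[image_id] if label in query_text)
        return sorted(image_labels, key=score, reverse=True)[:k]
    return retrieve

# Hypothetical ground-truth labels for a tiny image collection.
IMAGE_LABELS = {
    "xray_01": {"swelling", "enlarged heart"},
    "xray_02": {"swelling"},
    "xray_03": {"enlarged heart"},
    "xray_04": {"swelling", "enlarged heart"},
    "xray_05": set(),
}

rate = negation_violation_rate(
    retrieve=keyword_retriever(IMAGE_LABELS),
    query="swelling but no enlarged heart",
    excluded_object="enlarged heart",
    image_labels=IMAGE_LABELS,
    k=2,
)
print(rate)  # 1.0
```

Here both top results show an enlarged heart even though the query ruled it out, so the violation rate is 1.0. A team could set its own acceptable threshold and block deployment when the rate exceeds it.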

When “Not” Matters

Whether it’s detecting missing parts in a factory or ruling out conditions in radiology, ignoring negation can be a costly mistake. The study urges caution, especially as VLMs are increasingly deployed in sensitive sectors like healthcare, law enforcement, and defense.

As lead author Kumail Alhamoud puts it, “Negation words can have a very significant impact, and if we are just using these models blindly, we may run into catastrophic consequences.”

And if your model can’t handle a simple “not,” maybe it shouldn’t be making the call in the first place.
