About
Methodology and dataset information
Hot Dog or Not is a benchmark for LLM vision models. Every model gets the same images and has to answer the same question: is this a hot dog? Each response includes a reasoning trace explaining what the model saw and why it decided the way it did.
The question is simple but the dataset isn't. Bratwursts in buns, deconstructed chili dogs, corn dogs. These sit right at the edge of what counts as a "hot dog," and models have to commit to yes or no with no room to hedge.
We care more about the reasoning than the accuracy number. The traces show which visual features a model latched onto, where it second-guessed itself, and where its representation of "hot dog" broke down. That tells you something about how the model actually processes images.
It gets more useful when you compare models. Two models look at the same ambiguous image, one says yes, one says no. Read their reasoning side by side and you can see exactly where they diverge: what one model treated as a defining feature, the other ignored entirely.
The default dataset is 180 images sourced from Pexels, split evenly into two categories:
- hot_dog — 90 images of hot dogs
- not_hot_dog — 90 images of other food
The repo includes two Pexels download scripts in scripts/: one for hot dog images and one for not-hot-dog images (intentionally chosen to look similar and trip up models). Get a free API key at pexels.com/api to download more.
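For a rough idea of what those scripts do, here is a minimal sketch of a Pexels search-and-download loop. It is not the repo's actual script; the query, page size, and output folder are just examples, and it assumes a PEXELS_API_KEY environment variable.

```python
# Minimal sketch of a Pexels search-and-download loop (not the repo's script).
# The query, page size, and output folder here are illustrative examples.
import os
from pathlib import Path

import requests

API_KEY = os.environ["PEXELS_API_KEY"]
OUT_DIR = Path("backend/data/test/hot_dog")  # or not_hot_dog
OUT_DIR.mkdir(parents=True, exist_ok=True)

resp = requests.get(
    "https://api.pexels.com/v1/search",
    headers={"Authorization": API_KEY},
    params={"query": "hot dog", "per_page": 30},
    timeout=30,
)
resp.raise_for_status()

for photo in resp.json()["photos"]:
    # The medium-size rendition is plenty for a vision benchmark.
    image = requests.get(photo["src"]["medium"], timeout=30)
    (OUT_DIR / f"{photo['id']}.jpg").write_bytes(image.content)
```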
You can also add your own images. Drop them into backend/data/test/hot_dog/ and backend/data/test/not_hot_dog/ and the benchmark picks them up automatically. Any mix of jpg, png, or webp works.
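"Picked up automatically" just means a directory walk: anything with a matching extension under the two class folders, with the folder name as the label. A minimal sketch of what that loader could look like; the function name and return shape are assumptions, not the benchmark's actual code.

```python
# Sketch of how custom images get picked up: walk the two class folders and use
# the folder name as the label. Function name and return shape are illustrative.
from pathlib import Path

EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}

def load_test_images(root: str = "backend/data/test") -> list[tuple[Path, str]]:
    """Return (image_path, label) pairs, where label is 'hot_dog' or 'not_hot_dog'."""
    samples = []
    for label in ("hot_dog", "not_hot_dog"):
        for path in sorted((Path(root) / label).iterdir()):
            if path.suffix.lower() in EXTENSIONS:
                samples.append((path, label))
    return samples
```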
Each image is sent to the model with this prompt:
Look at the image. Is it a hot dog (food: a sausage served in a bun/roll; any cooking style)? Output exactly:
Observations: <brief description of what is visible>
Answer: <yes|no>
Temperature is set to 0.0 for deterministic responses. The answer line is parsed for "yes" or "no." Unparseable responses count as errors.
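Concretely, the parsing step only needs to find the Answer line and accept a clean yes or no; everything else is an error. A minimal sketch, with the regex and function name as illustrative choices rather than the benchmark's exact code:

```python
# Sketch of the answer-parsing step: find the "Answer:" line and accept only a
# clean yes or no. Anything else comes back as None and is counted as an error.
import re

ANSWER_RE = re.compile(r"^\s*Answer:\s*(yes|no)\b", re.IGNORECASE | re.MULTILINE)

def parse_answer(response_text: str) -> str | None:
    match = ANSWER_RE.search(response_text)
    return match.group(1).lower() if match else None
```

So `parse_answer("Observations: a sausage in a bun with mustard\nAnswer: yes")` gives "yes", while a reply that rambles without a clean Answer line gives None.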
Each run reports four metrics:
- Accuracy - Fraction of correct predictions
- Precision - Of images predicted as hot dogs, how many actually are
- Recall - Of actual hot dogs, how many were correctly identified
- F1 Score - Harmonic mean of precision and recall
Positive class = hot dog (model answers "yes"), Negative class = not hot dog (model answers "no").
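Spelled out with hot dog as the positive class, all four metrics come down to counting true and false positives and negatives. A small sketch, assuming predictions and ground-truth labels have already been mapped to "yes"/"no" (not the benchmark's actual scoring code):

```python
# Sketch of the scoring step with hot dog as the positive class. Both lists hold
# "yes"/"no" strings; in practice unparseable responses would be tracked separately.
def score(predictions: list[str], labels: list[str]) -> dict[str, float]:
    tp = sum(p == "yes" and y == "yes" for p, y in zip(predictions, labels))
    fp = sum(p == "yes" and y == "no" for p, y in zip(predictions, labels))
    fn = sum(p == "no" and y == "yes" for p, y in zip(predictions, labels))
    tn = sum(p == "no" and y == "no" for p, y in zip(predictions, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": (tp + tn) / len(labels),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall) if precision + recall else 0.0,
    }
```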
All models are accessed through OpenRouter's free tier. Current models:
- NVIDIA Nemotron Nano 12B VL
- Google Gemma 3 27B
- AllenAI Molmo 2 8B
- Google Gemma 3 12B
Models may change as new free vision models become available on OpenRouter.
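For reference, one benchmark call is a single OpenAI-style chat completion against OpenRouter with the image attached as a data URL. A minimal sketch, assuming the requests library and an OPENROUTER_API_KEY environment variable; the model slug is only an example of a free vision model and may not match the current lineup.

```python
# Sketch of a single benchmark call through OpenRouter's OpenAI-compatible API.
# The model slug is an example of a free vision model and may change over time.
import base64
import os
from pathlib import Path

import requests

PROMPT = (
    "Look at the image. Is it a hot dog (food: a sausage served in a bun/roll; "
    "any cooking style)? Output exactly:\n"
    "Observations: <brief description of what is visible>\n"
    "Answer: <yes|no>"
)

def ask_model(image_path: str, model: str = "google/gemma-3-27b-it:free") -> str:
    image_b64 = base64.b64encode(Path(image_path).read_bytes()).decode()
    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": model,
            "temperature": 0.0,
            "messages": [{
                "role": "user",
                "content": [
                    {"type": "text", "text": PROMPT},
                    # MIME type hard-coded for brevity; a real loop would match the file type.
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                ],
            }],
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```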
A few directions worth exploring next:
- Image difficulty scoring — rank images by how many models get them wrong. An image that trips up 3 out of 4 models tells you more than one they all ace. (A rough sketch follows this list.)
- Reasoning feature extraction — parse observation traces to find which visual features models mention (bun, sausage, mustard, shape) and correlate them with accuracy. Do false positives share a common trigger?
- Consistency testing — run the same model on the same image multiple times at temperature > 0. A model that flips between yes and no on repeated runs is uncertain in a way that a single answer hides.
- Prompt sensitivity — same images, different prompt wording. If accuracy swings 20% on a rephrasing, the model's visual understanding is thinner than the number suggests.
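The difficulty-scoring idea is mostly bookkeeping over per-model results. A rough sketch, assuming a results structure that maps each model to per-image correctness (not an existing data format in the repo):

```python
# Sketch of the difficulty-scoring idea: count, per image, how many models got it
# wrong and sort hardest-first. The `results` shape is assumed, not an existing format:
# results[model_name][image_path] is True when that model answered correctly.
from collections import Counter

def rank_by_difficulty(results: dict[str, dict[str, bool]]) -> list[tuple[str, int]]:
    wrong = Counter()
    for per_image in results.values():
        for image, correct in per_image.items():
            if not correct:
                wrong[image] += 1
    # Only images that at least one model missed appear; hardest images come first.
    return wrong.most_common()
```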