
Benchmarking Claude 5 and DeepSeek-R1 on Code Generation with Broken Specs
Why Broken Specs Make a Better Benchmark
If you only test code models with tidy prompts, you mostly measure obedience. Real engineering work is messier. Specs are incomplete, contradictory, or quietly wrong.
That is why I like broken specs as an eval. They show whether a model can:
- notice missing requirements
- stop and ask for clarification
- make a small, defensible assumption
- avoid inventing API behavior
That difference matters. A model can write clean code from a perfect prompt and still be risky in a real codebase if it silently fills gaps with the wrong defaults.
Test Setup and Evaluation Criteria
I used three tasks: a utility function, a stateful app feature, and a parser with edge cases. Each prompt was incomplete in a different way.
What Counts as a Broken Spec
A spec was “broken” if it had at least one of these problems:
- a missing data shape
- a contradictory requirement
- an undefined error path
- an unclear contract with a caller or backend
- a default that could be unsafe if guessed wrong
Example:
// Broken on purpose: no clear contract for `user.profile`
function formatUser(user) {
return `${user.profile.name} <${user.email}>`;
}
The goal is not to be unfair. It is to see how the model behaves when the prompt stops being polite.
Scoring for Correctness, Recovery, and Assumptions
I scored each output in three buckets:
| Score area | What I looked for | Good behavior |
|---|---|---|
| Correctness | Does the code solve the visible task? | Works for the stated case |
| Recovery | Does it handle missing or conflicting requirements? | Asks, guards, or isolates assumptions |
| Assumptions | Does it invent details quietly? | States assumptions clearly |
I also tracked whether the model changed behavior after a contradiction was introduced. That is where the useful failures showed up.
Model Behavior Under Missing Requirements
Claude 5: How It Handles Gaps and Clarifications
Claude 5 tended to surface ambiguity earlier. In the best cases, it would say the spec was underspecified and then give a minimal implementation with explicit assumptions.
That sounds boring, but it is exactly what I want in a code assistant. If the prompt does not define the shape of an object, I would rather see:
- a guard
- a comment about the assumption
- a request for the missing contract
than a fully rendered fantasy API.
The downside is that it sometimes gets cautious enough to slow down. For internal tooling, that is usually fine. For rapid prototyping, it can feel under-committed.
DeepSeek-R1: How It Responds to Contradictions
DeepSeek-R1 was stronger when the prompt had a concrete coding path, but it was also more willing to continue through contradictions as if they were intentional. That can produce impressive-looking output with a hidden mismatch.
The pattern I saw was:
- it completed the task quickly
- it chose one interpretation and committed hard
- it sometimes missed that the prompt had two incompatible requirements
That is a real eval signal. A model that “solves” a contradiction without flagging it may be less reliable in a code review workflow than one that pauses to challenge the spec.
Concrete Failure Modes I Looked For
Silent Assumptions in Data Shapes
This was the most common bug class. The prompt would mention items, but not say whether items could be null, nested, or partially loaded.
A weak response would assume an array of clean objects and move on. A better one would defensively handle shape drift:
function normalizeItems(items) {
if (!Array.isArray(items)) return [];
return items
.filter(Boolean)
.map((item) => ({
id: String(item.id ?? ""),
label: String(item.label ?? "Unknown"),
}));
}
The interesting part is not the guard itself. It is whether the model explains why the guard exists.
API Contract Drift
I looked for cases where the model invented request or response fields that the prompt never defined. This happens a lot in “full stack” completions.
A typical bad move is generating a client call like:
await fetch("/api/save", {
method: "POST",
body: JSON.stringify({ id, status, mode: "force" }),
});
when the spec only mentioned id and status. That extra mode looks harmless, but it is contract drift. In a real system, it becomes a production bug or a security issue if the backend trusts unknown fields.
Overconfident Completion of Unsafe Defaults
This matters most when the task touches permissions, auth, or destructive actions. If the prompt says “delete the record” but never says who is allowed to delete it, a risky model may quietly add a default branch and keep going.
A safer pattern is to fail closed:
function canDeleteRecord(user, record) {
if (!user?.id || !record?.ownerId) return false;
return user.id === record.ownerId || user.role === "admin";
}
The model should not invent an authorization policy. If it must assume one, it should call that out directly.
Side-by-Side Results from the Sample Tasks
Small Utility Function
For the small utility task, both models did well when the input was clean. Claude 5 was more likely to mention the undefined edge cases. DeepSeek-R1 was more likely to produce a complete function on the first pass.
Stateful App Feature
For the stateful feature, the gap widened. Claude 5 handled missing state transitions more carefully. DeepSeek-R1 sometimes filled in a UI flow that felt plausible but did not match the spec's incomplete backend contract.
Edge Case Heavy Parser
The parser task exposed the most subtle differences. Claude 5 was better at refusing to guess around malformed input. DeepSeek-R1 was stronger at producing the main parse path, but more willing to overcommit on edge cases it had not been given.
What the Benchmark Actually Measures
This kind of benchmark is not mainly about “which model writes better code.”
It measures whether the model can work inside a broken engineering process without making the situation worse.
That includes:
- recognizing ambiguity
- preserving caller contracts
- resisting fake certainty
- recovering without inventing behavior
In practice, that is closer to code review quality than pure generation quality.
How to Build a Better Internal Eval
If you want to run this internally, keep it simple:
- Write prompts with one missing requirement and one contradiction.
- Score the first response before any follow-up.
- Track assumptions explicitly in a rubric.
- Add at least one auth or destructive-action task.
- Compare not just success, but how the model explains uncertainty.
I also recommend keeping a small set of known-bad prompts that have historically caused contract drift. Reusing them is useful because you can see whether prompt quality or model behavior changed over time.
Conclusion
Broken specs are useful because they look like real work. The model that performs best here is not the one that always finishes the code. It is the one that knows when not to guess.
For me, that makes Claude 5 the safer default for ambiguous tasks and DeepSeek-R1 the stronger “move fast” option when the contract is already clear. Neither result is absolute. The important part is that broken specs expose the difference.


