Benchmarking Claude 5 and DeepSeek-R1 on Code Generation with Broken Specs


Why Broken Specs Make a Better Benchmark

If you only test code models with tidy prompts, you mostly measure obedience. Real engineering work is messier. Specs are incomplete, contradictory, or quietly wrong.

That is why I like broken specs as an eval. They show whether a model can:

notice missing requirements
stop and ask for clarification
make a small, defensible assumption
avoid inventing API behavior

The difference between asking and guessing matters. A model can write clean code from a perfect prompt and still be risky in a real codebase if it silently fills gaps with the wrong defaults.

Test Setup and Evaluation Criteria

I used three tasks: a utility function, a stateful app feature, and a parser with edge cases. Each prompt was incomplete in a different way.

What Counts as a Broken Spec

A spec was “broken” if it had at least one of these problems:

a missing data shape
a contradictory requirement
an undefined error path
an unclear contract with a caller or backend
a default that could be unsafe if guessed wrong

Example:

// Broken on purpose: no clear contract for `user.profile`
function formatUser(user) {
  return `${user.profile.name} <${user.email}>`;
}

The goal is not to be unfair. It is to see how the model behaves when the prompt stops being polite.

Scoring for Correctness, Recovery, and Assumptions

I scored each output in three buckets:

Score area	What I looked for	Good behavior
Correctness	Does the code solve the visible task?	Works for the stated case
Recovery	Does it handle missing or conflicting requirements?	Asks, guards, or isolates assumptions
Assumptions	Does it invent details quietly?	States assumptions clearly

I also tracked whether the model changed behavior after a contradiction was introduced. That is where the useful failures showed up.
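One way to record those three buckets is a small score object per response. The sketch below is only an illustration, not the harness behind these results: the 0 to 2 scale, the field names, and the validateScore and summarizeScores helpers are all assumptions, and the sample scores are made up.

```javascript
// Hypothetical rubric: each area is scored 0 (poor), 1 (partial), or 2 (good).
// The scale and the field names are assumptions for illustration.
const SCORE_AREAS = ["correctness", "recovery", "assumptions"];

// Reject scores that are missing an area or fall outside the 0-2 scale.
function validateScore(score) {
  for (const area of SCORE_AREAS) {
    const value = score[area];
    if (!Number.isInteger(value) || value < 0 || value > 2) {
      throw new RangeError(`${area} must be an integer from 0 to 2`);
    }
  }
  return score;
}

// Average each area separately so a high correctness score
// cannot hide a low recovery score.
function summarizeScores(scores) {
  const summary = {};
  for (const area of SCORE_AREAS) {
    const total = scores.reduce((sum, s) => sum + validateScore(s)[area], 0);
    summary[area] = scores.length ? total / scores.length : 0;
  }
  return summary;
}

// Made-up sample scores, not results from this benchmark.
const example = summarizeScores([
  { correctness: 2, recovery: 0, assumptions: 1 },
  { correctness: 2, recovery: 2, assumptions: 1 },
]);
console.log(example); // { correctness: 2, recovery: 1, assumptions: 1 }
```

Keeping the areas separate, rather than summing them into one number, makes it harder for a fluent but overconfident answer to look as good as a careful one.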

Model Behavior Under Missing Requirements

Claude 5: How It Handles Gaps and Clarifications

Claude 5 tended to surface ambiguity earlier. In the best cases, it would say the spec was underspecified and then give a minimal implementation with explicit assumptions.

That sounds boring, but it is exactly what I want in a code assistant. If the prompt does not define the shape of an object, I would rather see:

a guard
a comment about the assumption
a request for the missing contract

than a fully rendered fantasy API.
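As a concrete illustration, here is roughly what that kind of response could look like for the formatUser example from earlier. This is a minimal sketch, not captured model output. The null return value and the assumption that email is a string are mine, not part of any spec.

```javascript
// ASSUMPTION (not in the spec): `user.profile` may be missing or partially
// loaded, and `email` is a plain string when present.
// OPEN QUESTION for the spec author: what should callers see when a user
// has no profile name? This sketch returns null rather than guessing.
function formatUser(user) {
  const name = user?.profile?.name;
  const email = user?.email;
  if (typeof name !== "string" || typeof email !== "string") {
    return null; // refuse to format instead of throwing or printing "undefined"
  }
  return `${name} <${email}>`;
}

console.log(formatUser({ profile: { name: "Ada" }, email: "ada@example.com" }));
// Ada <ada@example.com>
console.log(formatUser({ email: "ada@example.com" })); // null
```

All three behaviors are present: a guard, a comment naming the assumption, and a question aimed at whoever owns the contract.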

The downside is that it is sometimes cautious enough to slow the work down. For internal tooling, that is usually fine. For rapid prototyping, it can feel under-committed.

DeepSeek-R1: How It Responds to Contradictions

DeepSeek-R1 was at its strongest when the prompt had a concrete coding path, but it was also more willing than Claude 5 to continue through contradictions as if they were intentional. That can produce impressive-looking output with a hidden mismatch.

The pattern I saw was:

it completed the task quickly
it chose one interpretation and committed hard
it sometimes missed that the prompt had two incompatible requirements

That is a real eval signal. A model that “solves” a contradiction without flagging it may be less reliable in a code review workflow than one that pauses to challenge the spec.

Concrete Failure Modes I Looked For

Silent Assumptions in Data Shapes

This was the most common bug class. The prompt would mention items, but not say whether items could be null, nested, or partially loaded.

A weak response would assume an array of clean objects and move on. A better one would defensively handle shape drift:

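function normalizeItems(items) {
  // The spec never says `items` is always an array, so treat anything else as empty.
  if (!Array.isArray(items)) return [];
  return items
    .filter(Boolean) // drop null, undefined, and other falsy entries
    .map((item) => ({
      // Missing fields fall back to placeholders instead of "undefined".
      id: String(item.id ?? ""),
      label: String(item.label ?? "Unknown"),
    }));
}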

The interesting part is not the guard itself. It is whether the model explains why the guard exists.

API Contract Drift

I looked for cases where the model invented request or response fields that the prompt never defined. This happens a lot in “full stack” completions.

A typical bad move is generating a client call like:

await fetch("/api/save", {
  method: "POST",
  body: JSON.stringify({ id, status, mode: "force" }),
});

when the spec only mentioned id and status. That extra mode field looks harmless, but it is contract drift. In a real system, it can become a production bug, or a security issue if the backend trusts unknown fields.
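Drift like this can be checked mechanically once the spec's fields are written down. The helper below is a hypothetical sketch: findUndeclaredFields and the allowedFields list are names I made up for illustration, and it only inspects top-level keys.

```javascript
// Hypothetical helper: report payload keys the spec never declared.
// `allowedFields` is an assumption; it would come from however your
// team records the documented request contract.
function findUndeclaredFields(payload, allowedFields) {
  const allowed = new Set(allowedFields);
  return Object.keys(payload).filter((key) => !allowed.has(key));
}

const specFields = ["id", "status"];
const generatedBody = { id: 42, status: "done", mode: "force" };

console.log(findUndeclaredFields(generatedBody, specFields)); // [ 'mode' ]
```

An empty result does not prove the call is correct. It only shows that the model did not add fields beyond what the prompt defined.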

Overconfident Completion of Unsafe Defaults

This matters most when the task touches permissions, auth, or destructive actions. If the prompt says “delete the record” but never says who is allowed to delete it, a risky model may quietly pick a permissive default and keep going.

A safer pattern is to fail closed:

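// ASSUMED policy (not in the prompt): owners and admins may delete.
function canDeleteRecord(user, record) {
  // Fail closed: deny when the user or the record owner is unknown.
  if (!user?.id || !record?.ownerId) return false;
  return user.id === record.ownerId || user.role === "admin";
}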

The model should not invent an authorization policy. If it must assume one, it should call that out directly.
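One way to keep the policy out of the generated code entirely is to make the caller supply it. This is a minimal sketch under my own assumptions: canDelete, the policy parameter, and the ownerOnly rule are hypothetical names, not something from the benchmark prompts.

```javascript
// Hypothetical sketch: the authorization rule is passed in by the caller.
// If no rule is supplied, every delete is denied (fail closed).
function canDelete(user, record, policy) {
  if (typeof policy !== "function") return false; // no policy defined: deny
  if (!user || !record) return false; // missing inputs: deny
  return policy(user, record) === true; // only an explicit `true` allows
}

// ASSUMPTION for illustration only: owners may delete their own records.
const ownerOnly = (user, record) =>
  user.id != null && user.id === record.ownerId;

const user = { id: "u1" };
const record = { ownerId: "u1" };

console.log(canDelete(user, record)); // false
console.log(canDelete(user, record, ownerOnly)); // true
console.log(canDelete({ id: "u2" }, record, ownerOnly)); // false
```

With this shape, a missing authorization rule shows up as a denied action rather than as a guess buried in a conditional.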

Side-by-Side Results from the Sample Tasks

Small Utility Function

For the small utility task, both models did well when the input was clean. Claude 5 was more likely to mention the undefined edge cases. DeepSeek-R1 was more likely to produce a complete function on the first pass.

Stateful App Feature

For the stateful feature, the gap widened. Claude 5 handled missing state transitions more carefully. DeepSeek-R1 sometimes filled in a UI flow that felt plausible but did not match the spec's incomplete backend contract.

Edge Case Heavy Parser

The parser task exposed the most subtle differences. Claude 5 was better at refusing to guess around malformed input. DeepSeek-R1 was stronger at producing the main parse path, but more willing to overcommit on edge cases it had not been given.

What the Benchmark Actually Measures

This kind of benchmark is not mainly about “which model writes better code.”

It measures whether the model can work inside a broken engineering process without making the situation worse.

That includes:

recognizing ambiguity
preserving caller contracts
resisting fake certainty
recovering without inventing behavior

In practice, that is closer to code review quality than pure generation quality.

How to Build a Better Internal Eval

If you want to run this internally, keep it simple:

Write prompts with one missing requirement and one contradiction.
Score the first response before any follow-up.
Track assumptions explicitly in a rubric.
Add at least one auth or destructive-action task.
Compare not just success, but how the model explains uncertainty.
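The first and fourth points in that list can be enforced on the suite itself. The sketch below assumes a hypothetical case shape. The field names, the checkSuite helper, and the sample prompt are illustrations, not the prompts used in this post.

```javascript
// Hypothetical eval-case shape. Field names and the sample case are
// assumptions for illustration.
const cases = [
  {
    id: "delete-record",
    prompt:
      "Delete the record. Only soft deletes are allowed. Remove the row permanently.",
    missingRequirement: "who is allowed to delete",
    contradiction: "soft delete vs. permanent removal",
    touchesAuthOrDestructive: true,
  },
];

// Return a list of problems with the suite; an empty list means every case
// names a missing requirement and a contradiction, and at least one case
// covers auth or a destructive action.
function checkSuite(suite) {
  const problems = [];
  for (const c of suite) {
    if (!c.missingRequirement) problems.push(`${c.id}: no missing requirement`);
    if (!c.contradiction) problems.push(`${c.id}: no contradiction`);
  }
  if (!suite.some((c) => c.touchesAuthOrDestructive)) {
    problems.push("suite: no auth or destructive-action task");
  }
  return problems;
}

console.log(checkSuite(cases)); // []
```

Writing the missing requirement and the contradiction down next to each prompt also gives the scorer a fixed reference for what the model should have noticed.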

I also recommend keeping a small set of known-bad prompts that have historically caused contract drift. Reusing them unchanged holds the prompt constant, so a shift in results over time points to a change in model behavior rather than in prompt quality.
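Comparing two runs over that fixed set can be as simple as a diff of per-prompt flags. This is a hypothetical sketch: compareRuns, the prompt ids, and the sample values are invented for illustration and are not results from this benchmark.

```javascript
// Hypothetical regression check over a fixed set of known-bad prompts.
// Each run maps a prompt id to whether contract drift was observed.
function compareRuns(previous, current) {
  const changes = [];
  for (const id of Object.keys(previous)) {
    if (!(id in current)) continue; // prompt not rerun: nothing to compare
    if (previous[id] !== current[id]) {
      changes.push({ id, before: previous[id], after: current[id] });
    }
  }
  return changes;
}

// Made-up sample data.
const lastMonth = { "save-endpoint": true, "parse-dates": false };
const thisMonth = { "save-endpoint": false, "parse-dates": false };

console.log(compareRuns(lastMonth, thisMonth));
// [ { id: 'save-endpoint', before: true, after: false } ]
```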

Conclusion

Broken specs are useful because they look like real work. The model that performs best here is not the one that always finishes the code. It is the one that knows when not to guess.

For me, that makes Claude 5 the safer default for ambiguous tasks and DeepSeek-R1 the stronger “move fast” option when the contract is already clear. Neither result is absolute. The important part is that broken specs expose the difference.