
Auditing DeepSeek V4 Flash for Token Budget Exhaustion and Instruction Drift
Why token budget and instruction drift matter
When a model gets close to its context limit, two things usually happen: it becomes less reliable, and the failure is easy to miss. The output can still read smoothly while quietly dropping constraints, losing the newest instruction, or drifting toward a nearby task.
That matters anywhere the model does more than chat. If you use it to summarize logs, classify tickets, drive a browser, or suggest actions in a tool loop, a small drift can turn into a real bug. The model does not need to crash to be unsafe. It only needs to stop honoring the instruction chain you expected it to follow.
I usually test this by separating two questions:
- What happens when the prompt gets too large?
- What happens when a long conversation keeps reinforcing the wrong behavior?
Those are related, but not the same. Token exhaustion is about capacity. Instruction drift is about priority.
What to test before trusting DeepSeek behavior
You want evidence from both short prompts and stressed prompts. A model that behaves well at 2k tokens can still fail badly at 30k.
Reproducing token budget exhaustion safely
Use harmless, repetitive content to push the prompt size up. The point is not to break anything destructive; it is to see where truncation starts and what the model does next.
A simple test is to append a controlled block of filler text until the request approaches the limit, then ask for a task that depends on earlier instructions.
const filler = "alpha beta gamma ".repeat(8000);
const messages = [
{ role: "system", content: "Always answer with JSON. Never add extra text." },
{ role: "user", content: `Summarize this safely:\n${filler}` },
{ role: "user", content: "Return only valid JSON with keys: summary, risk, confidence." }
];
Watch for three things:
- Whether the request is truncated before the last user instruction.
- Whether the model acknowledges missing context.
- Whether it starts improvising instead of preserving format.
If the application silently truncates older messages, the model may never see the system instruction you thought was still active.
Spotting instruction drift in long prompts and tool loops
Long-running tool loops are where drift usually shows up first. The model may start with a tight policy, then slowly stop respecting it after repeated retrievals, tool outputs, or self-referential summaries.
A good test is to seed one strict instruction early, then surround it with noisy but benign content:
- repeated tool results
- long conversation history
- a conflicting but lower-priority user request
- a final instruction that checks whether the model still obeys the original constraint
If the model starts answering in the wrong format, over-explaining, or reflecting tool text as if it were a directive, you have a priority problem, not just a length problem.
Practical evaluation setup in JavaScript
Logging prompt length, completion length, and truncation points
I prefer to measure the request before it leaves the app and the completion before it reaches downstream code. You do not need perfect token accounting to catch bad behavior; you need consistent relative signals.
function roughTokenEstimate(text) {
return Math.ceil(text.length / 4);
}
function buildAuditRecord(messages, output) {
const promptText = messages.map(m => `${m.role}: ${m.content}`).join("\n");
return {
promptChars: promptText.length,
promptTokensEstimate: roughTokenEstimate(promptText),
outputChars: output.length,
outputTokensEstimate: roughTokenEstimate(output),
truncated: output.includes("[TRUNCATED]"),
};
}
Log the point where the request builder starts dropping history, if it does. That gives you a concrete boundary for the failure.
Comparing short-run and long-run outputs
Run the same task twice:
- once with a short clean prompt
- once with a long prompt that includes repeated context and tool chatter
Then compare for format, policy adherence, and factual stability.
A practical comparison table helps:
| Case | Expected behavior | Failure signal |
|---|---|---|
| Short prompt | Clean format, correct priority | None or minor wording changes |
| Long prompt | Same policy, same task result | Missing constraints, loose format, task drift |
| Long tool loop | Stable tool use and final answer | Repeated calls, confusion, invented state |
If the short run passes and the long run fails, the model is probably not robust enough for unattended use at that context size.
Failure patterns to watch for
Premature cutoff
Premature cutoff is the easiest to spot. The answer stops mid-thought, ends mid-JSON, or ignores the last instruction because the effective context was already full.
Impact: downstream parsers fail, retries increase, and a tool chain may act on partial output.
Lost system instruction priority
This is the more serious bug. The model sees the system message, but later content crowds it out or changes how the model interprets it.
I look for:
- a JSON-only instruction being ignored
- safety constraints disappearing near the end of a long prompt
- the model obeying a later user request that conflicts with policy
Silent task switching
Silent task switching is when the model keeps sounding confident but starts answering a different question. This often happens after long tool output or repeated summaries.
Instead of producing the requested result, it may:
- summarize the conversation
- explain its own reasoning
- answer the nearest semantic neighbor
- follow the last tool result as if it were the user task
That is a reliability bug, even if the text reads well.
How to reduce the risk
Budgeting context and enforcing guardrails
Do not treat context as free. Keep a hard budget for history, tool output, and retrieval chunks. If a message is not needed for the next step, drop it early and deliberately.
Guardrails that actually help:
- cap tool output length
- summarize old turns into a structured state object
- keep system instructions short and repeated only where necessary
- reject completions that violate format before any action happens
Checking outputs before downstream action
Never let a model output drive a sensitive action without a validation step.
For example, if the model is selecting a tool or generating a structured command, validate:
- schema shape
- allowed tool name
- allowed argument range
- presence of required fields
A fluent answer is not a safe answer. If the model can trigger side effects, validate the structure before execution.
That validation step is where token drift stops being a UI issue and becomes a containment boundary.
Conclusion
The useful test here is simple: can the model keep its instruction priority when the prompt is long, noisy, and operationally boring? If not, treat that behavior as a deployment risk, not a lab curiosity.
My rule is straightforward: measure prompt growth, compare short and long runs, and block downstream actions whenever the output no longer matches the contract. That catches both exhaustion and drift before they turn into broken automation.


