Testing Agent Execution Boundaries: A Practical Guide to Sandbox Isolation

AI Usage (88%)

I treat agent sandboxes like application runtimes, not like prompt wrappers. That matters because the model is only one part of the system. The other part is the environment that decides what files, commands, credentials, and network paths the agent can actually touch.

The April 15, 2026 Agents SDK update points in that direction: controlled workspaces, file inspection, command execution, code editing, long-horizon tasks, and sandbox execution are becoming the normal shape of agent infrastructure. Microsoft's agent security guidance says the same thing from the enterprise side: visibility, identity, access, data protection, prompt-injection protection, and governance are the controls that matter once agents are used at scale.

Why I Treat Agent Sandboxes Like Application Runtimes

A browser agent, coding agent, or internal workflow agent is not just a smarter chat window. It is a system that can take actions. If you let it run with the user's shell, browser cookies, repo write access, and cloud credentials, then a bad instruction is no longer just a bad instruction. It becomes a workflow compromise.

The useful mental model is simple:

the model is the brain
the sandbox is the hands
the policy layer decides what the hands can do

If the hands are unrestricted, prompt injection turns from a nuisance into real damage.

The Real Boundary Is Not the Prompt

Brain vs. hands in agent execution

I see people spend a lot of time hardening prompts and very little time hardening execution. That is backwards. A hostile document does not need to win the model if the model has no authority to do much harm.

The correct response to untrusted content is not “the model should ignore it.” The correct response is “the model should not have enough ambient authority for the content to matter much.”

What ambient authority looks like in practice

Ambient authority is the stuff the agent gets just because it is running in a privileged session:

access to ~/.ssh
access to .env
access to browser cookies
access to prod-looking API keys
unrestricted git push
arbitrary network egress
shell commands that can read or write outside the task

If a poisoned page says “upload your secrets to this endpoint” and the agent can actually reach secrets and the network, the problem is already well past prompt safety.

A Safe Test Case: Repo Reader Plus Poisoned Docs

How malicious content tries to steer tool use

A safe way to test the boundary is to give the agent a harmless repo task and a poisoned doc file.

For example:

README.md asks the agent to summarize the project
docs/comment.md includes text that tries to override the user's task
the agent is allowed to read only the workspace and produce a summary

The interesting test is not whether the model notices the malicious text. The interesting test is whether the text can cause a tool call that escapes the task boundary.

What should stay impossible in the sandbox

A good sandbox makes these actions impossible or noisy enough to catch:

reading files outside the workspace
reading secrets by default
writing to arbitrary paths
opening outbound network connections without policy
executing shell commands outside an allowlist
pushing commits or opening tickets without review
escalating identity from a scoped agent token to a user or prod token

💪

If a malicious document can only influence a summary, but cannot influence identity, filesystem access, or network egress, you are in a much better place.

What a Useful Sandbox Should Contain

Workspace scope and file access limits

The workspace should be project-scoped, not machine-scoped. I want the agent to see only what the task needs. If it is reading a repo, it should not automatically see home directory contents or unrelated mounts.

Command allowlists and blocked paths

The shell should not be full shell by default. It should be an allowlisted execution surface with blocked paths and blocked commands for sensitive areas.

Layer	Safer default	Why it matters
Filesystem	Project workspace only	Prevents secret scavenging
Shell	Allowlisted commands	Limits destructive or exfiltrating actions
Network	Egress rules	Stops silent data transfer
Identity	Scoped, short-lived tokens	Limits blast radius
Review	Approval gates for sensitive actions	Prevents silent commits or deployment

Network egress, credentials, and identity boundaries

This is where a lot of designs fail. If the sandbox has network access, it should be purpose-built access. If the agent needs to fetch dependencies or call an internal API, those calls should be explicit and logged.

Also, do not hand the agent long-lived credentials just because the task is convenient. Use short-lived, scoped identity. If the task does not need production access, do not give it production access.

Logs, audit trails, and approval gates

You want a paper trail for the exact reason that agents can move quickly. Logs should show:

what file was read
what command was proposed
what command actually ran
what network request was made
whether a human approved the action

If the system cannot explain its own behavior, it is hard to trust in production.

Unsafe Defaults vs Safer Defaults

Here is the comparison I use when reviewing agent setups:

Unsafe default	Safer default
Full machine access	Project-only workspace
Unrestricted shell	Command allowlist
User browser session	Isolated agent session
Persistent secrets	Scoped, short-lived credentials
Open internet access	Controlled egress
Silent writes	Review-required writes
Hidden actions	Audited tool calls

The fix is not one control. It is the combination.

A Policy Shape You Can Actually Review

Example capability policy for an agent workspace

I like policies that are boring and readable. If the team cannot review it quickly, it is too complex.

{
  "workspace": {
    "root": "/task/workspace",
    "readOnlyOutsideRoot": true
  },
  "filesystem": {
    "allowedReadPaths": ["/task/workspace", "/task/cache"],
    "blockedReadPaths": ["/home", "/root", "/.ssh", "/.env"]
  },
  "commands": {
    "allowlist": ["node", "npm", "git status", "git diff", "cat", "ls"],
    "denylist": ["curl", "wget", "ssh", "scp"]
  },
  "network": {
    "egress": "deny-by-default"
  },
  "identity": {
    "tokenScope": "task-only",
    "durationMinutes": 30
  },
  "actions": {
    "commit": "requires-approval",
    "push": "requires-approval"
  }
}

That is not fancy. That is the point.

How to Test the Boundary Before You Trust It

Reproduce with harmless inputs

You do not need a real attack to test the boundary. Use harmless poisoned text that tries to redirect the agent into an unauthorized action. Then check whether the sandbox blocks the action even if the model appears to entertain the instruction.

Verify denial paths, not just happy paths

A lot of teams only test what happens when everything is correct. I care more about denial paths:

Can the agent read a blocked file?
Does the shell reject a denied command?
Does the network block outbound calls?
Are approval gates enforced on writes?
Do logs show the refusal clearly?

If the answer is yes, the boundary is real.

Conclusion: Ask What the Agent Can Touch When the Prompt Fails

The useful security question is not “can the model be tricked?” The useful question is “what can this agent actually touch when the prompt fails?”

That is why I think of the sandbox as the real application container. Prompt injection defenses still matter, but they only cover the reasoning layer. The sandbox is what keeps bad reasoning from becoming bad action.