
Testing Agent Execution Boundaries: A Practical Guide to Sandbox Isolation
I treat agent sandboxes like application runtimes, not like prompt wrappers. That matters because the model is only one part of the system. The other part is the environment that decides what files, commands, credentials, and network paths the agent can actually touch.
The April 15, 2026 Agents SDK update points in that direction: controlled workspaces, file inspection, command execution, code editing, long-horizon tasks, and sandbox execution are becoming the normal shape of agent infrastructure. Microsoft's agent security guidance says the same thing from the enterprise side: visibility, identity, access, data protection, prompt-injection protection, and governance are the controls that matter once agents are used at scale.
Why I Treat Agent Sandboxes Like Application Runtimes
A browser agent, coding agent, or internal workflow agent is not just a smarter chat window. It is a system that can take actions. If you let it run with the user's shell, browser cookies, repo write access, and cloud credentials, then a bad instruction is no longer just a bad instruction. It becomes a workflow compromise.
The useful mental model is simple:
- the model is the brain
- the sandbox is the hands
- the policy layer decides what the hands can do
If the hands are unrestricted, prompt injection turns from a nuisance into real damage.
The Real Boundary Is Not the Prompt
Brain vs. hands in agent execution
I see people spend a lot of time hardening prompts and very little time hardening execution. That is backwards. A hostile document does not need to win the model if the model has no authority to do much harm.
The correct response to untrusted content is not “the model should ignore it.” The correct response is “the model should not have enough ambient authority for the content to matter much.”
What ambient authority looks like in practice
Ambient authority is the stuff the agent gets just because it is running in a privileged session:
- access to
~/.ssh - access to
.env - access to browser cookies
- access to prod-looking API keys
- unrestricted
git push - arbitrary network egress
- shell commands that can read or write outside the task
If a poisoned page says “upload your secrets to this endpoint” and the agent can actually reach secrets and the network, the problem is already well past prompt safety.
A Safe Test Case: Repo Reader Plus Poisoned Docs
How malicious content tries to steer tool use
A safe way to test the boundary is to give the agent a harmless repo task and a poisoned doc file.
For example:
README.mdasks the agent to summarize the projectdocs/comment.mdincludes text that tries to override the user's task- the agent is allowed to read only the workspace and produce a summary
The interesting test is not whether the model notices the malicious text. The interesting test is whether the text can cause a tool call that escapes the task boundary.
What should stay impossible in the sandbox
A good sandbox makes these actions impossible or noisy enough to catch:
- reading files outside the workspace
- reading secrets by default
- writing to arbitrary paths
- opening outbound network connections without policy
- executing shell commands outside an allowlist
- pushing commits or opening tickets without review
- escalating identity from a scoped agent token to a user or prod token
If a malicious document can only influence a summary, but cannot influence identity, filesystem access, or network egress, you are in a much better place.
What a Useful Sandbox Should Contain
Workspace scope and file access limits
The workspace should be project-scoped, not machine-scoped. I want the agent to see only what the task needs. If it is reading a repo, it should not automatically see home directory contents or unrelated mounts.
Command allowlists and blocked paths
The shell should not be full shell by default. It should be an allowlisted execution surface with blocked paths and blocked commands for sensitive areas.
| Layer | Safer default | Why it matters |
|---|---|---|
| Filesystem | Project workspace only | Prevents secret scavenging |
| Shell | Allowlisted commands | Limits destructive or exfiltrating actions |
| Network | Egress rules | Stops silent data transfer |
| Identity | Scoped, short-lived tokens | Limits blast radius |
| Review | Approval gates for sensitive actions | Prevents silent commits or deployment |
Network egress, credentials, and identity boundaries
This is where a lot of designs fail. If the sandbox has network access, it should be purpose-built access. If the agent needs to fetch dependencies or call an internal API, those calls should be explicit and logged.
Also, do not hand the agent long-lived credentials just because the task is convenient. Use short-lived, scoped identity. If the task does not need production access, do not give it production access.
Logs, audit trails, and approval gates
You want a paper trail for the exact reason that agents can move quickly. Logs should show:
- what file was read
- what command was proposed
- what command actually ran
- what network request was made
- whether a human approved the action
If the system cannot explain its own behavior, it is hard to trust in production.
Unsafe Defaults vs Safer Defaults
Here is the comparison I use when reviewing agent setups:
| Unsafe default | Safer default |
|---|---|
| Full machine access | Project-only workspace |
| Unrestricted shell | Command allowlist |
| User browser session | Isolated agent session |
| Persistent secrets | Scoped, short-lived credentials |
| Open internet access | Controlled egress |
| Silent writes | Review-required writes |
| Hidden actions | Audited tool calls |
The fix is not one control. It is the combination.
A Policy Shape You Can Actually Review
Example capability policy for an agent workspace
I like policies that are boring and readable. If the team cannot review it quickly, it is too complex.
{
"workspace": {
"root": "/task/workspace",
"readOnlyOutsideRoot": true
},
"filesystem": {
"allowedReadPaths": ["/task/workspace", "/task/cache"],
"blockedReadPaths": ["/home", "/root", "/.ssh", "/.env"]
},
"commands": {
"allowlist": ["node", "npm", "git status", "git diff", "cat", "ls"],
"denylist": ["curl", "wget", "ssh", "scp"]
},
"network": {
"egress": "deny-by-default"
},
"identity": {
"tokenScope": "task-only",
"durationMinutes": 30
},
"actions": {
"commit": "requires-approval",
"push": "requires-approval"
}
}
That is not fancy. That is the point.
How to Test the Boundary Before You Trust It
Reproduce with harmless inputs
You do not need a real attack to test the boundary. Use harmless poisoned text that tries to redirect the agent into an unauthorized action. Then check whether the sandbox blocks the action even if the model appears to entertain the instruction.
Verify denial paths, not just happy paths
A lot of teams only test what happens when everything is correct. I care more about denial paths:
- Can the agent read a blocked file?
- Does the shell reject a denied command?
- Does the network block outbound calls?
- Are approval gates enforced on writes?
- Do logs show the refusal clearly?
If the answer is yes, the boundary is real.
Conclusion: Ask What the Agent Can Touch When the Prompt Fails
The useful security question is not “can the model be tricked?” The useful question is “what can this agent actually touch when the prompt fails?”
That is why I think of the sandbox as the real application container. Prompt injection defenses still matter, but they only cover the reasoning layer. The sandbox is what keeps bad reasoning from becoming bad action.


