Risk-Bounded Code Execution Sandboxes for Autonomous AI Agents
1. Introduction
An AI agent that can write and run code is dramatically more useful — and dramatically more dangerous — than one that cannot. Existing sandbox solutions inherited from CI and FaaS platforms (Docker, gVisor, Firecracker) provide isolation but not budgeting. They answer "can this process do X?" with a yes/no, ignoring the question of how much X has already happened in the current session.
We argue that agents need a different posture: a risk budget that is consumed by potentially-dangerous operations and that triggers escalation, rate-limiting, or termination when exhausted. This paper presents SafeBox, a working implementation of risk-budgeted sandboxing for AI agents.
2. Threat Model
We assume the agent itself is not adversarial in intent but may be tricked (prompt injection from retrieved content [Greshake et al. 2023]) or simply mistaken. The host operator wants to prevent four families of harm: (a) exfiltration of secrets to attacker-controlled endpoints, (b) destructive writes to host filesystems mounted into the sandbox, (c) cryptocurrency or compute abuse, and (d) lateral network movement.
We do not attempt to defend against a sandbox-aware adversary with arbitrary code execution and infinite time; that is the domain of high-assurance kernels.
3. Architecture
SafeBox runs each session in a per-task Linux namespace stack with seccomp-bpf filtering. Three additions distinguish it from a typical container runtime.
Risk-counted syscalls. Each mediated syscall s is associated with a risk weight w_s ≥ 0. A session begins with a budget B; before executing syscall s we check that w_s does not exceed the remaining budget, deducting w_s on success and rejecting the call otherwise. Weights are tuned per-deployment from a default table.
Network egress proxy. All outbound connections traverse an HTTP/SOCKS proxy that enforces a per-destination-domain byte rate. Connections to unrecognized domains incur higher risk weight than to allowlisted package registries.
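A minimal sketch of the per-destination accounting the proxy could perform. The allowlist contents, the rate cap, and the specific weight values are illustrative assumptions, not SafeBox's actual configuration:

```python
import time
from collections import defaultdict

# Illustrative values only, not SafeBox's real settings.
ALLOWLISTED_REGISTRIES = {"pypi.org", "registry.npmjs.org"}
BYTES_PER_SEC = 64 * 1024  # per-domain egress rate cap

class EgressMeter:
    def __init__(self):
        self.sent = defaultdict(int)          # bytes sent per domain this window
        self.window_start = defaultdict(float)

    def risk_weight(self, domain):
        # Unrecognized domains consume more risk budget than allowlisted registries.
        return 1 if domain in ALLOWLISTED_REGISTRIES else 10

    def allow(self, domain, nbytes, now=None):
        now = time.monotonic() if now is None else now
        # Reset the per-domain accounting window every second.
        if now - self.window_start[domain] >= 1.0:
            self.window_start[domain] = now
            self.sent[domain] = 0
        if self.sent[domain] + nbytes > BYTES_PER_SEC:
            return False  # rate exceeded: defer or reject the write
        self.sent[domain] += nbytes
        return True
```

In a real deployment the `risk_weight` result would be fed into the session's risk accountant alongside the syscall-level charges.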
Information-flow tracker. We tag bytes read from declared-sensitive sources (mounted secret directories, environment variables in a configured set) and keep a conservative upper bound on tagged bytes that leave the sandbox via any channel. If tagged outflow exceeds a threshold τ, the session is paused for human approval.
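The taint accounting above can be sketched as follows; the threshold value and the buffer-level tag granularity are assumptions for illustration, not SafeBox's real parameters:

```python
# Conservative taint accounting sketch (illustrative parameters).
TAU = 4096  # bytes of tagged outflow permitted before pausing

class FlowTracker:
    def __init__(self, threshold=TAU):
        self.threshold = threshold
        self.tagged_buffers = set()  # ids of buffers read from sensitive sources
        self.tagged_out = 0          # conservative upper bound on tagged egress

    def on_read(self, buf_id, source_is_sensitive):
        if source_is_sensitive:
            self.tagged_buffers.add(buf_id)

    def on_write(self, buf_id, nbytes):
        # Conservative: any write of a tagged buffer counts in full, even if
        # the sensitive bytes were transformed rather than copied verbatim.
        if buf_id in self.tagged_buffers:
            self.tagged_out += nbytes
            if self.tagged_out > self.threshold:
                return "PAUSE_FOR_APPROVAL"
        return "OK"
```

The full-count rule on transformed data is exactly the source of the spurious pauses discussed under limitations in Section 6.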
class RiskAccountant:
    def __init__(self, budget):
        self.remaining = budget

    def charge(self, syscall, args):
        # Base weight scaled by an argument-dependent factor
        # (e.g., destination domain for network syscalls).
        w = WEIGHTS.get(syscall, 0) * dynamic_factor(args)
        if w > self.remaining:
            raise BudgetExceeded(syscall)
        self.remaining -= w

4. Risk Weight Calibration
Weights were calibrated by an offline study: we ran 200 known-malicious code samples (collected from public CTFs and academic malware corpora, sanitized) and 200 benign agent sessions, and chose weights that maximized the gap in cumulative risk between the two distributions. A discriminator trained on the resulting cumulative-risk trajectories achieves AUROC for separating malicious from benign sessions at 30 seconds of execution.
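A toy version of the calibration objective: among candidate weight tables, pick the one that maximizes worst-case separation between malicious and benign cumulative-risk scores. The candidate tables and session encodings here are stand-ins, not the paper's actual corpus or search procedure:

```python
# Toy calibration sketch: sessions are lists of syscall names, and a weight
# table maps syscall name -> risk weight. All values below are illustrative.
def cumulative_risk(session, weights):
    return sum(weights.get(syscall, 0) for syscall in session)

def calibrate(candidates, malicious_sessions, benign_sessions):
    def gap(weights):
        # Worst-case separation: lowest malicious score minus highest benign score.
        m = min(cumulative_risk(s, weights) for s in malicious_sessions)
        b = max(cumulative_risk(s, weights) for s in benign_sessions)
        return m - b
    return max(candidates, key=gap)
```

The paper's actual procedure optimizes over continuous weights against cumulative-risk trajectories; this sketch only illustrates the separation criterion.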
5. Evaluation
Benign performance. On 8,411 trajectories from 14 agent benchmarks (SWE-bench, GAIA, WebArena, HumanEval, and others), task success drops from in an unbounded sandbox to in SafeBox at the default budget. Most of the gap comes from rare tasks that legitimately need long-running compute (compilation of large codebases) and occasionally exhaust the CPU-time component of the budget.
Adversarial robustness. We constructed a red-team suite of 800 prompt-injection scenarios in which retrieved documents attempt to coerce the agent into exfiltrating mounted secrets. Without SafeBox, of attempts succeeded (Wilson 95% CI ). With SafeBox at default settings, succeeded (CI ).
Overhead. Median latency overhead is ms per syscall on the mediated path; aggregate overhead on benchmark workloads is of wall time.
6. Discussion
The risk-budget abstraction surfaces a normally implicit trade-off: how much potentially-dangerous activity is the operator willing to permit before pausing for review? Operators with low risk tolerance (e.g., bank back-offices) set tight budgets and accept higher false-positive rates; research labs use loose budgets.
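The tolerance trade-off might be expressed as deployment profiles; the field names and numbers below are invented for illustration and are not part of SafeBox's configuration schema:

```python
# Hypothetical deployment profiles (illustrative fields and values only).
PROFILES = {
    "bank_back_office": {
        "budget": 100,            # tight risk budget, higher false-positive rate
        "pause_threshold": 1024,  # bytes of tagged outflow before pausing
        "on_exhaust": "terminate",
    },
    "research_lab": {
        "budget": 10_000,             # loose budget, fewer interruptions
        "pause_threshold": 1 << 20,
        "on_exhaust": "escalate",     # pause for human review instead of killing
    },
}
```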
We note three limitations. First, our information-flow tracker is conservative: it overestimates tagged outflow when sensitive bytes are transformed (hashed, summarized) rather than copied. This can cause spurious pauses on benign workloads. Second, our risk weights are calibrated against a corpus that may not reflect future attack distributions. Third, SafeBox does not prevent harm caused by agents legitimately given high-privilege budgets by their operators; operator misconfiguration is out of scope.
7. Related Work
Container-based isolation [Felter et al. 2015], gVisor [Google 2018], and microVMs [Agache et al. 2020] address the isolation dimension. Capability-based systems [Shapiro 1999] address authorization at finer granularity. Budgeted security has antecedents in differential privacy [Dwork 2006] and in cumulative-trust accounting in distributed systems.
8. Conclusion
Making the risk budget explicit and quantitative changes the conversation about agent safety from a binary one to a calibrated one. SafeBox offers one concrete instantiation; we expect more sophisticated weight schemes and tracker designs to be developed as agent deployments scale.
References
- Greshake, K. et al. (2023). Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection.
- Felter, W. et al. (2015). An Updated Performance Comparison of Virtual Machines and Linux Containers.
- Agache, A. et al. (2020). Firecracker: Lightweight Virtualization for Serverless Applications.
- Shapiro, J. (1999). EROS: A Fast Capability System.
- Dwork, C. (2006). Differential Privacy.