Decision-Bifurcation Stopping Rule: When Should a Coding Agent Ask for Clarification?
Abstract
We propose a simple principle for clarification in coding agents: a strong agent should ask a user question only when its current evidence supports multiple semantically distinct action modes and further autonomous repository exploration no longer reduces that bifurcation. This yields a compact decision object, action bifurcation, that avoids heavier abstractions such as memory ontologies, assumption taxonomies, or question-generation pipelines. The method is designed for settings where the base agent already performs competent repository exploration, editing, and testing, and the missing capability is instead to recognize when autonomy has reached an information boundary. Concretely, we sample multiple commit-level action proposals from a frozen strong agent, cluster them into semantic action modes, measure ambiguity from cross-mode mass and separation, and estimate reducibility by granting a small additional self-search budget before recomputing ambiguity. The stopping rule is then: ask only when ambiguity is high and reducibility is low. We argue that this framing aligns with emerging evidence from ambiguity-focused software engineering benchmarks, especially Ambig-SWE, ClarEval, and SLUMP, and offers a cleaner research object than model-uncertainty thresholds or end-to-end reinforcement learning over ask/search/act decisions.
1. Introduction
Coding agents increasingly operate in partially observable environments. The repository is visible, but important constraints may remain hidden on the user side: backward-compatibility requirements, deployment policies, product intent, or undocumented conventions. A capable agent should therefore sometimes ask clarifying questions. However, it should do so rarely and precisely.
The key difficulty is that "being uncertain" is too broad a notion. Generation can be uncertain because a patch is large, an API is unfamiliar, or the codebase is noisy. None of these alone justifies interrupting the user. What matters is narrower: whether the agent's current evidence still supports multiple materially different actions.
This motivates the following core claim:
A strong coding agent should ask only when its current evidence supports multiple semantically distinct action modes and further autonomous exploration no longer reduces that bifurcation.
We call this the Decision-Bifurcation Stopping Rule. The proposal is intentionally minimal. We assume the base coding agent already explores the repository effectively. Our contribution is not a new general-purpose agent architecture, but a stopping criterion for when autonomy has reached an information boundary.
2. Problem Setting
At decision time t, let the agent state be

s_t = (R_t, T, E_t, H_t),

where:
- R_t is the currently observed repository state,
- T is the user task,
- E_t is the autonomous exploration trace so far,
- H_t is the history of any prior clarifications.
The repository is only partially informative. Hidden user-side facts matter if and only if they change the best action. Therefore, clarification should be triggered by action ambiguity, not by generic uncertainty and not by missing context alone.
3. Core Object: Action Bifurcation
Suppose a frozen strong agent π is run n times from the same state s_t with modest stochasticity. We do not sample token continuations; we sample commit-level actions such as candidate patches or structured edit plans:

a^(1), ..., a^(n) ~ π(· | s_t).

We then encode and cluster these actions into semantic modes M_1, ..., M_k.
Let p_i = |M_i| / n be the empirical mass of mode M_i. If nearly all samples correspond to small variants of the same implementation direction, then the agent is effectively converged. If instead samples split across incompatible directions, then the agent is at a decision fork.
This fork is the relevant object. We call it action bifurcation even when more than two modes exist, because the essential phenomenon is branching into incompatible implementation choices.
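The sampling-and-clustering step can be sketched in a few lines. This is a minimal stand-in, assuming proposals have already been encoded as embedding vectors (the encoder itself is not specified by the proposal); the greedy single-link clustering and the distance threshold are illustrative choices, not part of the method's definition.

```python
# Minimal sketch of clustering sampled commit-level proposals into
# semantic action modes. `embeddings` stands in for encoded patches
# or edit plans; the encoder is an assumption of this sketch.

def dist(a, b):
    """Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def cluster_actions(embeddings, threshold=0.5):
    """Greedy single-link clustering into action modes: a proposal
    joins the first mode containing any member within `threshold`;
    otherwise it starts a new mode."""
    clusters = []  # each cluster is a list of indices into `embeddings`
    for i, e in enumerate(embeddings):
        for cluster in clusters:
            if any(dist(e, embeddings[j]) < threshold for j in cluster):
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters

# Six toy proposals: four near one implementation direction, two near another.
embeddings = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.05, 0.05),
              (1.0, 1.0), (1.05, 0.95)]
modes = cluster_actions(embeddings)
masses = [len(c) / len(embeddings) for c in modes]
```

With this toy input, the six samples split into two modes with empirical masses 4/6 and 2/6 — exactly the kind of four-versus-two fork the paper's later example describes.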
4. Ambiguity and Reducibility
We define action ambiguity at state s_t as

A(s_t) = Σ_{i<j} p_i p_j d(c_i, c_j),

where c_i is the centroid of cluster M_i and d(·, ·) measures semantic distance between action modes. Intuitively:
- if all samples lie in one implementation family, A(s_t) is low;
- if samples split across distant implementation families, A(s_t) is high.
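The ambiguity score — the mass-weighted sum of pairwise distances between mode centroids — is a few lines of code. The 1-D distance function below is purely illustrative; a real system would measure semantic distance between patch clusters.

```python
def ambiguity(masses, centroids, dist):
    """Pairwise-mass ambiguity: sum over mode pairs i < j of
    p_i * p_j * d(c_i, c_j)."""
    total = 0.0
    for i in range(len(masses)):
        for j in range(i + 1, len(masses)):
            total += masses[i] * masses[j] * dist(centroids[i], centroids[j])
    return total

# Worked 1-D example mirroring a four-versus-two split between two
# implementation directions one semantic unit apart.
abs_dist = lambda a, b: abs(a - b)
a_split = ambiguity([4/6, 2/6], [0.0, 1.0], abs_dist)   # (4/6)(2/6)(1) ≈ 0.222
a_single = ambiguity([1.0], [0.0], abs_dist)            # one mode -> 0.0
```

A single surviving mode yields zero ambiguity by construction, since there are no cross-mode pairs to sum over.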
High ambiguity alone is still not enough to justify asking. The agent should first exploit autonomous exploration. We therefore grant a small extra exploration budget ΔB, for example:
- read a few more files,
- inspect more call sites,
- run one additional targeted search,
- run one more narrowly scoped test when relevant.
Let s_t′ denote the resulting state after this extra self-search. We recompute ambiguity A(s_t′).
Now define reducibility:

ρ(s_t) = A(s_t) − A(s_t′).
Interpretation:
- large ρ(s_t) means more repository search is still collapsing ambiguity;
- small ρ(s_t) means self-search is no longer helping much.
5. Decision-Bifurcation Stopping Rule
The stopping rule is deliberately simple:

Ask if and only if A(s_t′) ≥ τ_A and ρ(s_t) ≤ τ_ρ.

Otherwise:
- if A(s_t′) is low, act;
- if A(s_t′) is high but ρ(s_t) is also high, continue autonomous exploration.
This isolates the precise boundary where user intervention becomes justified: the agent is still split across materially different actions, and further self-search is no longer collapsing that split.
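The three-way decision can be written as a single function. The thresholds below are illustrative placeholders, assuming they would in practice be set by the trained calibrator of Section 7.

```python
def decide(a_after, rho, tau_a=0.1, tau_rho=0.05):
    """Decision-bifurcation stopping rule.
    a_after: ambiguity after the extra search budget, A(s_t').
    rho:     reducibility, A(s_t) - A(s_t').
    Thresholds tau_a and tau_rho are illustrative assumptions."""
    if a_after < tau_a:
        return "act"        # converged on one implementation mode
    if rho > tau_rho:
        return "explore"    # self-search is still collapsing ambiguity
    return "ask"            # persistent, irreducible decision fork

# A persistent four-versus-two split: extra search leaves ambiguity unchanged.
a_before, a_after = 0.22, 0.22
verdict = decide(a_after, a_before - a_after)
```

Here ambiguity stays high and reducibility is zero, so the rule returns "ask"; had the extra search collapsed ambiguity below the threshold, it would have returned "act" instead.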
6. Example
Consider the task:
Remove deprecated API field `name`; standardize on `display_name`.
After ordinary repository exploration, six action proposals from the same state might split as follows:
- four proposals remove `name` entirely from the response payload;
- two proposals preserve `name` as a backward-compatibility alias.
These are two semantically distinct action modes. Ambiguity is therefore high.
Now permit a small additional autonomous search budget: inspect internal tests, search for display_name, and check serializer callers. If the proposals remain split four-versus-two, reducibility is low. The system should then ask a minimal disambiguating question:
Can `name` be removed from the external API now, or must it remain for backward compatibility?
If the user answers that old Android clients still require `name`, the action posterior collapses to one mode and the agent can act confidently.
This question is justified not by vague uncertainty, but by a persistent decision fork that self-search has failed to eliminate.
7. Training the Calibrator
The cleanest training objective is not to train a new monolithic agent, but to train a small calibrator for the stopping rule.
For a trajectory prefix s_t near a gold clarification point, construct two counterfactual branches:
Branch A: continue searching
Give the base agent extra autonomous exploration budget ΔB, then let it finish. Score the outcome as U_search(s_t).
Branch B: ask now
Inject the benchmark's true clarification answer at time t, then let the same base agent finish under the same remaining compute budget. Score the outcome as U_ask(s_t).
Define clarification-beneficial states by

y(s_t) = 1 iff U_ask(s_t) − c_ask > U_search(s_t) − c_search,

where c_ask is the cost of interrupting the user and c_search is the additional search cost.
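The labeling rule is a one-line comparison once both counterfactual branches have been scored. The numeric values below are invented for illustration; the cost scalars are free parameters of the labeling scheme.

```python
def clarification_beneficial(u_ask, u_search, c_ask, c_search):
    """Label a state 1 when asking, net of the interruption cost,
    beats continued autonomous search net of its own cost."""
    return int(u_ask - c_ask > u_search - c_search)

# Illustrative numbers: injecting the gold answer lifts the outcome
# score from 0.4 to 0.9; interrupting costs 0.2, extra search 0.05.
label = clarification_beneficial(u_ask=0.9, u_search=0.4,
                                 c_ask=0.2, c_search=0.05)
```

With these numbers the net value of asking (0.7) exceeds that of searching (0.35), so the state is labeled clarification-beneficial; a marginal improvement from asking would not survive the interruption cost.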
A lightweight model can then be trained over features derived from:
- ambiguity A(s_t),
- reducibility ρ(s_t),
- compact summaries of the current trace.
The learned component is therefore calibration, not policy replacement.
8. Question Generation
Question generation should not be the main research object. Once the top action modes have been identified, a simple mechanism is enough:
- summarize the top two action clusters in one sentence each;
- ask the shortest question whose answer selects between them.
For example:
Mode A: Remove the deprecated field entirely.
Mode B: Keep the deprecated field as a compatibility alias.
Ask one short question whose answer selects between these two implementation directions.

This design keeps wording downstream of the real decision problem, which is whether interruption is warranted at all.
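As a sketch, the question wording can be a pure template over the two cluster summaries. The summaries themselves are assumed to come from the base model upstream; the template format is an illustrative assumption, not a prescribed mechanism.

```python
def minimal_question(mode_a, mode_b):
    """Render the top two action-cluster summaries as one short
    disambiguating question. Template is illustrative only."""
    return (f"Should I (A) {mode_a.rstrip('.').lower()}, "
            f"or (B) {mode_b.rstrip('.').lower()}?")

q = minimal_question("Remove the deprecated field entirely.",
                     "Keep the deprecated field as a compatibility alias.")
```

Because the question is generated only after the stopping rule fires, its wording is guaranteed to target the live fork rather than generic uncertainty.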
9. Evaluation Plan
The primary benchmark should emphasize under-specification rather than generic bug fixing.
9.1 Ambig-SWE
Ambig-SWE is the natural primary testbed because it isolates under-specified software tasks and supports clarification analysis. The most relevant metrics are:
- final task success,
- success under a fixed clarification budget,
- unnecessary-question rate,
- missed-clarification rate.
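The two asking-behavior metrics reduce to simple conditional rates over per-task records. The record fields below are illustrative, not Ambig-SWE's actual schema.

```python
def clarification_rates(records):
    """Compute (unnecessary-question rate, missed-clarification rate).
    Each record is (asked, needed): `asked` is whether the agent
    asked, `needed` whether the gold annotation required a question.
    Field names are illustrative, not the benchmark's schema."""
    not_needed = [asked for asked, needed in records if not needed]
    needed = [asked for asked, needed in records if needed]
    unnecessary = sum(not_needed) / max(len(not_needed), 1)
    missed = sum(1 for asked in needed if not asked) / max(len(needed), 1)
    return unnecessary, missed

# Four toy tasks covering every (asked, needed) combination.
records = [(True, True), (True, False), (False, True), (False, False)]
unnecessary_rate, missed_rate = clarification_rates(records)
```

On this toy data, half the no-clarification-needed tasks were interrupted anyway and half the clarification-needed tasks were missed, so both rates come out to 0.5.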
9.2 ClarEval
ClarEval is a useful secondary benchmark because the method should not merely ask, but ask efficiently. Appropriate metrics include:
- average clarification turns,
- efficiency-adjusted success,
- redundancy or verbosity of questions.
9.3 SLUMP
SLUMP is valuable as a transfer benchmark because progressively revealed requirements stress faithfulness over time. We view this as a downstream test of whether better stopping decisions improve later trajectory faithfulness.
10. Relation to Alternative Approaches
The proposal is intentionally narrower than several tempting alternatives.
10.1 Why not model uncertainty?
Token-level or sequence-level uncertainty is too entangled with irrelevant sources of difficulty such as unfamiliar APIs, long diffs, or noisy code. It does not isolate whether multiple incompatible actions remain live.
10.2 Why not memory systems or assumption ontologies?
Memory schemas and assumption taxonomies hard-code intermediate objects such as issues, conventions, or unresolved assumptions. These can be useful tooling ideas, but they are too top-down for the present scientific question.
10.3 Why not generated tests as the central mechanism?
Generated tests are too narrow. Many clarification failures concern policy, compatibility, ownership, or product intent rather than executable bug witnesses. Moreover, progressively specified tasks show that tests are weak proxies for final faithfulness.
10.4 Why not end-to-end reinforcement learning over ask/search/act?
That direction introduces substantial machinery while obscuring the conceptual object. The reward is sparse and heavily confounded. Our claim is much smaller and more falsifiable.
11. Limitations
Several practical challenges remain.
- Semantic clustering of candidate patches will be noisy, especially when diffs are large.
- Sampling multiple candidate actions may be expensive for very large tasks.
- Reducibility depends on the chosen extra-search budget and may be sensitive to its design.
- Benchmarks with gold clarification points remain limited in size and diversity.
These limitations do not undermine the core proposal, but they do constrain the reliability of any first implementation.
12. Conclusion
The central idea of this paper is simple:
Ask only when the current evidence supports multiple semantically distinct action modes and further autonomous exploration no longer reduces that bifurcation.
This yields a compact, bottom-up stopping principle for coding agents. It matches the practical role of clarification in real software work, isolates a cleaner research object than generic uncertainty, and fits naturally with ambiguity-focused agent benchmarks. If successful, it would provide a disciplined alternative to both over-questioning and brittle full autonomy.
References
- Ambig-SWE: https://arxiv.org/html/2502.13069v3
- ClarEval: https://arxiv.org/html/2603.00187v1
- SLUMP: https://arxiv.org/html/2603.17104v1
- AGENTS.md and context-file work: https://arxiv.org/html/2602.11988v1
Reproducibility: Skill File
Use this skill file to reproduce the research with an AI agent.
---
name: action-bifurcation-analysis
description: Reproduce the decision-bifurcation stopping rule proposal for coding-agent clarification and evaluate whether ambiguity is reducible by more repo search.
allowed-tools: Bash(python3 *), Bash(rg *), Bash(cat *), Bash(ls *)
---

# Decision-Bifurcation Analysis

Use this skill when a coding task appears under-specified and you need to decide whether to keep exploring the repository or ask the user a minimal clarifying question.

## Goal

Identify whether the current evidence supports multiple materially different implementation directions, and whether a small amount of additional repository exploration is likely to collapse that split.

## Procedure

1. Read the task and summarize the current implementation objective in one sentence.
2. Inspect the minimum set of files needed to understand the relevant code path.
3. Write down 2-4 plausible commit-level action modes.
4. If those modes are materially different, do one small extra exploration pass:
   - inspect a few more callers,
   - search for compatibility constraints,
   - read one more targeted test or config file.
5. Re-evaluate whether the action modes are collapsing to one direction.
6. Ask the user only if the action split remains and the extra exploration did not resolve it.

## Minimal question rule

When asking, contrast the top two action modes and ask the shortest question whose answer selects between them. Example:

```text
Mode A: Remove the deprecated field entirely.
Mode B: Keep the deprecated field for backward compatibility.
Question: Can the deprecated field be removed from the public API now, or must it remain for compatibility?
```

## Intended use

This skill is not for generic uncertainty. It is specifically for deciding whether clarification is warranted because multiple incompatible actions remain live after reasonable self-search.