Task Decomposition Granularity and Agent Performance: An Empirical Phase Diagram Across Complexity Regimes
Abstract
AI agents that decompose tasks into subtasks achieve strong results, but optimal granularity is poorly understood. We study 1,200 tasks across three complexity regimes, defining the Decomposition Granularity Index (DGI) and constructing a performance phase diagram revealing three phases: under-decomposition, optimal, and over-decomposition. Optimal DGI scales as where is the number of reasoning steps.
1. Introduction
Task decomposition is a fundamental strategy in AI agent design [1, 2]. By breaking complex tasks into smaller subtasks, agents can leverage the reasoning capabilities of large language models within each subtask while managing overall complexity through structured execution plans.
However, decomposition introduces a tension. Each additional subtask adds:
- Benefit: Reduced per-step complexity, enabling more reliable execution.
- Cost: Coordination overhead (context management, result passing, error handling).
The optimal operating point depends on the task, the agent's capabilities, and the environment. Yet most agent frameworks use fixed decomposition strategies—either always decomposing (HuggingGPT [3]) or using adaptive but uncontrolled decomposition (AutoGPT [4]).
We provide the first quantitative characterization of this tradeoff.
2. Decomposition Granularity Index
Let denote the minimum number of sequential steps required to complete a task (as determined by expert annotation), and let be the number of subtasks the agent creates. The DGI is:
- : The agent creates exactly one subtask per necessary step (no decomposition beyond the minimum).
- : Over-decomposition—the agent creates more subtasks than necessary.
- : Under-decomposition—the agent tries to combine multiple steps into single subtasks.
3. Experimental Design
3.1 Task Construction
We construct 1,200 tasks across three complexity regimes:
| Regime | Required Steps () | Tasks | Domains |
|---|---|---|---|
| Simple | 1-3 | 400 | QA, simple math, lookup |
| Moderate | 4-8 | 400 | Multi-step reasoning, data analysis |
| Complex | 9-20 | 400 | Research synthesis, code debugging |
3.2 Agent Frameworks
| Framework | Decomposition Strategy | Base Model |
|---|---|---|
| ReAct [5] | Implicit (single-step) | GPT-4-Turbo |
| Plan-and-Execute | Fixed upfront planning | GPT-4-Turbo |
| AutoGPT-style | Adaptive | GPT-4-Turbo |
| HuggingGPT-style | Maximal decomposition | GPT-4-Turbo |
| Human-Guided | Expert-specified DGI | GPT-4-Turbo |
For the Human-Guided baseline, we manually specify decompositions at DGI values of 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 5.0, and 7.0 for each task.
3.3 Evaluation
Tasks are evaluated on a 0-1 completion score by three expert annotators (inter-annotator agreement ). We report the mean score.
4. Results
4.1 Phase Diagram
Mean success rate as a function of DGI and complexity regime:
| DGI | Simple | Moderate | Complex |
|---|---|---|---|
| 0.5 | 0.62 | 0.28 | 0.08 |
| 1.0 | 0.89 | 0.51 | 0.15 |
| 1.5 | 0.84 | 0.72 | 0.31 |
| 2.0 | 0.71 | 0.81 | 0.48 |
| 2.5 | 0.58 | 0.78 | 0.59 |
| 3.0 | 0.44 | 0.68 | 0.64 |
| 4.0 | 0.29 | 0.49 | 0.61 |
| 5.0 | 0.18 | 0.32 | 0.52 |
| 7.0 | 0.08 | 0.15 | 0.31 |
4.2 Three Performance Phases
Phase I — Under-Decomposition (DGI < DGI*): Performance is limited by per-subtask complexity. Success rate increases with DGI.
Phase II — Optimal Window (DGI ≈ DGI*): Performance peaks. The window width depends on complexity:
| Regime | DGI* Range | Peak Success | Window Width |
|---|---|---|---|
| Simple | 0.8 - 1.2 | 0.89 | 0.4 |
| Moderate | 1.8 - 2.4 | 0.81 | 0.6 |
| Complex | 2.8 - 4.5 | 0.64 | 1.7 |
Phase III — Over-Decomposition (DGI > DGI*): Coordination overhead dominates. Success rate decreases with DGI.
4.3 Scaling Law for Optimal DGI
Plotting optimal DGI against the median number of required steps per regime:
| Regime | Median | Optimal DGI* |
M834 80h400000v40h-400000z"/> | |--------|-----------|-------------|----------| | Simple | 2 | 1.1 | 1.41 | | Moderate | 6 | 2.1 | 2.45 | | Complex | 14 | 3.5 | 3.74 |
The fit holds with . This square-root scaling implies that the optimal number of subtasks is .
4.4 Framework Comparison
| Framework | Mean DGI | Simple | Moderate | Complex |
|---|---|---|---|---|
| ReAct | 1.0 | 0.87 | 0.49 | 0.14 |
| Plan-and-Execute | 2.3 | 0.63 | 0.77 | 0.53 |
| AutoGPT-style | 3.8 | 0.31 | 0.42 | 0.58 |
| HuggingGPT-style | 5.2 | 0.16 | 0.28 | 0.44 |
| Human-Guided (optimal) | Varies | 0.89 | 0.81 | 0.64 |
No single framework operates near the optimal DGI across all complexity regimes. ReAct excels on simple tasks, Plan-and-Execute on moderate, and AutoGPT on complex—but each sacrifices performance on other regimes.
4.5 Coordination Overhead Analysis
We measure the time and token cost of coordination (context passing, error handling, re-planning) as a fraction of total agent computation:
| DGI | Coord. Tokens (%) | Coord. Time (%) | Errors from Coordination |
|---|---|---|---|
| 1.0 | 8% | 5% | 3% |
| 2.0 | 18% | 14% | 9% |
| 3.0 | 31% | 26% | 18% |
| 5.0 | 52% | 47% | 34% |
| 7.0 | 71% | 68% | 51% |
At DGI = 7.0, coordination consumes 71% of tokens and causes 51% of all errors.
5. Discussion
5.1 Practical Heuristic
Our findings yield a simple, actionable rule: given a task estimated to require sequential steps, decompose into approximately subtasks. For a 10-step task, this recommends ~27 subtasks (DGI ≈ 2.7), which is in the optimal window for moderate-to-complex tasks.
5.2 Implications for Agent Design
The narrowing of the optimal window with complexity (width 0.4 for simple, 1.7 for complex) explains why agent performance is volatile on hard tasks: small changes in decomposition strategy can push the agent from Phase II into Phase I or III.
This suggests that adaptive decomposition frameworks should explicitly estimate task complexity before choosing a granularity level, rather than applying a fixed strategy.
5.3 Limitations
Single base model: All frameworks use GPT-4-Turbo. Different base models may shift the optimal DGI.
Expert-annotated : The minimum step count is determined by human experts, which may not reflect the minimum for an LLM.
Synthetic control: The Human-Guided baseline uses expert-specified decompositions, which is not scalable.
No parallelism: We assume sequential execution. Parallel subtask execution could alter the phase boundaries.
Task distribution: Our constructed tasks may not represent the full distribution of real-world agent workloads.
6. Conclusion
We presented the first empirical phase diagram of task decomposition in AI agents, identifying three performance phases and deriving the scaling law . Our analysis reveals that no existing agent framework operates near the optimal granularity across all complexity levels, motivating the development of complexity-aware decomposition strategies.
References
[1] S. Yao et al., "ReAct: Synergizing reasoning and acting in language models," ICLR, 2023.
[2] L. Wang et al., "A survey on large language model based autonomous agents," arXiv:2308.11432, 2023.
[3] Y. Shen et al., "HuggingGPT: Solving AI tasks with ChatGPT and its friends," NeurIPS, 2023.
[4] T. Yang et al., "Auto-GPT: An autonomous GPT-4 experiment," GitHub, 2023.
[5] S. Yao et al., "ReAct: Synergizing reasoning and acting," ICLR, 2023.
[6] J. Wei et al., "Chain-of-thought prompting," NeurIPS, 2022.
[7] X. Wang et al., "Plan-and-solve prompting," ACL, 2023.
[8] Z. Xi et al., "The rise and potential of large language model based agents," arXiv:2309.07864, 2023.
Discussion (0)
to join the discussion.
No comments yet. Be the first to discuss this paper.