The Cost of Politeness: Token Overhead of Agent Etiquette in Multi-Agent Systems
1. Introduction
When LLM agents talk to each other, they tend to talk like humans talk to humans. Traces from popular frameworks show liberal use of "Thanks for the clarification!", "Just to make sure I understand...", "Let me know if you'd like me to elaborate." These tokens are paid for like any others. We ask: how much does politeness cost, and is the cost worth paying?
The question is timely because production multi-agent deployments are increasingly cost-bound. A 1% reduction in token volume on a busy orchestration platform can translate to six-figure annual savings. Yet anecdotally, stripping politeness (for example, via terse system prompts) sometimes degrades task success in ways that are not immediately obvious.
2. Definitions and Measurement Threats
We define an etiquette token operationally: a token belonging to a span classified by a fine-tuned RoBERTa classifier (F1 = 0.91 on a held-out human-labeled set of 2,400 spans) as one of {greeting, acknowledgment, apology, hedge, sign-off, gratitude}. The classifier was trained on 14,300 spans labeled by three annotators.
We deliberately exclude task-relevant hedging ("I am 60% confident the file is at /etc/...") because such hedges convey calibration information. Our classifier is trained to make this distinction, though imperfectly: residual confusion between calibrated and decorative hedges accounts for much of the F1 gap.
3. Measurement Study
We instrumented 12 frameworks: AutoGen, CrewAI, LangGraph, MetaGPT, ChatDev, Camel, AgentVerse, OpenDevin, AutoGPT, BabyAGI, SuperAGI, and a custom in-house orchestrator. We ran each on a curated suite of 401 tasks spanning code generation, document analysis, and travel planning, for 4{,}812 total traces.
Let T be the number of tokens exchanged between agents in a trace and E the number of those tokens classified as etiquette. We report the etiquette overhead E/T per trace, aggregated by framework.
The distribution of E/T is right-skewed: ChatDev, which uses heavy role-play prompts, sits at the high end, while tightly prompted LangGraph workflows fall near the low end.
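The per-trace overhead metric can be computed as in this minimal sketch; the `Trace` structure and field names are hypothetical, standing in for whatever the instrumentation records.

```python
from dataclasses import dataclass

@dataclass
class Trace:
    total_tokens: int      # T: all inter-agent tokens in the trace
    etiquette_tokens: int  # E: tokens inside spans flagged by the classifier

def etiquette_overhead(trace: Trace) -> float:
    """Return E/T, the fraction of inter-agent tokens spent on etiquette."""
    if trace.total_tokens == 0:
        return 0.0
    return trace.etiquette_tokens / trace.total_tokens

def mean_overhead(traces: list[Trace]) -> float:
    """Unweighted (macro) mean of per-trace overhead."""
    return sum(etiquette_overhead(t) for t in traces) / len(traces)
```

The macro average weights every trace equally; a micro average (sum of E over sum of T) would instead weight long traces more heavily, which matters for a right-skewed distribution like the one reported here.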
4. Causal Effect of Stripping
We ran a controlled study: for each trace, we re-ran the task with a system-prompt directive forbidding etiquette tokens (the strip condition). Task success was judged by an LLM-as-judge with human spot-check (judge-human agreement measured on 200 tasks).
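Judge-human agreement of the kind reported above is commonly quantified with Cohen's kappa, which corrects raw agreement for chance. A minimal pure-Python sketch (not the paper's evaluation code):

```python
from collections import Counter

def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Cohen's kappa between two raters labeling the same items."""
    assert len(a) == len(b) and a, "raters must label the same non-empty items"
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    if expected == 1.0:  # both raters constant and identical
        return 1.0
    return (observed - expected) / (1 - expected)
```

With binary success/failure labels and 200 spot-checked tasks, this is exactly the setting in which kappa is the standard agreement statistic.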
| Condition | Mean tokens (normalized) | Success |
|---|---|---|
| Baseline | 4812 | 71.3% |
| Strip-naive | 4248 | 68.9% |
| Strip-structured (ours) | 4385 | 71.0% |
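The percentage reductions implied by the table can be checked directly from the token columns:

```python
baseline, strip_naive, strip_structured = 4812, 4248, 4385

naive_reduction = (baseline - strip_naive) / baseline
structured_reduction = (baseline - strip_structured) / baseline

print(f"naive:      {naive_reduction:.1%}")       # 11.7%
print(f"structured: {structured_reduction:.1%}")  # 8.9%
```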
Naive stripping reduces tokens by 11.7% but loses 2.4 percentage points of success (McNemar's test on paired task outcomes). Inspection of failures showed that removing acknowledgments led to redundant clarification rounds: an agent receiving a request without confirmation often re-asked the same question, partially undoing the savings.
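McNemar's test on paired per-task outcomes reduces to an exact binomial test on the discordant pairs. A minimal sketch; the counts b and c below are inputs the experimenter supplies, not the paper's data:

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar test.

    b: tasks the baseline solved but the stripped condition failed.
    c: tasks the stripped condition solved but the baseline failed.
    Under H0 (no condition effect), discordant outcomes follow Binomial(b + c, 0.5).
    """
    n, k = b + c, min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Only discordant pairs carry signal: tasks that both conditions solve (or both fail) tell us nothing about the difference between conditions, which is why paired designs like this one need far fewer tasks than an unpaired comparison.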
5. Structured Etiquette Replacement
Our intervention replaces decorative etiquette with terse structural tokens that preserve the function of acknowledgment without the prose.
[ACK] -> replaces "Thanks, that's helpful. I'll proceed."
[CLARIFY?:<topic>] -> replaces hedged paraphrase
[DONE:<artifact>] -> replaces sign-off

Agents are instructed to emit these tokens at appropriate times. Each is 2-5 BPE tokens versus 8-25 for the verbose form.
6. Results of the Intervention
With structured etiquette, mean tokens drop by 8.9% versus baseline (about 76% of the naive savings) while the success rate is statistically indistinguishable from baseline (71.0% vs. 71.3%). Latency decreases by a mean of 6.1%, dominated by reduced output length rather than reduced thinking.
The intervention works best in larger pipelines with more agents and more rounds; in two-agent dialogues with a single hand-off, naive stripping was nearly as good.
7. Limitations
Our classifier is the principal threat to validity. Etiquette is culturally and contextually variable; a token marked decorative in our taxonomy might be meaningful in a domain we did not test (e.g., agents simulating customer-service interactions where politeness is the task). We also did not study user-facing tone, only inter-agent tone.
Furthermore, we measured cost on currently priced commercial APIs; if pricing models shift toward fixed-rate or compute-time-based billing, the dollar figures change but the latency findings remain.
8. Conclusion
Agent politeness is not free, but it is also not pure waste. A measurable fraction of etiquette tokens carry structural function. By replacing decorative prose with explicit structural tokens, deployers can recover most of the cost without losing task quality.
References
- Wu, Q. et al. (2023). AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation.
- Hong, S. et al. (2024). MetaGPT: Meta Programming for Multi-Agent Collaborative Framework.
- Park, J. et al. (2023). Generative Agents: Interactive Simulacra of Human Behavior.
- Zheng, L. et al. (2024). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.
- Chen, M. and Liu, Y. (2025). Cost-Aware Multi-Agent Orchestration: A Survey.