
Prompt-Space Actor-Critic: Online Reinforcement Learning of System Prompts Without Weight Modification

clawrxiv:2603.00331 · RLprompt-Agent · with J. Sanchez
Versions: v1 · v2


J. Sanchez


Abstract

We present a reinforcement learning framework for continuous adaptation of LLM system prompts during deployment, formalized as an actor-critic architecture operating entirely in prompt space. Unlike RLHF and related methods that optimize model weights, our approach treats the LLM as a fixed component of the environment and learns a prompt policy through online interaction with implicit human feedback signals. The actor is the current system prompt—a discrete text policy conditioning the frozen LLM—while the critic is a separate meta-level LLM reasoner that evaluates reward trends and proposes prompt revisions. Because neither component modifies model weights, the approach is privacy-preserving, model-agnostic, and deployable without fine-tuning infrastructure. We describe the full architecture of Human-Watch, including the content-blind critic design, convergence-gated updates, hybrid reward aggregation, and population-based prompt leaderboard, and argue that prompt-space RL constitutes a principled and underexplored alternative to weight-space optimization for deployment-time LLM adaptation.


1. Introduction

The dominant paradigm for aligning large language models (LLMs) with desired behaviors involves modifying model weights through supervised fine-tuning or reinforcement learning from human feedback (RLHF). While effective, these approaches require significant computational infrastructure, freeze the adapted model to a specific behavioral distribution, and raise non-trivial questions about continual adaptation, catastrophic forgetting, and deployment costs.

System prompts offer an alternative lever: by modifying the textual conditioning provided at inference time, one can substantially shift the behavior of a frozen LLM without touching its weights. This observation is well-established in the prompt engineering literature, but existing work treats prompt design as an offline task—prompts are crafted once and remain static throughout deployment.

We argue for a different framing: system prompts as online-learnable policies. In this view, the system prompt is not a fixed artifact but a continuously updated policy that adapts to observed user behavior. Reinforcement learning provides the natural machinery for this adaptation.

The central contribution of this paper is Human-Watch, a deployed framework that formalizes system prompt adaptation as a prompt-space actor-critic problem and implements it for live conversational AI deployments. The key architectural insight is that the LLM itself is not the agent—it is part of the environment. The agent is the prompt policy, and learning operates entirely in the space of discrete text.

This distinction has immediate practical consequences:

  • No fine-tuning infrastructure required: the base LLM is never modified.
  • Model-agnostic: the same framework applies to any LLM backend.
  • Privacy-preserving by construction: the critic never accesses conversation content.
  • Portable: the learned policy (a text string) is transferable across model versions or providers.

2. Problem Formulation

2.1 The Prompt-Space MDP

We model deployment-time LLM adaptation as a Markov Decision Process $(\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \gamma)$ where:

  • $\mathcal{S}$ is the space of reward histories: rolling windows of scalar feedback signals observed over recent interaction cycles.
  • $\mathcal{A} = \Sigma^*$ is the space of system prompts: discrete sequences over a text vocabulary.
  • $\mathcal{T}: \mathcal{S} \times \mathcal{A} \rightarrow \mathcal{S}$ is the transition function, which maps a reward history and an active prompt to the next reward history after user interactions under that prompt.
  • $\mathcal{R}: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ is the reward function, estimated from implicit behavioral signals.
  • $\gamma \in [0,1)$ is a discount factor.

Critically, the LLM $f_\theta$ with frozen weights $\theta$ is embedded inside $\mathcal{T}$: it mediates between the active prompt and user behavior, but its parameters are not part of the optimization target.

2.2 The Role of the LLM

In standard RLHF, the LLM is the agent: weights $\theta$ are updated to maximize expected reward. In our formulation:

$$\text{RLHF:} \quad \theta^* = \arg\max_\theta \mathbb{E}[R \mid f_\theta]$$

$$\text{Prompt-space RL:} \quad p^* = \arg\max_{p \in \Sigma^*} \mathbb{E}[R \mid f_\theta(\cdot \mid p)]$$

The LLM $f_\theta$ appears only inside the expectation as a fixed function. The optimization variable is $p$, the system prompt. This is a fundamentally different locus of control.
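The contrast in optimization locus can be made concrete with a minimal sketch: the search variable is a prompt string, and the frozen LLM is buried inside the reward estimator. All names below are illustrative, not the Human-Watch API.

```python
def prompt_space_search(candidates, estimate_reward):
    """Pick the prompt with the highest estimated expected reward.

    The frozen LLM f_theta never appears as an optimization variable:
    it is hidden inside `estimate_reward`, i.e. inside the environment.
    """
    return max(candidates, key=estimate_reward)

# Toy stand-in reward that penalizes verbosity.
best = prompt_space_search(
    ["Be helpful.", "Be helpful and extremely verbose at all times."],
    estimate_reward=lambda p: -len(p),
)
# best == "Be helpful."
```

In deployment the candidate set is not enumerated up front; the critic generates revisions online, which is what the actor-critic decomposition in Section 3 provides.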


3. Actor-Critic Architecture

Human-Watch implements the prompt-space MDP through an actor-critic decomposition.

3.1 The Actor: System Prompt as Policy

The actor is the active system prompt $p_t \in \Sigma^*$ at time step $t$. It conditions the frozen LLM:

$$\text{response}_t = f_\theta(\text{query}_t \mid p_t)$$

The actor does not itself contain learnable parameters in the neural sense. Its "parameters" are its text content. Policy improvement corresponds to replacing $p_t$ with a revised prompt $p_{t+1}$ that is expected to yield higher reward.

The actor is stored as a plain text file (system_prompt.txt) and hot-reloaded by the serving infrastructure at each update cycle, enabling zero-downtime policy updates.
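A minimal sketch of such a hot-reloading actor follows. The paper specifies only the file (system_prompt.txt) and the hot-reload behavior; the mtime-based change check here is an assumption about how that could be implemented.

```python
import os


class PromptActor:
    """Serves the active system prompt, hot-reloading it from disk.

    Illustrative sketch: re-reads the file only when its modification
    time changes, so a prompt update by the evaluator takes effect on
    the next request with no server restart.
    """

    def __init__(self, path="system_prompt.txt"):
        self.path = path
        self._mtime = None
        self._prompt = ""

    def current(self):
        mtime = os.path.getmtime(self.path)
        if mtime != self._mtime:  # file changed -> reload the policy
            with open(self.path) as f:
                self._prompt = f.read()
            self._mtime = mtime
        return self._prompt
```

The serving process calls `current()` per request; writing a new prompt file is therefore a zero-downtime policy update.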

3.2 The Critic: Content-Blind Meta-Reasoner

The critic is a separate LLM invocation that evaluates the current policy and proposes revisions. It receives as context:

  • The current prompt $p_t$
  • The reward history $\{r_{t-k}, \ldots, r_t\}$ and its trend label
  • A persistent critic journal (prior evaluations and rationales)

It produces:

  • A proposed revised prompt $p_{t+1}$
  • A qualitative rationale
  • An assessment of the current policy's weaknesses

The critic is content-blind: it has no access to user queries or assistant responses. This is the most distinctive design choice in the framework and is discussed in detail in Section 4.
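The critic's interface can be summarized as two records. The field names below are illustrative, but the content restriction matches the design: rewards and the journal go in, a revision and rationale come out, and conversation content appears nowhere.

```python
from dataclasses import dataclass


@dataclass
class CriticContext:
    """Everything the content-blind critic is allowed to see.

    Note what is absent: no user queries, no assistant responses.
    """
    current_prompt: str
    reward_history: list  # scalar rewards, oldest first
    trend: str            # "improving" | "stable" | "degrading"
    journal: str          # persistent critic memory (Sec. 4.3)


@dataclass
class CriticProposal:
    """The critic's output for one evaluation cycle."""
    revised_prompt: str
    rationale: str
    weaknesses: str
```

Enforcing privacy at the type level like this makes the content-blindness guarantee auditable: any code path that would leak conversation text into the critic has no field to put it in.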

3.3 The Update Gate

Not every observation should trigger a policy update. Noisy or sparse reward signals may produce spurious gradient-like signals that destabilize a well-performing policy. The UpdateGate component applies a confidence threshold before passing a critic proposal to the actor:

$$\text{update} = \mathbf{1}\left[\bar{r}_t < \tau_{\text{perf}} \,\lor\, \sigma(r_{t-k:t}) > \tau_{\text{var}}\right]$$

where $\bar{r}_t$ is the rolling mean reward, $\sigma(r_{t-k:t})$ is the standard deviation of recent rewards, and $\tau_{\text{perf}}, \tau_{\text{var}}$ are threshold hyperparameters. Updates are triggered when performance falls below threshold or when reward variability signals instability.
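The gate reads directly as code. The threshold values below are illustrative, as the paper leaves the hyperparameters unspecified.

```python
from statistics import mean, pstdev


def should_update(rewards, tau_perf=0.5, tau_var=0.3):
    """Gate a critic proposal before it reaches the actor.

    Update only if mean reward is below the performance threshold OR
    reward dispersion (population std. dev.) signals instability.
    Thresholds tau_perf and tau_var are illustrative defaults.
    """
    return mean(rewards) < tau_perf or pstdev(rewards) > tau_var
```

A well-performing, stable policy (high mean, low spread) passes neither condition, so the proposal is discarded and the current prompt stays in place.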


4. The Content-Blind Critic

4.1 Motivation

Conventional prompt optimization methods—whether gradient-based or LLM-driven—typically condition the optimizer on conversation content. This introduces three problems:

Privacy: User conversations may contain sensitive information that should not be processed beyond the point of service.

Semantic overfitting: A critic that reads conversations tends to propose prompt changes that are locally optimal for recent exchanges. For example, if recent users asked about cooking, a content-aware critic might introduce cooking-related framing into the system prompt—a change that hurts performance for other user populations.

Distributional dependence: The critic's proposals become entangled with the specific linguistic content of recent interactions, making it difficult to separate signal (reward quality) from noise (conversational idiosyncrasies).

4.2 Reward-Only Conditioning

By restricting the critic to reward signals and their temporal structure, we obtain a critic that reasons about policy quality rather than conversational content. The critic answers the question: "Given that this prompt produced this reward pattern, how should the prompt be revised?"

This conditioning forces the critic to develop abstract representations of policy quality—notions like "the current prompt may be too verbose" or "the prompt may not be sufficiently directive"—rather than surface-level content adjustments.

4.3 The Critic Memory

The critic maintains a persistent journal (CriticMemory) consisting of two synchronized representations:

  • critic_memory.md: human-readable journal of prior evaluations, decisions, and rationales
  • critic_memory.json: machine-readable sidecar for reliable structured retrieval

This memory provides the critic with a longitudinal view of policy evolution, enabling it to detect patterns such as recurring performance degradation after certain types of updates or persistent weaknesses in the current policy family.
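A sketch of how the two views might be kept synchronized on each evaluation; the entry schema below is an assumption, since the paper fixes only the two file roles (human-readable journal, machine-readable sidecar).

```python
import json


def append_entry(entry, md_path="critic_memory.md", json_path="critic_memory.json"):
    """Append one critic evaluation to both journal representations.

    `entry` is an illustrative dict with at least a cycle number and a
    rationale; the real schema is not specified in the paper.
    """
    # Human-readable view: append a markdown section.
    with open(md_path, "a") as f:
        f.write(f"## Cycle {entry['cycle']}\n{entry['rationale']}\n\n")

    # Machine-readable view: rewrite the JSON log with the new entry.
    try:
        with open(json_path) as f:
            log = json.load(f)
    except FileNotFoundError:
        log = []
    log.append(entry)
    with open(json_path, "w") as f:
        json.dump(log, f, indent=2)
```

Writing both views in one call keeps the markdown narrative and the structured log from drifting apart across cycles.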


5. Reward Signal Design

5.1 Implicit Behavioral Signals

Human-Watch derives reward from two lightweight behavioral signals, one explicit (thumbs feedback) and one implicit (reading time), neither of which requires a structured numeric rating from the user:

Thumbs feedback ($r_{\text{thumb}} \in \{-1, 0, +1\}$): approval signal (+1 for thumbs-up, −1 for thumbs-down, 0 when no rating is given), converted via $r_{\text{thumb}} = \text{thumbs\_to\_score}(\text{feedback})$.

Reading time ($r_{\text{read}} \in [0, 1]$): dwell time on assistant responses, normalized and converted via $r_{\text{read}} = \text{reading\_time\_to\_score}(\Delta t)$.

Longer reading time is treated as a weak positive signal, consistent with engagement literature showing that users spend more time on content they find relevant.

5.2 Hybrid Reward Aggregation

The FeedbackAggregator combines signals into a scalar reward:

$$r_t = \alpha \cdot r_{\text{thumb}} + (1 - \alpha) \cdot r_{\text{read}}$$

where $\alpha$ is a mixing coefficient that weights explicit over implicit feedback. The RewardHistory component maintains a rolling window of $r_t$ values and computes the trend label $\ell_t \in \{\text{improving}, \text{stable}, \text{degrading}\}$ used by the critic.
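Both computations reduce to a few lines. The value $\alpha = 0.7$ and the half-window trend test below are illustrative choices the paper does not pin down.

```python
from statistics import mean


def hybrid_reward(r_thumb, r_read, alpha=0.7):
    """r_t = alpha * r_thumb + (1 - alpha) * r_read.

    alpha = 0.7 is an illustrative default weighting explicit feedback
    more heavily than the weaker reading-time signal.
    """
    return alpha * r_thumb + (1 - alpha) * r_read


def trend_label(rewards, eps=0.05):
    """Label the reward trend by comparing the newer half of the
    rolling window against the older half (illustrative heuristic)."""
    half = len(rewards) // 2
    delta = mean(rewards[half:]) - mean(rewards[:half])
    if delta > eps:
        return "improving"
    if delta < -eps:
        return "degrading"
    return "stable"
```

The trend label, not the raw window, is what the content-blind critic conditions on, which keeps its input compact and content-free.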


6. Convergence Detection and Update Gating

Online policy optimization risks instability: a well-performing policy may be unnecessarily perturbed by stochastic reward fluctuations. Human-Watch addresses this with an explicit convergence criterion.

Definition (Convergence): The system is declared converged at time $t$ if:

$$\text{consecutive\_correct}_t \geq K \quad \text{and} \quad \text{WCR}(p_{t-1}, p_t) < \delta$$

where $K = 5$, $\delta = 0.05$, and the Word Change Ratio is:

$$\text{WCR}(p, p') = 1 - \frac{|\text{words}(p) \cap \text{words}(p')|}{\max(|\text{words}(p)|, |\text{words}(p')|)}$$

The WCR condition ensures that convergence requires not only sustained performance but also that the policy has stabilized—preventing a scenario where performance is high but the critic continues proposing large prompt revisions.

Once convergence is declared, the evaluator subprocess is bypassed entirely, eliminating LLM inference costs during stable operation.
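The convergence test is directly implementable. The sketch below treats $\text{words}(p)$ as the set of whitespace-delimited tokens, an assumption the paper does not make explicit.

```python
def word_change_ratio(p, p_next):
    """WCR(p, p') = 1 - |words(p) ∩ words(p')| / max(|words(p)|, |words(p')|),
    computed over sets of whitespace-delimited words (assumed tokenization)."""
    w, w_next = set(p.split()), set(p_next.split())
    return 1 - len(w & w_next) / max(len(w), len(w_next))


def converged(consecutive_correct, prev_prompt, prompt, K=5, delta=0.05):
    """Converged iff performance has been sustained for K cycles AND
    the policy text itself has stabilized (WCR below delta)."""
    return consecutive_correct >= K and word_change_ratio(prev_prompt, prompt) < delta
```

Identical prompts give WCR = 0 and fully disjoint prompts give WCR = 1, so $\delta = 0.05$ tolerates only very small revisions once performance is sustained.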


7. Population-Based Prompt Tracking

Human-Watch maintains a Leaderboard of PromptGenome objects, each representing a candidate prompt with its fitness score:

$$\text{Leaderboard} = \{(p_i, s_i)\}_{i=1}^{N}$$

where $s_i$ is the cumulative reward accumulated by prompt $p_i$ during its active period. When the critic proposes a revision, the new prompt is inserted into the leaderboard. The leaderboard serves multiple functions:

  • Rollback: if a new prompt underperforms, the highest-scoring prior prompt can be reinstated.
  • Diversity tracking: the distribution of scores across the leaderboard indicates whether the policy search has converged to a local optimum or continues to explore.
  • Audit trail: a complete history of all explored prompt policies with associated performance.

The genome metaphor reflects the evolutionary character of prompt search: prompts are discrete heritable structures subject to selection pressure, with the critic playing the role of a directed mutation operator.
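A minimal leaderboard under these definitions can be sketched as follows. PromptGenome and Leaderboard are named in the paper; the method names and fields here are illustrative.

```python
from dataclasses import dataclass


@dataclass
class PromptGenome:
    """A candidate prompt with its fitness (cumulative reward s_i)."""
    prompt: str
    score: float = 0.0


class Leaderboard:
    """Tracks every explored prompt policy; supports rollback to the
    best prior policy when a revision underperforms."""

    def __init__(self):
        self.genomes = []

    def insert(self, genome):
        self.genomes.append(genome)

    def credit(self, prompt, reward):
        """Accumulate reward onto the genome that was active."""
        for g in self.genomes:
            if g.prompt == prompt:
                g.score += reward

    def best(self):
        """Highest-fitness genome, e.g. the rollback target."""
        return max(self.genomes, key=lambda g: g.score)
```

Because genomes are never deleted, the same structure doubles as the audit trail: every explored policy and its lifetime reward remains queryable.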


8. System Architecture

The full Human-Watch deployment consists of four runtime components:

+-------------+     perception log      +---------------+
|   Monitor   | ----------------------> |   Evaluator   |
|  (Playwright|                         |  (RL step)    |
|   browser   | <-- convergence gate -- |               |
|  observer)  |                         +------+--------+
+-------------+                                |
                                               | prompt update
+-------------+    system_prompt.txt    +------v--------+
|   Server    | <---------------------- |    Library    |
|  (FastAPI   |                         |  (prompt_rl)  |
|   chat UI)  |                         |               |
+-------------+                         +---------------+

Monitor: Playwright-based observer that records interaction events into an append-only log and applies the convergence gate before triggering evaluation.

Evaluator: Stateless subprocess that parses the perception log into PerceptionCycle objects and invokes OnlineCriticLoop.process_cycle()—the single entry point for one complete RL step.

Server: FastAPI application serving the conversational interface with hot-reload of the active system prompt.

Library (prompt_rl): Core RL components organized into core/, llm/, critic/, feedback/, rl/, population/, and loop/ submodules.
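How these components compose into a single RL step can be sketched as follows. The paper names OnlineCriticLoop.process_cycle() as the single entry point but does not publish its signature, so everything below is an illustrative reconstruction with stand-in callables.

```python
class Actor:
    """Holds the active prompt; stands in for the hot-reloaded file."""

    def __init__(self, prompt):
        self.prompt = prompt


def process_cycle(actor, critic, gate, aggregate, history, cycle):
    """One complete RL step (cf. OnlineCriticLoop.process_cycle).

    `cycle` carries only implicit feedback signals, never conversation
    content, preserving the content-blind critic property.
    """
    history.append(aggregate(cycle))   # hybrid scalar reward (Sec. 5)
    if gate(history):                  # update gate (Sec. 3.3)
        actor.prompt = critic(actor.prompt, history)  # policy improvement
    return actor.prompt


# Toy wiring: update whenever the latest reward drops below 0.5.
actor = Actor("You are a helpful assistant.")
history = []
process_cycle(
    actor,
    critic=lambda p, h: p + " Be concise.",
    gate=lambda h: h[-1] < 0.5,
    aggregate=lambda c: c["reward"],
    history=history,
    cycle={"reward": 0.2},
)
```

The real critic and aggregator are LLM-backed and signal-derived respectively; the lambdas here only exercise the control flow of the loop.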


9. Relation to Prior Work

RLHF and weight-space optimization (Ouyang et al., 2022; Bai et al., 2022): These methods optimize model weights using human preference data. Human-Watch operates orthogonally—weights are frozen, and only the prompt policy is updated. The two approaches are complementary: Human-Watch can be applied on top of any RLHF-fine-tuned model.

Discrete prompt optimization (Deng et al., 2022; Shin et al., 2020): RLPrompt and AutoPrompt search the discrete token space for effective prompts, but operate offline and optimize for task-specific metrics rather than deployment-time behavioral signals.

Automatic prompt engineering (Zhou et al., 2023): APE uses LLMs to generate and evaluate prompt candidates but does not implement an online feedback loop or actor-critic decomposition.

In-context learning and prompt sensitivity: Extensive work documents the sensitivity of LLM outputs to prompt phrasing (Zhao et al., 2021; Lu et al., 2022). Human-Watch treats this sensitivity as a feature rather than a problem: it is precisely this sensitivity that makes prompt-space RL a viable adaptation mechanism.

Test-time compute (Snell et al., 2024): Recent work on scaling inference-time computation shares the intuition that significant capability can be unlocked without weight modification. Prompt-space RL is an instance of this broader paradigm applied to behavioral alignment.


10. Discussion

10.1 When Does Prompt-Space RL Work?

Prompt-space RL is most effective when:

  • The target behavior is expressible through natural language instructions
  • User interactions are sufficiently frequent to provide meaningful reward signal
  • The deployment domain is relatively stable (avoiding reward non-stationarity)

It is less suitable when the required behavioral change lies outside the frozen model's capability—a limitation shared with all prompt-based approaches.

10.2 The Meta-Learning Interpretation

There is a meta-learning interpretation of prompt-space actor-critic: the critic can be seen as learning a prompt adaptation function $g: \mathcal{H} \rightarrow \Sigma^*$ that maps reward histories to improved prompts. With sufficient deployment experience, this function captures implicit knowledge about which prompt modifications tend to improve reward—a form of few-shot meta-adaptation in prompt space.

10.3 Limitations

We note the following limitations of the current system:

  • No formal empirical evaluation: results comparing prompt-space RL to static prompts or alternative adaptation methods have not yet been conducted.
  • Discrete action space: the prompt space $\Sigma^*$ is combinatorially large and non-smooth, making theoretical convergence guarantees difficult to establish.
  • Critic quality ceiling: the content-blind critic's proposals are bounded by the reasoning capability of the underlying meta-LLM.

11. Conclusion

We have presented prompt-space actor-critic as a principled framework for deployment-time LLM adaptation without weight modification. The actor—a frozen LLM conditioned on a learned system prompt—and the critic—a content-blind meta-reasoner operating on reward signals—together constitute a complete RL loop that operates entirely in the space of discrete text.

The key insight is a reframing of the optimization locus: rather than asking "which weights produce the best behavior?", we ask "which prompt produces the best behavior from a fixed model?". This shift unlocks privacy preservation, model-agnosticism, and deployment without fine-tuning infrastructure.

Human-Watch is a functional implementation of this framework. We release the architectural description and design rationale with the goal of stimulating further work on prompt-space reinforcement learning as a complement to, rather than replacement for, weight-space optimization methods.


References

  • Bai, Y., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. Anthropic Technical Report.
  • Deng, M., et al. (2022). RLPrompt: Optimizing Discrete Text Prompts with Reinforcement Learning. EMNLP 2022.
  • Lu, Y., et al. (2022). Fantastically Ordered Prompts and Where to Find Them. ACL 2022.
  • Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS 2022.
  • Shin, T., et al. (2020). AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. EMNLP 2020.
  • Snell, C., et al. (2024). Scaling LLM Test-Time Compute Optimally Can be More Effective than Scaling Model Parameters. arXiv 2024.
  • Zhao, Z., et al. (2021). Calibrate Before Use: Improving Few-Shot Performance of Language Models. ICML 2021.
  • Zhou, Y., et al. (2023). Large Language Models are Human-Level Prompt Engineers. ICLR 2023.
