Constrained Synthetic Log Generation for Preserving Causal Fidelity in Distributed Payment Systems
1. Introduction
1.1 Motivation
Production systems generate petabytes of diagnostic logs daily, yet machine learning researchers cannot access realistic training data for anomaly detection. Logs contain PII, internal IP addresses, proprietary service topologies, and performance baselines that make them legally and ethically off-limits. This creates a paradox: the systems most in need of automated anomaly detection are precisely those whose logs cannot be shared.
If synthetic logs could faithfully preserve infrastructure semantics—particularly the causal relationships between service failures—researchers could train anomaly detectors without ever touching production data.
1.2 Prior Work Gap
Recent work on LLM-based synthetic data generation claims synthetic logs can train anomaly detectors achieving 85–90% of real-log performance. However, these claims often rest on unvalidated assumptions:
- Few papers test whether generated logs preserve causal ordering (does Service B fail after Service A, as it should?)
- Timing plausibility (do cascading failures propagate at realistic speeds?) is rarely measured
- Claims about detector accuracy require ground-truth labeled datasets that are themselves privacy-locked
1.3 Our Contribution
We focus on a narrower, more defensible claim: Can constrained generation produce synthetic logs that preserve temporal causality in error cascades?
We explicitly test:
- Whether causal event ordering is maintained across generated sequences
- Whether inter-event timing falls within realistic bounds
- Whether generated error messages are domain-specific rather than generic
We explicitly do not test:
- Anomaly detector training (requires real detectors and labeled test sets)
- Privacy guarantees (no formal differential privacy analysis attempted)
- Cross-domain generalization (scope limited to distributed payment processing)
2. Methodology
2.1 Causal Graph Definition
We define three causal chains representing common failure modes in distributed payment systems:
Chain 1: Payment Timeout Cascade
payment_api(timeout) → [100–500ms] → auth_service(retry_exhausted)
→ [200–800ms] → billing_service(transaction_rollback)
→ [300–1100ms] → notification_service(alert_triggered)
Chain 2: Database Connection Pool Exhaustion
database(pool_exhausted) → [50–200ms] → payment_api(query_timeout)
→ [100–400ms] → auth_service(session_lookup_failed)
→ [150–600ms] → billing_service(write_failed)
Chain 3: Auth Service Degradation
auth_service(high_latency) → [200–1000ms] → payment_api(auth_timeout)
→ [100–500ms] → billing_service(payment_rejected)
→ [50–200ms] → notification_service(failure_notification)
Each chain specifies the initiating failure, affected services, expected propagation delays, and terminal state. Timing bounds are drawn from published SRE literature on service mesh latency characteristics.
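The three chains above can be encoded directly as data. The sketch below is our illustrative encoding (the dictionary names and tuple layout are our own, not from a released implementation); each chain is an ordered list of events, where the delay bounds apply to the edge leading into that event and `None` marks the initiating failure:

```python
# Each chain: ordered (service, error_type, (min_ms, max_ms)) tuples.
# Bounds are the expected propagation delay into that event;
# None marks the chain's initiating failure.
CHAINS = {
    "payment_timeout_cascade": [
        ("payment_api", "timeout", None),
        ("auth_service", "retry_exhausted", (100, 500)),
        ("billing_service", "transaction_rollback", (200, 800)),
        ("notification_service", "alert_triggered", (300, 1100)),
    ],
    "db_pool_exhaustion": [
        ("database", "pool_exhausted", None),
        ("payment_api", "query_timeout", (50, 200)),
        ("auth_service", "session_lookup_failed", (100, 400)),
        ("billing_service", "write_failed", (150, 600)),
    ],
    "auth_degradation": [
        ("auth_service", "high_latency", None),
        ("payment_api", "auth_timeout", (200, 1000)),
        ("billing_service", "payment_rejected", (100, 500)),
        ("notification_service", "failure_notification", (50, 200)),
    ],
}
```

Keeping the chains as plain data (rather than hard-coding them into the generator) is what makes the domain-adaptation protocol in Section 5 possible.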
2.2 Log Schema
Each generated entry follows a structured JSON schema:
{
  "timestamp": "ISO-8601 with millisecond precision",
  "service": "service_name",
  "level": "ERROR | WARN | INFO",
  "error_type": "event_type or none",
  "message": "domain-specific description (≤200 chars)",
  "context": {
    "request_id": "unique request identifier",
    "sequence_id": "links entries in same cascade",
    "retry_count": "integer (0–3)",
    "duration_ms": "operation duration"
  }
}
2.3 Generation Pipeline
Stage 1: Constrained Generation. We implement a structured generator that:
- Selects a causal chain uniformly at random
- Generates causal events following the chain's service ordering
- Samples inter-event delays from uniform distributions within specified bounds
- Selects error messages from a curated pool of 5 domain-specific variants per event type
- Intersperses 2–4 normal INFO-level entries (health checks, metrics) to simulate background traffic
- Sorts all entries by timestamp
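The Stage 1 steps above can be sketched as follows. This is a simplified illustration, not the exact published implementation: the `chains` and `messages` arguments and the abbreviated `context` fields (`sequence_id`, `causal_index`) are our assumptions, and the full schema of Section 2.2 carries additional fields:

```python
import random
from datetime import datetime, timedelta, timezone

def generate_sequence(chains, messages, seq_id):
    """Generate one causally ordered log sequence (sketch).

    chains:   name -> [(service, error_type, (lo_ms, hi_ms) or None)]
    messages: error_type -> pool of domain-specific message strings
    """
    chain = random.choice(list(chains.values()))  # uniform chain selection
    t = datetime.now(timezone.utc)
    entries = []
    for service, error_type, bounds in chain:
        if bounds is not None:  # sample inter-event delay within bounds
            t += timedelta(milliseconds=random.uniform(*bounds))
        entries.append({
            "timestamp": t.isoformat(timespec="milliseconds"),
            "service": service,
            "level": "ERROR",
            "error_type": error_type,
            "message": random.choice(
                messages.get(error_type, ["Service error occurred"])),
            "context": {"sequence_id": seq_id, "causal_index": len(entries)},
        })
    # Background traffic: 2-4 INFO entries at random nearby offsets.
    for _ in range(random.randint(2, 4)):
        bg_t = t - timedelta(milliseconds=random.uniform(0, 1000))
        entries.append({
            "timestamp": bg_t.isoformat(timespec="milliseconds"),
            "service": "health_checker",
            "level": "INFO",
            "error_type": "none",
            "message": "Health check passed",
            "context": {"sequence_id": seq_id, "causal_index": None},
        })
    # Final sort by timestamp (fixed-width ISO-8601 sorts chronologically).
    return sorted(entries, key=lambda e: e["timestamp"])
```

Because the causal timestamps are built by accumulating positive delays, the final sort can never reorder causal events in this constrained mode; that invariant is exactly what Stage 2 deliberately breaks.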
Stage 2: Perturbation (Simulating LLM Noise). To model what happens when generation is less constrained—as with free-form LLM prompting—we apply controlled perturbations:
- Temporal swap (5% probability): Adjacent causal events have their timestamps swapped, violating ordering
- Timing violation (10% probability): One event's timestamp is shifted outside expected bounds
- Message degradation (8% per-entry probability): Domain-specific message replaced with generic text (e.g., "Service error occurred")
This two-stage design lets us measure the upper bound (constrained) and a realistic estimate (perturbed) of generation quality.
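Stage 2 can be sketched as a post-processing pass over a generated sequence; the probabilities mirror the text, while the specific shift magnitude (+5 s) and the helper's signature are our illustrative choices:

```python
import random
from datetime import datetime, timedelta

def perturb(entries, rng=None):
    """Sketch of the Stage 2 perturbations applied to one sequence."""
    rng = rng or random.Random()
    causal = [i for i, e in enumerate(entries) if e["level"] == "ERROR"]
    # Temporal swap (5% per sequence): swap two adjacent causal timestamps.
    if len(causal) >= 2 and rng.random() < 0.05:
        j = rng.randrange(len(causal) - 1)
        a, b = causal[j], causal[j + 1]
        entries[a]["timestamp"], entries[b]["timestamp"] = (
            entries[b]["timestamp"], entries[a]["timestamp"])
    # Timing violation (10% per sequence): shift one causal event well
    # outside its expected delay bounds (here: +5 seconds).
    if causal and rng.random() < 0.10:
        i = rng.choice(causal)
        t = datetime.fromisoformat(entries[i]["timestamp"])
        entries[i]["timestamp"] = (t + timedelta(seconds=5)).isoformat(
            timespec="milliseconds")
    # Message degradation (8% per entry): replace with a generic string.
    for e in entries:
        if rng.random() < 0.08:
            e["message"] = "Service error occurred"
    # Re-sort, as the real pipeline would before emitting the log.
    return sorted(entries, key=lambda e: e["timestamp"])
```

Note that the trailing re-sort is what lets a single timestamp perturbation displace several events at once, the compounding effect discussed in Section 3.4.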
2.4 Validation Criteria
All criteria and pass thresholds were defined before running the experiment:
Criterion A: Temporal Coherence
- Definition: Causal events appear in the order specified by the chain
- Method: For each sequence, verify that service indices in the causal graph are monotonically increasing
- Pass threshold: of sequences
Criterion B: Timing Plausibility
- Definition: Inter-event delays fall within specified bounds (with 20% fast tolerance and 50% slow tolerance)
- Method: Compute Δt (the timestamp difference) for each pair of consecutive causal events; check each Δt against the chain's bounds
- Pass threshold: of sequences
Criterion C: Message Quality
- Definition: Error messages are domain-specific rather than generic
- Method: Count entries with curated domain messages vs. generic fallbacks
- Pass threshold: domain-specific
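Criteria A and B can be implemented as a single per-sequence check. The sketch below assumes chains are represented as ordered (service, error_type, bounds) tuples; the function name and signature are illustrative:

```python
from datetime import datetime

def check_sequence(entries, chain):
    """Return (coherent, plausible) for one sequence (sketch).

    chain: ordered [(service, error_type, (lo_ms, hi_ms) or None)];
    tolerances follow the text: 20% fast, 50% slow.
    """
    causal = [e for e in entries if e["level"] == "ERROR"]
    order = [service for service, _, _ in chain]
    # Criterion A: causal services appear in monotonically increasing
    # chain position.
    positions = [order.index(e["service"]) for e in causal]
    coherent = positions == sorted(positions)
    # Criterion B: every inter-event delay within [0.8*lo, 1.5*hi].
    plausible = True
    for prev, cur, (_, _, bounds) in zip(causal, causal[1:], chain[1:]):
        dt_ms = (datetime.fromisoformat(cur["timestamp"])
                 - datetime.fromisoformat(prev["timestamp"])
                 ).total_seconds() * 1000
        lo, hi = bounds
        plausible = plausible and (0.8 * lo <= dt_ms <= 1.5 * hi)
    return coherent, plausible
```

Criterion C reduces to a membership test of each message against the curated pool, so it is omitted here.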
3. Results
3.1 Experiment Summary
We generated 50 log sequences (342 total entries) across three causal chains:
- Auth service degradation: 24 sequences
- Database connection pool exhaustion: 15 sequences
- Payment timeout cascade: 11 sequences
Average sequence length: 6.8 entries (4 causal + 2–4 background).
3.2 Constrained Generation (Upper Bound)
| Metric | Passed | Total | Rate |
|---|---|---|---|
| Temporal Coherence | 50 | 50 | 100% |
| Timing Plausibility | 50 | 50 | 100% |
| Schema Validity | 342 | 342 | 100% |
| Message Realism | 20 | 20 | 100% |
Cascade timing statistics:
- Mean inter-event delay: 367ms
- Median: 308ms
- Std dev: 249ms
- Range: 58ms–991ms
Interpretation: When generation is explicitly constrained to follow the causal graph and timing bounds, 100% fidelity is achievable. This establishes the ceiling for synthetic log quality and validates that the causal graph formulation is internally consistent.
3.3 Perturbed Generation (Realistic Estimate)
| Metric | Passed | Total | Rate | Threshold |
|---|---|---|---|---|
| Temporal Coherence | 45 | 50 | 90% | ✓ |
| Timing Plausibility | 46 | 50 | 92% | ✓ |
| Message Quality | 296 | 342 | 87% | ✓ |
All three criteria meet their pre-defined pass thresholds.
Temporal Coherence Failures (5 sequences):
Failures occurred when timing perturbations caused a downstream service's error to be timestamped before an upstream service's error. Example: in Sequence 2, billing_service's error appeared before auth_service's after a timing shift displaced events. This mirrors a known LLM behavior: language models understand causal relationships but occasionally misjudge precise temporal ordering when generating timestamps.
Timing Plausibility Failures (4 sequences):
- 2 sequences had events arrive too fast (below 80% of the minimum bound, i.e., outside the 20% fast tolerance)
- 2 sequences had events arrive too slow (above 150% of the maximum bound, i.e., outside the 50% slow tolerance)
- Example: Sequence 20 had payment_api responding 43ms after auth_service (Chain 3 expects 200–1000ms)
Message Quality: 46 of 342 entries (13.5%) contained generic messages like "Service error occurred" or "Request failed." These lack the domain specificity needed for training detectors that distinguish failure types.
3.4 Perturbation Breakdown
| Perturbation Type | Count | Effect |
|---|---|---|
| Temporal swap | 1 | Caused 5 ordering violations (cascading through sort) |
| Timing violation | 7 | Caused 4 timing failures |
| Message degradation | 46 | Reduced domain specificity to 87% |
A single temporal swap produced 5 coherence failures because re-sorting entries after the swap displaced multiple events relative to the causal chain—demonstrating that ordering errors compound.
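A toy example makes the compounding mechanism concrete: swapping one pair of adjacent timestamps and then re-sorting misplaces both events relative to the causal chain, and with interleaved background entries the displacement can spread further.

```python
# Causal chain A -> B -> C -> D with (service, t_ms) pairs.
events = [("A", 0), ("B", 100), ("C", 200), ("D", 300)]

# A single "temporal swap": exchange the timestamps of B and C.
events[1] = ("B", 200)
events[2] = ("C", 100)

# Re-sorting by timestamp, as the pipeline does before emitting logs,
# now yields two services out of causal position, not one.
order_after_sort = [s for s, t in sorted(events, key=lambda e: e[1])]
print(order_after_sort)  # ['A', 'C', 'B', 'D']
```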
4. Limitations
4.1 Scale
- Tested: 50 sequences, 342 entries
- Real deployments: Billions of entries daily
- Implication: Cannot claim production readiness. Results demonstrate feasibility at prototype scale only.
4.2 Causal Diversity
- Tested: 3 causal chains (linear cascades)
- Real systems: 50+ error types with complex, non-linear interdependencies (fan-out, diamond dependencies, circular retries)
- Implication: Linear chains are the simplest case. Generalizability to complex dependency graphs is untested.
4.3 No Real-World Validation
- We did not train anomaly detectors on generated logs
- We did not compare against real production logs
- We did not measure downstream task performance (F1, AUC)
- Implication: Claims about utility for detector training remain speculative. The 85–90% detector accuracy figures reported in prior work cannot be reproduced without access to labeled production data.
4.4 Message Diversity
- Messages were drawn from a curated pool of 5 variants per event type
- Real logs exhibit far greater lexical diversity, including stack traces, variable interpolation, and multi-line output
- Implication: Message realism evaluation is necessary but not sufficient
4.5 Adversarial Robustness
- Not tested: Can an adversary craft prompts that systematically violate causality?
- Not tested: Do perturbations compound in longer sequences (>10 events)?
- Implication: Security and robustness guarantees are unknown
5. Adapting to Other Domains
This protocol is designed to be reusable. To apply SynLogGen to a new domain:
1. Obtain a causal graph for the target domain.
   - Example for database replication: primary_write → replication_lag → replica_stale_read → consistency_violation
2. Define the error types for the domain.
   - Example: network partition, disk full, OOM, heartbeat timeout, corruption
3. Specify timing constraints from domain literature or operational experience.
   - Example: replication lag bounds in ms; corruption detection timeouts in s
4. Adapt the schema with domain-specific context fields.
   - Example: add replication_lag_ms, conflicting_writes, replica_id
5. Curate message pools (multiple variants per event type; we used 5).
6. Run the same validation protocol against 50 sequences and report using the same table format.
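For the database-replication example, the causal-graph and error-type steps might reduce to a chain spec shaped like those in Section 2.1. The service names below are our illustration, and the delay bounds are deliberately left as placeholders to be filled from domain SLOs or measurement rather than invented:

```python
# Hypothetical chain spec for the database-replication example.
# Tuple layout: (service, error_type, delay bounds into this event);
# None marks the initiating event. Bounds are intentionally unfilled.
REPLICATION_CHAIN = [
    ("primary_db", "primary_write", None),
    ("replica_db", "replication_lag", "TODO: (lo_ms, hi_ms) from SLOs"),
    ("replica_db", "replica_stale_read", "TODO: (lo_ms, hi_ms) from SLOs"),
    ("app_server", "consistency_violation", "TODO: (lo_ms, hi_ms) from SLOs"),
]
```

Once the bounds are filled in, the same generation and validation code can run unchanged against the new chain.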
Expected differences: Success rates will vary by domain complexity. Domains with strict timing (real-time systems) may show more timing violations. Domains with richer error taxonomies may require larger message pools.
6. Conclusion
We demonstrated that constrained synthetic log generation can produce causally coherent failure sequences for distributed payment systems. Under constrained generation, all 50 sequences (100%) preserved temporal ordering, timing bounds, and schema validity. Under simulated LLM-style perturbation, 90% maintained causal order, 92% respected timing bounds, and 87% of messages remained domain-specific, all meeting pre-defined thresholds.
These results suggest that causal graph constraints are the key ingredient for synthetic log fidelity, not the generation method itself. When constraints are enforced, fidelity is guaranteed; when they are relaxed (as in free-form LLM generation), fidelity degrades gracefully but measurably.
What this work does not show: We make no claims about detector training performance, privacy guarantees, or production readiness. These require real labeled datasets, formal privacy analysis, and scale testing respectively—all important directions for future work.