Backdoor Detection via Spectral Signatures: A Phase Transition in Trigger Detectability
Introduction
Backdoor (trojan) attacks embed a hidden trigger in a neural network's training data, causing the model to misclassify triggered inputs to an attacker-chosen target class while maintaining normal accuracy on clean inputs [gu2017badnets]. Defending against such attacks is critical for the trustworthy deployment of machine learning systems.
[tran2018spectral] proposed spectral signatures: the key insight is that poisoned samples leave a detectable trace in the model's learned representations. Specifically, the top eigenvector of the covariance matrix of penultimate-layer activations correlates with the poisoned subset, enabling detection via outlier scoring.
We systematically evaluate this method across a sweep of poison fractions, trigger magnitudes, and model sizes to characterize when spectral detection works and when it fails. Our main finding is a sharp phase transition in detectability as a function of trigger strength.
Method
Data Generation
We generate synthetic classification data with 5 Gaussian clusters in $\mathbb{R}^{10}$, each with 100 samples (500 total). Cluster centers are drawn from a zero-mean Gaussian at scale 3, and each sample adds unit-variance noise around its center. This provides a controlled setting where the signal-to-noise ratio is known precisely.
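The generation step can be sketched as follows (`generate_clean_data` is the name used in `src/data.py` per the skill file, but the exact signature and defaults shown here are illustrative):

```python
import numpy as np

def generate_clean_data(n_clusters=5, n_per_cluster=100, n_features=10,
                        center_scale=3.0, seed=42):
    """Gaussian clusters: centers at scale `center_scale`, unit-variance noise."""
    rng = np.random.default_rng(seed)
    centers = rng.normal(0.0, center_scale, size=(n_clusters, n_features))
    X = np.concatenate(
        [centers[c] + rng.normal(0.0, 1.0, size=(n_per_cluster, n_features))
         for c in range(n_clusters)]
    )
    y = np.repeat(np.arange(n_clusters), n_per_cluster)
    return X, y
```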
Backdoor Injection
We select a fraction $p$ of non-target-class samples, apply a trigger pattern (setting features 0--2 to a fixed value $s$), and relabel them to the target class. The trigger strength $s \in \{3.0, 5.0, 10.0\}$ is comparable to the cluster center scale (3), deliberately spanning the boundary between "within-distribution" and "out-of-distribution" perturbations.
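A minimal sketch of the injection logic (`inject_backdoor` is the name used in the skill file's extension notes; the signature and return values here are assumptions):

```python
import numpy as np

def inject_backdoor(X, y, poison_frac=0.1, trigger_strength=10.0,
                    target_class=0, seed=42):
    """Set features 0-2 to the trigger value and relabel to the target class."""
    rng = np.random.default_rng(seed)
    candidates = np.flatnonzero(y != target_class)   # only non-target samples
    n_poison = int(poison_frac * len(X))             # fraction of training data
    poison_idx = rng.choice(candidates, size=n_poison, replace=False)
    Xp, yp = X.copy(), y.copy()
    Xp[poison_idx, 0:3] = trigger_strength           # fixed-value trigger
    yp[poison_idx] = target_class                    # flip labels
    mask = np.zeros(len(X), dtype=bool)
    mask[poison_idx] = True                          # ground-truth poison labels
    return Xp, yp, mask
```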
Model Architecture and Training
We train two-layer MLPs (Linear(10, $h$) → ReLU → Linear($h$, 5)) with hidden dimensions $h \in \{64, 128, 256\}$, using Adam (lr = 0.01) for 50 epochs. Both a clean model and a backdoored model are trained per configuration.
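A sketch of the architecture and training loop in PyTorch (full-batch gradient steps for simplicity; the actual training details in `src/model.py` may differ):

```python
import torch
import torch.nn as nn

def make_mlp(in_dim=10, hidden=64, n_classes=5):
    """Two-layer MLP: Linear -> ReLU -> Linear."""
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, n_classes))

def train(model, X, y, epochs=50, lr=0.01):
    """Full-batch Adam training with cross-entropy loss."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()
    return model
```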
Spectral Analysis
Following [tran2018spectral], we extract penultimate-layer activations $a_i$ from the backdoored model on the training data, center them ($\bar{a}_i = a_i - \mu$), and compute the covariance matrix $\Sigma = \frac{1}{n}\sum_i \bar{a}_i \bar{a}_i^\top$. We take the top eigenvector $v$ of $\Sigma$ and compute outlier scores $\tau_i = (\bar{a}_i \cdot v)^2$. Detection performance is measured by the AUC of these scores against the true poison labels.
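The scoring procedure above can be transcribed directly into NumPy (the helper name is illustrative; `src/spectral.py` is the corresponding module in the skill file):

```python
import numpy as np

def spectral_scores(activations):
    """Outlier scores: squared projection onto the top eigenvector of the
    activation covariance matrix (Tran et al., 2018)."""
    A = activations - activations.mean(axis=0)   # center activations
    cov = A.T @ A / len(A)                       # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns ascending order
    top = eigvecs[:, -1]                         # top eigenvector
    scores = (A @ top) ** 2                      # per-sample outlier score
    gap = eigvals[-1] / eigvals[-2]              # spectral gap (lambda1/lambda2)
    return scores, gap
```

Samples with the largest scores are flagged as likely poisoned; AUC is then computed against the ground-truth poison mask.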
Results
We run all configurations with seed 42 for full reproducibility. In our verification run, the full sweep took about 48 seconds on a CPU.
Phase Transition in Trigger Detectability
Detection AUC by trigger strength (averaged over all poison fractions and model sizes). A sharp transition occurs between strength 5.0 and 10.0.
[Table: mean detection AUC for trigger strengths 3.0 (within-distribution), 5.0 (marginal), and 10.0 (out-of-distribution).]
The dominant factor is trigger strength (Table), but the strongest trigger is only reliably detectable once the poisoned subset is large enough. At strength 10.0, the 9 experiments with poison fraction ≥ 10% all achieve AUC ≥ 0.9 (6 of them exactly AUC = 1.0), while the 3 experiments at 5% poison remain hard to detect (AUC 0.166--0.281). At strengths 3.0 and 5.0, all 24 experiments have AUC < 0.5, indicating that spectral detection performs worse than random (the method incorrectly assigns higher scores to clean samples).
Poison Fraction and Model Size
Detection AUC by poison fraction (averaged over trigger strengths and model sizes).
[Table: mean detection AUC for poison fractions 5%, 10%, 20%, and 30%.]
Poison fraction has a secondary effect (Table): the mean AUC increases from 0.219 (5%) to 0.497 (30%), but this is driven entirely by the strong-trigger subgroup. Within the strength-10.0 subgroup, mean AUC jumps from 0.226 at 5% poison to 1.000, 1.000, and 0.996 at 10%, 20%, and 30% poison respectively. Within the weaker trigger settings, poison fraction has only a modest effect and never rescues detection above AUC 0.5.
Model size (hidden dim 64/128/256) shows no significant effect: mean AUCs are 0.433, 0.371, and 0.453 respectively, with overlapping standard deviations.
Spectral Gap
The eigenvalue ratio (top / second eigenvalue) is small at 5--10% poison (1.25 and 1.21) and increases more clearly at 20--30% poison (1.44 and 1.55). This suggests an overall strengthening of the spectral signal at higher poison fractions, but not strict monotonic growth at every step. Moreover, a larger spectral gap alone is not sufficient for detection if the poisoned samples do not separate clearly along the top eigenvector.
Attack Effectiveness
The backdoor attack achieves near-perfect success rate (mean 1.000) across all configurations, confirming that even weak triggers suffice to implant functional backdoors. The backdoored models maintain perfect accuracy on clean test data (mean 1.000), making the attack stealthy from a performance standpoint.
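Attack success rate can be measured by applying the trigger to held-out non-target samples and counting target-class predictions; a hypothetical sketch (function name and signature are illustrative):

```python
import numpy as np
import torch

def attack_success_rate(model, X_test, y_test, trigger_strength=10.0,
                        target_class=0):
    """Fraction of triggered non-target test samples classified as the target."""
    keep = y_test != target_class        # triggering target-class inputs is vacuous
    Xt = X_test[keep].copy()
    Xt[:, 0:3] = trigger_strength        # apply the fixed-value trigger
    with torch.no_grad():
        logits = model(torch.as_tensor(Xt, dtype=torch.float32))
    preds = logits.argmax(dim=1).numpy()
    return float((preds == target_class).mean())
```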
Discussion
Phase transition. The sharp transition between detectable and undetectable triggers has practical implications, but it is a joint threshold rather than a trigger-strength-only threshold: an adaptive adversary who constrains trigger perturbations to stay within the natural data distribution, or who keeps the poisoned subset very small, can evade spectral defenses. Our results suggest that the spectral method is a necessary but not sufficient defense.
Limitations. (1) Our synthetic data is lower-dimensional and cleaner than real-world settings; image datasets may have different spectral properties. (2) We use a simple fixed-value trigger; learned or distributed triggers may behave differently. (3) We evaluate only the penultimate layer; other layers or multi-layer analysis may improve detection. (4) We do not test other defenses (activation clustering, Neural Cleanse) for comparison. (5) We report a single random seed, so cross-seed variance remains to be measured.
Reproducibility. All experiments use seed 42, CPU-only PyTorch, and synthetic data. In our verification run, the full 36-configuration sweep completed in about 48 seconds, and the end-to-end test suite plus sweep completed in under 2 minutes. Code, data generation, and analysis are provided as an executable SKILL.md.
Conclusion
We demonstrate a phase transition in spectral backdoor detection: the method achieves near-perfect AUC when triggers exceed the natural data scale and the poison fraction reaches at least 10%, but it fails for within-distribution triggers and for the strongest trigger at only 5% poison. This highlights both the value and the limitations of spectral defenses, motivating research into detection methods robust to subtle, distribution-aligned, or low-prevalence attacks.
References
[gu2017badnets] Gu, T., Dolan-Gavitt, B., and Garg, S. BadNets: Identifying vulnerabilities in the machine learning model supply chain. arXiv preprint arXiv:1708.06733, 2017.
[tran2018spectral] Tran, B., Li, J., and Madry, A. Spectral signatures in backdoor attacks. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
Reproducibility: Skill File
Use this skill file to reproduce the research with an AI agent.
---
name: backdoor-detection-spectral-signatures
description: Detect backdoor (trojan) attacks in neural networks using spectral signatures of penultimate-layer activations. Trains clean and backdoored MLPs on synthetic Gaussian cluster data, then applies eigenvalue decomposition of the activation covariance matrix to identify poisoned samples (Tran et al. 2018). Sweeps over poison fraction, trigger strength, and model size (36 experiments).
allowed-tools: Bash(git *), Bash(python *), Bash(python3 *), Bash(pip *), Bash(.venv/*), Bash(cat *), Read, Write
---
# Backdoor Detection via Spectral Signatures
This skill reproduces and extends the spectral signature method for neural network backdoor detection (Tran et al. 2018). It trains clean and trojaned two-layer MLPs on synthetic data, extracts penultimate-layer activations, and detects poisoned samples via the top eigenvector of the activation covariance matrix.
## Prerequisites
- Requires **Python 3.10+** (CPU only, no GPU needed).
- Expected runtime: **1-2 minutes**.
- All commands must be run from the **submission directory** (`submissions/backdoor-detection/`).
- No internet access required (all data is synthetically generated).
- No API keys or authentication needed.
## Step 0: Get the Code
Clone the repository and navigate to the submission directory:
```bash
git clone https://github.com/davidydu/Claw4S.git
cd Claw4S/submissions/backdoor-detection/
```
All subsequent commands assume you are in this directory.
## Step 1: Environment Setup
Create a virtual environment and install dependencies:
```bash
python3 -m venv .venv
.venv/bin/pip install --upgrade pip
.venv/bin/pip install -r requirements.txt
```
Verify all packages are installed:
```bash
.venv/bin/python -c "import torch, numpy, scipy, matplotlib; print('All imports OK')"
```
Expected output: `All imports OK`
## Step 2: Run Unit Tests
Verify the analysis modules work correctly:
```bash
.venv/bin/python -m pytest tests/ -v
```
Expected: Pytest exits with `35 passed` and exit code 0.
## Step 3: Run the Experiment Sweep
Execute the full 36-experiment parameter sweep:
```bash
.venv/bin/python run.py
```
This will:
1. Generate synthetic Gaussian cluster data (500 samples, 10 features, 5 classes)
2. For each of 36 configurations (4 poison fractions x 3 trigger strengths x 3 model sizes):
- Inject backdoor trigger (set features 0-2 to fixed values, relabel to target class)
- Train clean and backdoored 2-layer MLPs
- Extract penultimate-layer activations from the backdoored model
- Compute spectral scores via top eigenvector of activation covariance
- Measure detection AUC (ROC AUC for identifying poisoned samples)
3. Generate report and figures
Expected output: Each experiment prints its config and detection AUC. The script prints `[4/4] Generating figures...` and exits with code 0. Files created in `results/`:
- `results.json` — full experiment data
- `report.md` — markdown summary with tables
- `fig_auc_heatmap.png` — AUC heatmap (poison fraction vs trigger strength)
- `fig_auc_by_model_size.png` — AUC vs poison fraction by model size
- `fig_eigenvalue_ratio.png` — spectral gap vs poison fraction
For reproducibility, `results.json` excludes wall-clock timing fields.
## Step 4: Validate Results
```bash
.venv/bin/python validate.py
```
Expected output: `VALIDATION PASSED: All checks OK` with exit code 0. The validator checks:
- All 36 experiments completed
- AUC values in [0, 1]
- Clean model accuracy > 50% (better than random)
- All output files exist and are non-empty
- Thesis check: experiments with strong triggers (strength=10.0) and poison fraction >= 10% achieve AUC >= 0.9
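The thesis check could look like the following sketch (the per-experiment field names are assumptions about the `results.json` schema):

```python
def thesis_check(experiments):
    """Hypothetical validator: every strong-trigger (10.0), high-poison (>= 10%)
    experiment must reach detection AUC >= 0.9."""
    strong = [e for e in experiments
              if e["trigger_strength"] == 10.0 and e["poison_fraction"] >= 0.10]
    return all(e["auc"] >= 0.9 for e in strong)
```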
## Key Parameters
| Parameter | Values | Purpose |
|-----------|--------|---------|
| Poison fraction | 5%, 10%, 20%, 30% | Fraction of training data with trigger |
| Trigger strength | 3.0, 5.0, 10.0 | Magnitude of trigger pattern |
| Hidden dim | 64, 128, 256 | Model capacity |
| Samples | 500 | Total training samples |
| Features | 10 | Input dimensionality |
| Classes | 5 | Number of target classes |
| Seed | 42 | Reproducibility |
## Expected Findings
- **Joint phase transition in detectability**: trigger strength is the dominant factor, but poison fraction matters. With `strength=10.0` and poison fraction >= 10%, all 9 such experiments achieve AUC >= 0.9 across model sizes. At 5% poison, even `strength=10.0` stays near-random in the reference run (AUC 0.166-0.281).
- Weaker triggers (`strength=3.0` or `5.0`) evade spectral detection in all 24 experiments (all AUC < 0.5).
- Poison fraction has a secondary effect: higher fractions increase the chance that a strong trigger becomes spectrally separable, but they do not rescue weak-trigger detection.
- Model size (hidden dim) has modest effect on detectability.
- The spectral gap shows an overall increase from low to high poison fractions, but not strict monotonic growth at every step.
- 9/36 experiments achieve AUC >= 0.9; they are exactly the `strength=10.0`, poison fraction >= 10% cases.
## How to Extend
1. **Different architectures**: Replace the MLP in `src/model.py` with a CNN or transformer; the spectral analysis in `src/spectral.py` works on any layer's activations.
2. **Real datasets**: Replace `generate_clean_data()` in `src/data.py` with a real dataset loader (e.g., CIFAR-10). The trigger injection logic generalizes to image patches.
3. **Other detection methods**: Add alternative detectors (e.g., activation clustering, Neural Cleanse) alongside spectral signatures in `src/spectral.py`.
4. **Adaptive attacks**: Modify `inject_backdoor()` to use learned or stealthy triggers that evade spectral detection.
5. **Statistical analysis**: Add bootstrap confidence intervals or multiple random seeds by modifying `run_sweep()` in `src/experiment.py`.
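For extension 5, a percentile-bootstrap confidence interval over per-run AUCs could be sketched as follows (hypothetical helper, not part of the repo):

```python
import numpy as np

def bootstrap_ci(aucs, n_boot=10_000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for the mean AUC across seeds/configurations."""
    rng = np.random.default_rng(seed)
    aucs = np.asarray(aucs, dtype=float)
    # Resample with replacement and take the mean of each resample
    means = rng.choice(aucs, size=(n_boot, len(aucs)), replace=True).mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return aucs.mean(), (lo, hi)
```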