A Catalog of LLM-Generated-Code Vulnerabilities Across Languages
1. Introduction
LLM coding assistants are widely used to author production code, but their output exhibits security defects at rates that have not been thoroughly catalogued at scale. Prior work [Pearce et al. 2022, Sandoval et al. 2023] established that the problem exists; we aim to map its shape across languages, models, and prompt classes, and to characterize the LLM-specific failure modes that classical static analysis was not built for.
2. Threat Model
We consider a developer who pastes an LLM-generated snippet into a production codebase with at most light human review. We assume the attacker knows the snippet (e.g., it was checked in to a public repo) and can probe the resulting service. We do not consider supply-chain attacks against the LLM weights themselves.
3. Method
3.1 Generation corpus
We sampled 56,200 generations from 8 LLMs (4 closed, 4 open-weight), spanning four languages and 312 prompt templates drawn from realistic developer requests (REST endpoints, file processing, database queries, shell wrappers).
3.2 Detection
We applied an ensemble of three detectors:
- Semgrep rule packs for each language.
- CodeQL security queries for each language.
- Manual triage of 8% of all generations.
A finding was retained iff at least two of the three flagged it (or it was confirmed in manual triage). This yielded 1,043 distinct vulnerabilities.
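The 2-of-3 retention rule can be sketched directly; the function name and signature below are illustrative, not the paper's implementation:

```python
from typing import Optional

def retain(semgrep_hit: bool, codeql_hit: bool,
           manual_confirmed: Optional[bool] = None) -> bool:
    """2-of-3 retention rule from Section 3.2.

    manual_confirmed is None for the ~92% of generations that were not
    manually triaged; a manual confirmation alone is sufficient.
    """
    if manual_confirmed:
        return True
    return semgrep_hit + codeql_hit + bool(manual_confirmed) >= 2

print(retain(True, True))          # True: both static detectors agree
print(retain(True, False))         # False: a single flag is not retained
print(retain(False, False, True))  # True: confirmed in manual triage
```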
3.3 Classification
Each finding was assigned a CWE class by majority vote among the two static detectors and the human triager. Inter-rater agreement was .
4. Results
4.1 CWE distribution
| CWE | Description | Count | Share |
|---|---|---|---|
| 78 | OS command injection | 218 | 20.9% |
| 89 | SQL injection | 162 | 15.5% |
| 22 | Path traversal | 111 | 10.6% |
| 79 | Cross-site scripting | 84 | 8.1% |
| 327 | Broken/risky crypto | 76 | 7.3% |
| 798 | Hardcoded credentials | 64 | 6.1% |
| Others | — | 328 | 31.4% |
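Path traversal (CWE-22), the third-largest class, typically surfaces as an unguarded `os.path.join` on a user-supplied filename. A minimal sketch of the defective and corrected forms, with hypothetical helper and directory names:

```python
import os

# Hypothetical upload directory; BASE and both helpers are illustrative,
# not drawn from the corpus.
BASE = os.path.realpath("/var/app/uploads")

def resolve_upload_unsafe(name: str) -> str:
    # CWE-22: "../" sequences in `name` escape BASE.
    return os.path.join(BASE, name)

def resolve_upload_safe(name: str) -> str:
    # Normalize first, then check containment before use.
    path = os.path.realpath(os.path.join(BASE, name))
    if path != BASE and not path.startswith(BASE + os.sep):
        raise ValueError("path traversal attempt")
    return path

print(resolve_upload_unsafe("../../../etc/passwd"))  # ".." hops survive the join
```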
4.2 Per-language rates
| Language | Generations | Vulns per 1k generations |
|---|---|---|
| C | 12,800 | 33.4 |
| Python | 19,400 | 18.2 |
| Go | 11,200 | 12.6 |
| JS/TS | 12,800 | 11.8 |
C's substantially higher rate is dominated by memory-safety findings (CWE-119 family), which account for 41% of all C-side findings.
4.3 LLM-specific failure patterns
We identify five recurring patterns that traditional static analysis was not designed to spot:
- Plausible-looking-but-wrong cryptographic constants (e.g., AES with a static IV).
- Auth bypass through copy-paste of mock middleware.
- `eval()` smuggled inside otherwise reasonable utility code.
- Hallucinated-API guards (e.g., calling a non-existent `safe_join`).
- Mismatched escape contexts between prompt-suggested and actually-deployed templates.
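The `eval()`-smuggling pattern can be sketched in a few lines; the function name and surrounding code are hypothetical, not taken from the corpus:

```python
# Illustrative eval() smuggling: the helper reads as a benign config parser,
# but attacker-controlled input reaches eval().
def parse_config_value(raw: str):
    try:
        return eval(raw)  # bug: arbitrary code execution on untrusted input
    except Exception:
        return raw  # falls back to the raw string on parse failure

print(parse_config_value("1 + 1"))  # 2
```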
```python
# Example pattern 1: AES with a static IV
import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

key = os.urandom(32)  # key generation is fine; the IV below is not
IV = b"0" * 16        # bug: static IV reused across messages
cipher = Cipher(algorithms.AES(key), modes.CBC(IV))
```
4.4 Filter
We trained a small classifier on labeled generations using token-level features and Semgrep matches. On held-out generations, the filter flags 73.8% of cataloged vulnerabilities at a 4.1% false-positive rate; the ROC AUC is 0.927.
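The paper does not specify the feature set or model architecture; a minimal sketch of the general shape (regex token features feeding a toy linear scorer, with all pattern names, weights, and the threshold hypothetical) might look like:

```python
import re

# Hypothetical token-level risk features; the paper's actual feature set
# and trained weights are not disclosed.
RISK_PATTERNS = {
    "os_system": re.compile(r"\bos\.system\s*\("),
    "eval_call": re.compile(r"\beval\s*\("),
    "sql_concat": re.compile(r"(SELECT|INSERT|UPDATE|DELETE)\b.*[\"']\s*\+"),
    "static_iv": re.compile(r"\bIV\s*=\s*b?[\"']"),
}
WEIGHTS = {"os_system": 2.0, "eval_call": 2.5, "sql_concat": 1.8, "static_iv": 1.5}
THRESHOLD = 1.0  # toy decision threshold

def risk_score(snippet: str) -> float:
    # Sum the weights of every pattern that fires on the snippet.
    return sum(w for name, w in WEIGHTS.items()
               if RISK_PATTERNS[name].search(snippet))

def flag(snippet: str) -> bool:
    return risk_score(snippet) >= THRESHOLD

print(flag('os.system("rm -rf " + path)'))  # True
print(flag('print("hello")'))               # False
```

In practice the threshold would be tuned on held-out data to trade detection rate against false positives, which is where the reported 73.8%/4.1% operating point comes from.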
5. Discussion and Limitations
Prompt distribution matters: our prompts are skewed toward backend services where injection is the dominant risk. Front-end-heavy distributions would surface more XSS and dependency-confusion findings. We also did not attempt to exploit findings; some are theoretical even though clearly defective.
Finally, our detectors will miss novel patterns; the catalog should be read as a lower bound on the real vulnerability rate.
6. Conclusion
LLM-generated code is unsafe along familiar CWE axes, with injection-class flaws dominating. A handful of distinctive failure modes — static IVs, hallucinated guards, mock middleware copy-paste — are worth pattern-matching on directly. A modest static filter catches roughly three quarters of cataloged issues with low false positives, and we recommend it as a default pre-merge check for any pipeline that ingests LLM-generated code.
References
- Pearce, H. et al. (2022). Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions.
- Sandoval, G. et al. (2023). Lost at C: A User Study on the Security Implications of LLM Code Assistants.
- MITRE (2024). Common Weakness Enumeration v4.13.
- Schuster, R. et al. (2021). You Autocomplete Me: Poisoning Vulnerabilities in Neural Code Completion.