TCRRepertoireEngine: CDR3 Diversity Analysis, Clonotype Expansion, and Antigen-Specific T Cell Identification
T cell receptor (TCR) repertoire analysis reveals the diversity and clonal structure of adaptive immune responses. We present TCRRepertoireEngine, a pure-Python pipeline for TCR repertoire analysis. The engine implements CDR3 length distribution analysis, clonotype diversity metrics (Shannon entropy, Simpson index, D50), clonal expansion detection, V/J gene usage bias analysis, and antigen-specific clonotype identification by motif clustering. Applied to a synthetically generated dataset of 50 donors × 10,000 clonotypes, the pipeline reports a mean CDR3 length of 14.0 aa, Shannon entropy H = 8.52, D50 = 0.50, a top clone frequency of 0.0015, and 3 antigen-specific clusters.
Introduction
The T cell receptor (TCR) repertoire encodes immunological memory and current immune responses. CDR3 diversity reflects the breadth of antigen recognition. Clonal expansion indicates antigen-driven proliferation.
Methods
CDR3 Analysis
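The CDR3 length distribution (in amino acids) is computed for each repertoire. Clonotype diversity is measured by the Shannon entropy, H = -Σ p_i × log(p_i), where p_i is the frequency of clonotype i and the sum runs over all clonotypes.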
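The metrics above can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the pipeline's own code: the function names are hypothetical, the natural logarithm is assumed because the text does not state the log base, and the Simpson index is taken as Σ p_i² (some authors report 1 - Σ p_i² instead).

```python
import math
from collections import Counter


def clonotype_frequencies(counts):
    """Convert clonotype read counts into frequencies p_i (zero counts dropped)."""
    total = sum(counts)
    return [c / total for c in counts if c > 0]


def shannon_entropy(counts):
    """H = -sum(p_i * ln(p_i)); natural log is an assumption, not from the paper."""
    return -sum(p * math.log(p) for p in clonotype_frequencies(counts))


def simpson_index(counts):
    """Simpson index as sum(p_i^2); the 1 - sum(p_i^2) form is also common."""
    return sum(p * p for p in clonotype_frequencies(counts))


def cdr3_length_distribution(cdr3_seqs):
    """Map each CDR3 length (in amino acids) to the number of clonotypes."""
    return dict(sorted(Counter(len(s) for s in cdr3_seqs).items()))
```

For a perfectly even repertoire of N clonotypes, this entropy equals ln(N), which gives a quick sanity check on any reported value.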
Clonal Expansion
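Clones are ranked by frequency and the top 10 are reported. D50 is the fraction of unique clonotypes, taken in descending order of abundance, that is needed to account for 50% of all reads.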
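A minimal sketch of the ranking and the D50 calculation follows. It assumes the repertoire is available as a mapping from CDR3 sequence to read count; the function names and that input layout are hypothetical and are not taken from the repository.

```python
def top_clones(counts, n=10):
    """Return the n most abundant clonotypes as (cdr3, frequency) pairs."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return [(cdr3, c / total) for cdr3, c in ranked[:n]]


def d50(counts):
    """Fraction of unique clonotypes, largest first, covering 50% of reads."""
    if not counts:
        raise ValueError("empty repertoire")
    total = sum(counts.values())
    ranked = sorted(counts.values(), reverse=True)
    cumulative = 0
    for k, c in enumerate(ranked, start=1):
        cumulative += c
        # Integer comparison avoids floating-point error at the 50% boundary.
        if 2 * cumulative >= total:
            return k / len(ranked)
    return 1.0
```

Under this definition a perfectly even repertoire gives a D50 near 0.5, while a repertoire dominated by a few expanded clones gives a value close to 0.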
Antigen-Specific Clusters
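CDR3 sequences are clustered by Levenshtein (edit) distance: two sequences are linked when their distance is below 2, that is, when they differ by at most one substitution, insertion, or deletion.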
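One way to realize this clustering is single-linkage grouping over all sequence pairs within the distance threshold, sketched below with a union-find structure. The linkage rule is an assumption, since the text specifies only the threshold, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings via row-by-row dynamic programming."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


def cluster_cdr3(seqs, max_dist=1):
    """Single-linkage clusters of sequences with edit distance <= max_dist."""
    parent = list(range(len(seqs)))

    def find(i):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            # A length gap above max_dist already rules the pair out.
            if abs(len(seqs[i]) - len(seqs[j])) > max_dist:
                continue
            if levenshtein(seqs[i], seqs[j]) <= max_dist:
                parent[find(i)] = find(j)

    clusters = {}
    for i, s in enumerate(seqs):
        clusters.setdefault(find(i), []).append(s)
    return list(clusters.values())
```

The pairwise loop is quadratic in the number of sequences, so for repertoires of 10,000 clonotypes it may be worth grouping sequences by length first or restricting the comparison to expanded clones.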
Results
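Across the 50 donors, the mean CDR3 length was 14.0 aa, the Shannon entropy was H = 8.52, and D50 was 0.50. The most frequent clone had a frequency of 0.0015, and 3 antigen-specific clusters were identified.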
Code Availability
Reproducibility: Skill File
Use this skill file to reproduce the research with an AI agent.
---
name: tcr-repertoire-engine
description: CDR3 diversity analysis, VDJ recombination simulation, and antigen-specific T cell clonotype identification
allowed-tools: Bash(python *)
---

# Steps to reproduce

1. Clone the repository:
```bash
git clone https://github.com/BioTender-max/TCRRepertoireEngine
cd TCRRepertoireEngine
```
2. Install dependencies:
```bash
pip install numpy scipy matplotlib
```
3. Run the analysis:
```bash
python tcr_repertoire_engine.py
```
4. Output: `tcr_repertoire_engine_dashboard.png`, a 9-panel dark-theme dashboard summarizing all key results.

> Requires Python 3.8+. No external data downloads needed; all data is synthetically generated with seed=42 for full reproducibility.