ImmunRepertoire: Pure Python TCR/BCR Immune Repertoire Analysis Engine
Abstract
We present ImmunRepertoire, a complete immune repertoire analysis pipeline implemented entirely in Python using only NumPy, SciPy, pandas, and matplotlib. ImmunRepertoire provides six analysis modules — CDR3 analysis, V/D/J gene usage, clonotype definition, diversity metrics, public clonotype detection, and visualization — without requiring TRUST4, MiXCR, VDJtools, immunarch, or any other external compiled binaries or R packages. The entire pipeline runs on CPU and produces a 6-panel PNG dashboard. We demonstrate on synthetic TRB repertoire data (500 clonotypes, 5,000 cells, 3 samples), recovering realistic diversity metrics and identifying public clonotypes.
Background
Immune repertoire sequencing (Rep-seq) profiles the diversity of T-cell receptor (TCR) and B-cell receptor (BCR) sequences in a sample. The CDR3 region — the hypervariable loop formed by V(D)J recombination — determines antigen specificity. Repertoire analysis quantifies clonal diversity, identifies expanded clones (indicative of antigen-driven responses), and detects public clonotypes shared across individuals (convergent recombination). Applications span tumor immunology, autoimmune disease, vaccine response, and infectious disease.
Methods
CDR3 Analysis
The CDR3 length distribution is computed over unique clonotypes. Amino acid composition is compared against background proteome frequencies. The mean CDR3 length and standard deviation are reported per chain type (typical values: TRA ~12 AA, TRB ~13 AA, IGH ~15 AA).
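The length and composition statistics described above can be sketched in a few lines. This is a minimal illustration, not the package's implementation: the toy sequences and all variable names are assumptions, and the comparison against background proteome frequencies is omitted because the background table is not given here.

```python
from collections import Counter

import numpy as np

# Hypothetical toy input: one CDR3 amino acid sequence per unique clonotype.
cdr3s = ["CASSLGQETQYF", "CASSPDRGYEQYF", "CASSLAGGTDTQYF", "CASSQETQYF"]

# Length statistics over unique clonotypes (not weighted by clone size).
lengths = np.array([len(s) for s in cdr3s])
mean_len = lengths.mean()
sd_len = lengths.std(ddof=1)  # sample standard deviation

# Length distribution: CDR3 length -> number of clonotypes.
length_dist = Counter(lengths.tolist())

# Amino acid composition pooled over all unique CDR3s.
aa_counts = Counter("".join(cdr3s))
total = sum(aa_counts.values())
aa_freq = {aa: n / total for aa, n in sorted(aa_counts.items())}

print(f"mean length: {mean_len:.2f} AA, SD: {sd_len:.2f} AA")
```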
V/D/J Gene Usage
Gene usage frequencies are computed at the clone level (each clonotype counted once) rather than the cell level, so that a few highly expanded clones do not dominate the estimate. V gene usage reflects thymic selection and antigen exposure history. J gene usage is more uniform but shows disease-specific skewing in autoimmune conditions.
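The difference between clone-level and cell-level usage is easiest to see on a toy table. The sketch below assumes a clonotype table with `cdr3_aa`, `v_gene`, and `count` columns (the names listed for `clonotypes.csv` in the skill file); the data are made up for illustration.

```python
import pandas as pd

# Hypothetical clonotype table: one row per unique clonotype.
clonotypes = pd.DataFrame({
    "cdr3_aa": ["CASSLGQETQYF", "CASSPDRGYEQYF", "CASSQETQYF", "CASSLAGGTDTQYF"],
    "v_gene": ["TRBV5-1", "TRBV5-1", "TRBV7-9", "TRBV20-1"],
    "count": [90, 5, 3, 2],  # cells per clonotype
})

# Clone-level usage: every clonotype contributes once, regardless of size.
clone_level = clonotypes["v_gene"].value_counts(normalize=True)

# Cell-level usage: every clonotype is weighted by its cell count.
cell_level = clonotypes.groupby("v_gene")["count"].sum() / clonotypes["count"].sum()

print(clone_level)
print(cell_level)
```

Here a single expanded TRBV5-1 clone pushes the cell-level frequency of that gene far above its clone-level frequency, which is the expansion bias the clone-level estimate avoids.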
Clonotype Definition
Two methods supported:
- Exact: identical CDR3 amino acid sequence + V gene
- Hamming: single-linkage clustering of same-length CDR3s at normalized Hamming distance ≤ 0.15, grouping near-identical sequences such as those produced by somatic hypermutation in BCRs
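The Hamming method can be illustrated with a union-find over all same-length pairs. This is a sketch under stated assumptions rather than the repository's code: the function name `hamming_clusters` is hypothetical, V genes are ignored, and the pairwise loop is the O(n²)-per-length-group approach.

```python
from collections import defaultdict
from itertools import combinations


def hamming_clusters(cdr3s, threshold=0.15):
    """Single-linkage clusters of same-length CDR3s by normalized Hamming distance."""
    parent = list(range(len(cdr3s)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Hamming distance is only defined for equal lengths, so group by length first.
    by_length = defaultdict(list)
    for idx, seq in enumerate(cdr3s):
        by_length[len(seq)].append(idx)

    for length, members in by_length.items():
        if length == 0:
            continue  # skip empty sequences to avoid division by zero
        for i, j in combinations(members, 2):
            mismatches = sum(a != b for a, b in zip(cdr3s[i], cdr3s[j]))
            if mismatches / length <= threshold:
                parent[find(i)] = find(j)  # merge the two clusters

    clusters = defaultdict(list)
    for idx, seq in enumerate(cdr3s):
        clusters[find(idx)].append(seq)
    return sorted(clusters.values())


seqs = ["CASSLGQETQYF", "CASSLGQETQFF", "CASSLGAETQFF", "CASSPDRGYEQYF"]
clusters = hamming_clusters(seqs)
print(clusters)
```

In this example the first and third sequences differ at 2 of 12 positions (distance 0.167, above the threshold) but still land in the same cluster because each is within 1 mismatch of the second. That chaining is a property of single linkage.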
Diversity Metrics
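All metrics are computed on the clone-level frequency distribution $p_i = c_i / \sum_j c_j$, where $c_i$ is the cell count of clone $i$ and $N$ is the number of clones (richness):
| Metric | Formula | Interpretation |
|---|---|---|
| Shannon entropy | $H = -\sum_i p_i \ln p_i$ | Overall diversity |
| Normalized Shannon | $H / \ln N$ | 0=monoclonal, 1=uniform |
| Clonality | $1 - H / \ln N$ | 0=diverse, 1=monoclonal |
| Gini coefficient | $\frac{1}{N}\sum_{i=1}^{N} (2i - N - 1)\, p_{(i)}$, with $p_{(i)}$ sorted in ascending order | Clone size inequality |
| D50 | $\min\{k : \sum_{i=1}^{k} p_{[i]} \ge 0.5\}$, with $p_{[i]}$ sorted in descending order | Clones covering 50% |
| Simpson index | $\sum_i p_i^2$ | Probability of same clone |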
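The metrics in the table can be computed from a vector of clone counts with plain NumPy. The sketch below assumes the standard definitions (natural-log Shannon entropy, clonality as one minus normalized Shannon, Gini from sorted frequencies, Simpson as the sum of squared frequencies); the function name `diversity_metrics` is hypothetical and not taken from the repository.

```python
import numpy as np


def diversity_metrics(counts):
    """Diversity metrics from per-clone cell counts (assumed standard definitions)."""
    c = np.asarray(counts, dtype=float)
    c = c[c > 0]                      # drop empty clones so log(p) is defined
    p = c / c.sum()                   # clone frequencies
    n = p.size                        # richness

    shannon = float(-(p * np.log(p)).sum())
    normalized = shannon / np.log(n) if n > 1 else 0.0

    # Gini coefficient from frequencies sorted in ascending order.
    p_asc = np.sort(p)
    ranks = np.arange(1, n + 1)
    gini = float(((2 * ranks - n - 1) * p_asc).sum() / n)

    # D50: smallest number of top clones whose cumulative frequency reaches 50%.
    cum = np.cumsum(p_asc[::-1])
    d50 = int(np.searchsorted(cum, 0.5 - 1e-12) + 1)  # small tolerance for float rounding

    return {
        "richness": int(n),
        "shannon": shannon,
        "normalized_shannon": float(normalized),
        "clonality": float(1.0 - normalized),
        "gini": gini,
        "d50": d50,
        "simpson": float((p ** 2).sum()),
    }


print(diversity_metrics([4, 2, 1, 1]))
```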
Public Clonotype Detection
CDR3 amino acid sequences shared across ≥2 samples identified by exact string matching. Public clonotypes arise from convergent V(D)J recombination driven by shared antigen exposure or structural constraints on CDR3 sequence space.
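Exact-match public clonotype detection reduces to counting the distinct samples in which each CDR3 appears. The following is a minimal pandas sketch; the `sample` column name and the toy data are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical per-sample clonotype observations.
cells = pd.DataFrame({
    "sample": ["S1", "S1", "S2", "S2", "S3", "S3"],
    "cdr3_aa": [
        "CASSLGQETQYF", "CASSPDRGYEQYF",
        "CASSLGQETQYF", "CASSQETQYF",
        "CASSLGQETQYF", "CASSQETQYF",
    ],
})

# Number of distinct samples in which each CDR3 is observed.
n_samples = cells.groupby("cdr3_aa")["sample"].nunique()

# Public clonotypes: CDR3s present in at least 2 samples.
public = n_samples[n_samples >= 2].sort_values(ascending=False)
public_fraction = len(public) / len(n_samples)

print(public)
print(f"public fraction: {public_fraction:.1%}")
```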
Results
On a synthetic TRB repertoire (500 clonotypes, 5,000 cells, 20 expanded clones, 3 samples, seed=42):
| Metric | Value |
|---|---|
| Richness | 500 |
| Shannon Entropy | 4.84 |
| Normalized Shannon | 0.78 |
| Clonality | 0.22 |
| Gini Coefficient | 0.66 |
| D50 | 13 |
| Simpson Index | 0.0245 |
| Top 1 Clone | 5.8% |
| Top 10 Clones | 44.9% |
| CDR3 Mean Length | 12.4 ± 2.8 AA |
| Public Clonotypes | 25 (5.0%) |
| Runtime | <10s CPU |
The Gini coefficient of 0.66 and D50 of 13 indicate moderate clonal expansion consistent with an antigen-experienced repertoire. The top 10 clones account for 44.9% of the repertoire, reflecting the power-law distribution of clone sizes.
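The reported entropy, richness, and clonality can be cross-checked against each other. The short check below assumes that normalized Shannon is the entropy divided by the natural log of richness and that clonality is its complement; the input values are the ones printed by the pipeline.

```python
import math

richness = 500      # reported number of clonotypes
shannon = 4.8365    # reported Shannon entropy (natural log assumed)

normalized = shannon / math.log(richness)
clonality = 1.0 - normalized

print(f"normalized Shannon: {normalized:.4f}")
print(f"clonality:          {clonality:.4f}")
```

Under these assumptions the computed clonality agrees with the reported value of 0.2218 to four decimal places.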
Availability
GitHub: https://github.com/junior1p/ImmunRepertoire
Discussion
ImmunRepertoire fills a gap for researchers who need a reproducible immune repertoire analysis stack with no dependencies beyond the standard scientific Python packages. Because all algorithms are implemented in NumPy, SciPy, and pandas, the pipeline is fully auditable, easily containerizable, and runs without compilation or the environment conflicts that compiled binaries and R packages can introduce.
Key design decisions: (1) clone-level (not cell-level) gene usage avoids expansion bias; (2) both exact and Hamming-distance clonotype definitions are supported; (3) public clonotype injection in synthetic data simulates convergent recombination.
Limitations: the current implementation requires pre-processed CDR3 sequences (no raw FASTQ alignment). Integration with TRUST4 or MiXCR output formats is planned. The Hamming clustering is O(n²) per length group and may be slow for >10,000 clonotypes.
Conclusion
ImmunRepertoire provides a complete, pure-Python immune repertoire analysis toolkit covering CDR3 profiling, gene usage, diversity metrics, and public clonotype detection. On the synthetic benchmark (500 clonotypes, 5,000 cells) the pipeline runs in under 10 seconds on CPU and requires no external binaries or R packages, making it suitable for AI agent workflows and reproducible research environments.
Reproducibility: Skill File
Use this skill file to reproduce the research with an AI agent.
---
name: immunrepertoire
description: >
  ImmunRepertoire: Pure Python TCR/BCR immune repertoire analysis engine.
  Use for: CDR3 analysis, V/D/J gene usage, clonal diversity (Shannon, Gini, D50),
  public clonotype detection, clonal expansion profiling.
  Triggers on: "immune repertoire", "TCR", "BCR", "CDR3", "V(D)J", "clonotype",
  "clonal expansion", "repertoire diversity", "MiXCR", "VDJtools", "immunarch",
  "TRUST4", "Shannon entropy", "Gini", "D50", "public clonotype".
allowed-tools: Bash(python3 *), Bash(pip install *), Bash(git clone *), Bash(curl *)
---
# ImmunRepertoire — Pure Python Immune Repertoire Analysis
> **Reviewer contract**: Every number in the research note is produced by the steps below.
> Run them in order. Each step prints its key output to stdout.
## Step 1 — Install dependencies
```bash
pip install numpy scipy pandas matplotlib
```
Expected: no errors. All packages available in standard Python environments.
## Step 2 — Clone the repository
```bash
git clone https://github.com/junior1p/ImmunRepertoire.git
cd ImmunRepertoire
```
## Step 3 — Run the pipeline (reproduces all paper numbers)
```bash
python3 immunrepertoire.py \
  --chain TRB \
  --n-clonotypes 500 \
  --n-cells 5000 \
  --n-expanded 20 \
  --n-samples 3 \
  --out-dir immunrepertoire_output \
  --seed 42
```
**Expected output:**
```
[ImmunRepertoire] ✓ Analysis complete.
Richness: 500 clonotypes
Shannon entropy: 4.8365
Clonality: 0.2218
Gini coefficient: 0.6589
D50: 13
Public clonotypes: 25 (5.0%)
CDR3 mean length: 12.4 ± 2.8 AA
```
## Step 4 — Verify output files
```bash
ls immunrepertoire_output/
# Expected: clonotypes.csv v_gene_usage.csv j_gene_usage.csv
# public_clonotypes.csv diversity_metrics.csv
# summary.json immunrepertoire_dashboard.png
```
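The same check can be scripted, which is convenient when an agent needs a pass/fail signal. This is a small sketch that only tests for the presence of the files listed above; the helper name `missing_outputs` is hypothetical and not part of the package.

```python
from pathlib import Path

# File names taken from the expected listing in Step 4.
EXPECTED_FILES = [
    "clonotypes.csv",
    "v_gene_usage.csv",
    "j_gene_usage.csv",
    "public_clonotypes.csv",
    "diversity_metrics.csv",
    "summary.json",
    "immunrepertoire_dashboard.png",
]


def missing_outputs(out_dir):
    """Return the expected output files that are absent from out_dir."""
    out = Path(out_dir)
    return [name for name in EXPECTED_FILES if not (out / name).is_file()]


if __name__ == "__main__":
    missing = missing_outputs("immunrepertoire_output")
    print("all outputs present" if not missing else f"missing: {missing}")
```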
## Step 5 — Run with IGH chain (generalizability check)
```bash
python3 immunrepertoire.py \
  --chain IGH \
  --n-clonotypes 300 \
  --n-cells 3000 \
  --n-samples 4 \
  --out-dir immunrepertoire_igh \
  --seed 99
```
**Expected:** Richness=300, Shannon>4.0, Clonality<0.35, runtime <15s.
## Python API
```python
from immunrepertoire import run_immunrepertoire

summary = run_immunrepertoire(
    out_dir="output",
    chain="TRB",               # TRA | TRB | IGH
    n_clonotypes=500,
    n_cells=5000,
    n_expanded=20,
    n_samples=3,
    clonotype_method="exact",  # exact | hamming
    rng_seed=42,
)
print(summary["diversity"])
```
## Output Files
```
output/
├── clonotypes.csv # unique clonotypes: cdr3_aa, v_gene, d_gene, j_gene, count
├── v_gene_usage.csv # V gene frequency table
├── j_gene_usage.csv # J gene frequency table
├── public_clonotypes.csv # CDR3s shared across ≥2 samples
├── diversity_metrics.csv # Shannon, Gini, D50, Simpson, clonality
├── summary.json # full pipeline summary
└── immunrepertoire_dashboard.png # 6-panel visualization
```