TranspoScan: A Heterogeneous Graph Neural Network for Transposable Element Classification
TranspoScan - Research Note
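1 Motivation and Problem Setting
Let the metagenomic assembly be a set of unitigs (nodes) $V=\{v_1,\dots,v_N\}$. Each unitig $v_i$ carries a nucleotide sequence $s_i$ and a mean coverage depth $d_i$, and the assembler additionally outputs a set of overlap edges $E_{olp}$. The classification goal is to learn a mapping $f_\theta: V \to \{0,1,2,3,4,5,6\}$, where classes 0-5 correspond to six TE superfamilies and class 6 is non-TE sequence (the negative class). Traditional methods rely on sequence features alone. The core idea of TranspoScan is that the topology of the assembly graph is itself a strong discriminative signal, especially for Helitron elements, which lack terminal inverted repeats (TIRs), and for heavily fragmented LTR retrotransposons.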
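2 Constructing the Heterogeneous Assembly Graph
Definition 1 (Heterogeneous assembly graph). Let $G^H=(V,\ E_{olp} \cup E_{mp} \cup E_{cov})$, where the three edge types are defined as follows:
- Overlap edges $E_{olp}$: k-mer overlaps from the de Bruijn graph, read directly from the GFA output of metaSPAdes.
- Paired-end edges $E_{mp}$: two unitigs are linked when a read pair spans their boundary, with an insert size ≤ 1 kbp.
- Coverage-similarity edges $E_{cov}$: link unitigs with similar coverage depth, capturing the co-assembly signal associated with copy number.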
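A minimal sketch of how the coverage-similarity edges $E_{cov}$ might be built. The skill file later in this note lists a threshold τ_cov = 0.15 and a k-d tree over log(depth); the exact edge criterion is not stated, so the rule used here, $|\log d_i - \log d_j| \le \tau_{cov}$, is an assumption. Because log-depth is one-dimensional, a sorted array with binary search stands in for the k-d tree. The function name `coverage_edges` is hypothetical.

```python
import math
from bisect import bisect_right


def coverage_edges(depths, tau=0.15):
    """Return undirected edges (i, j), i < j, between unitigs whose
    log-depths differ by at most tau (assumed criterion).

    depths: list of positive mean coverage depths, one per unitig.
    """
    # Sort unitig indices by depth so that near neighbours are contiguous.
    order = sorted(range(len(depths)), key=lambda i: depths[i])
    logs = [math.log(depths[i]) for i in order]

    edges = []
    for a, log_a in enumerate(logs):
        # Index one past the last unitig with log-depth <= log_a + tau.
        hi = bisect_right(logs, log_a + tau)
        for b in range(a + 1, hi):
            i, j = order[a], order[b]
            edges.append((min(i, j), max(i, j)))
    return sorted(edges)


if __name__ == "__main__":
    # Two low-depth unitigs, two mid-depth unitigs, one high-depth outlier.
    print(coverage_edges([10.0, 11.0, 50.0, 52.0, 200.0]))
```

Sorting costs O(N log N), but the number of edges can still grow quadratically inside a dense depth cluster, so a cap on neighbours per node may be needed in practice.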
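3 Feature Stream Design
The final input feature of each node $v_i$ is the projection of four concatenated feature streams:
$$x_i = \mathrm{LayerNorm}\big(\mathrm{MLP}_{proj}([\,f_i^{TF} \,\|\, f_i^{DE} \,\|\, f_i^{CP} \,\|\, f_i^{GS}\,])\big) \in \mathbb{R}^{256}$$
The stream dimensions are $f_i^{TF}\in\mathbb{R}^{32}$ (trinucleotide frequencies), $f_i^{DE}\in\mathbb{R}^{1280}$ (ESM-2 protein-domain embedding), $f_i^{CP}\in\mathbb{R}^{3}$ (coverage profile), and $f_i^{GS}\in\mathbb{R}^{128}$ (R-GCN graph-structure embedding), so the concatenated vector has 1443 dimensions before projection.
Nodes without an ORF receive a trainable null embedding rather than a fixed zero vector, which avoids the vanishing-gradient problem a constant zero input would cause.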
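The 32-dimensional trinucleotide stream follows from merging each of the 64 trinucleotides with its reverse complement. The sketch below illustrates that reduction; the handling of ambiguous bases (skipped) and the normalisation to frequencies summing to 1 are assumptions, and the function names are hypothetical.

```python
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(kmer):
    """Reverse complement of a DNA string over A, C, G, T."""
    return kmer.translate(_COMPLEMENT)[::-1]


# Canonical form = lexicographically smaller of a 3-mer and its reverse
# complement. No 3-mer is its own reverse complement, so 64 collapse to 32.
CANONICAL = sorted(
    {min(k, revcomp(k)) for k in ("".join(p) for p in product("ACGT", repeat=3))}
)
INDEX = {k: i for i, k in enumerate(CANONICAL)}


def trinucleotide_frequencies(seq):
    """Return a 32-dim list of strand-independent trinucleotide frequencies."""
    seq = seq.upper()
    counts = [0] * len(CANONICAL)
    for i in range(len(seq) - 2):
        kmer = seq[i:i + 3]
        idx = INDEX.get(min(kmer, revcomp(kmer)))
        if idx is not None:  # skip windows containing N or other symbols
            counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


if __name__ == "__main__":
    vec = trinucleotide_frequencies("AAATTT")
    print(len(vec), vec[INDEX["AAA"]], vec[INDEX["AAT"]])
```

A sequence and its reverse complement map to the same vector, which is the point of the merge: the strand on which a unitig was assembled is arbitrary.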
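4 Model Architecture
The classification backbone of TranspoScan is a three-layer heterogeneous graph attention network (HAN). It computes attention separately for each relation type and then fuses the per-relation results with semantic-level attention. A two-layer fully connected MLP follows and outputs the seven class probabilities through a softmax.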
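For reference, one common way to write a layer of this kind combines GAT-style node-level attention per relation with HAN-style semantic attention. The note does not give TranspoScan's exact equations, so the form below is an assumption; the symbols $W_r$, $\mathbf{a}_r$, $W_s$, $b_s$, $\mathbf{q}$ and the neighbourhood $\mathcal{N}_i^{r}$ of node $i$ under relation $r$ are introduced here for illustration only.

```latex
% Node-level attention for each relation r \in \{olp, mp, cov\}
e_{ij}^{r} = \mathrm{LeakyReLU}\!\left(\mathbf{a}_r^{\top}\,[\,W_r h_i \,\|\, W_r h_j\,]\right),
\qquad
\alpha_{ij}^{r} = \frac{\exp(e_{ij}^{r})}{\sum_{k \in \mathcal{N}_i^{r}} \exp(e_{ik}^{r})}

% Relation-specific embedding of node i
z_i^{r} = \sigma\!\Big(\sum_{j \in \mathcal{N}_i^{r}} \alpha_{ij}^{r}\, W_r h_j\Big)

% Semantic-level attention over the three relations
w_r = \frac{1}{|V|}\sum_{i \in V} \mathbf{q}^{\top}\tanh\!\left(W_s z_i^{r} + b_s\right),
\qquad
\beta_r = \frac{\exp(w_r)}{\sum_{r'} \exp(w_{r'})},
\qquad
h_i' = \sum_{r} \beta_r\, z_i^{r}
```

The weights $\beta_r$ are shared by all nodes, so they can be read as a global estimate of how informative each edge type is for the classification task.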
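5 Training Details
Class imbalance is handled with inverse-frequency weighted cross-entropy. The optimizer is AdamW with a learning rate of $3\times10^{-4}$, and training runs for up to 200 epochs with early stopping.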
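A small sketch of inverse-frequency weighting. The note does not state the normalisation, so the convention $w_c = N / (K \cdot n_c)$ (total samples over number of classes times class count) is an assumption, as is dividing the loss by the sum of the per-sample weights. The label smoothing listed in the skill file is omitted here, and both function names are hypothetical.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels, num_classes):
    """Weight for class c is N / (K * n_c); classes absent from labels get 0."""
    counts = Counter(labels)
    n = len(labels)
    return [
        n / (num_classes * counts[c]) if counts[c] else 0.0
        for c in range(num_classes)
    ]


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class), normalised by the weight sum.

    probs: list of per-sample probability lists (each sums to 1).
    """
    numerator = sum(-weights[y] * math.log(p[y]) for p, y in zip(probs, labels))
    denominator = sum(weights[y] for y in labels)
    return numerator / denominator


if __name__ == "__main__":
    labels = [0, 0, 0, 1]
    w = inverse_frequency_weights(labels, num_classes=2)
    print(w)  # the rare class 1 receives the larger weight
    probs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.4, 0.6]]
    print(weighted_cross_entropy(probs, labels, w))
```

With this convention every class contributes the same total weight to the loss, which is why a rare class is not drowned out by the majority non-TE class.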
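6 Summary of Experimental Results
On the BenchTE-2026 test set, TranspoScan reaches a macro-averaged $F_1$ of 0.891, exceeding the second-best method by 6.2 percentage points, and speeds up inference by 3.4×. Ablation experiments show that the protein-domain embedding and the graph-structure embedding contribute the largest performance gains.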
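Because the headline metric is macro-averaged $F_1$, every class counts equally regardless of how many unitigs it contains. The sketch below shows the standard computation on toy labels; it is not the note's evaluation script, and assigning $F_1 = 0$ to a class with no true or predicted members is an assumed convention.

```python
def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1 = 2*TP / (2*TP + FP + FN)."""
    scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # Assumed convention: a class never seen and never predicted scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes


if __name__ == "__main__":
    y_true = [0, 0, 1, 1, 2]
    y_pred = [0, 1, 1, 1, 2]
    print(macro_f1(y_true, y_pred, num_classes=3))
```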
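7 Open Problems and Future Directions
- Unsupervised transfer and graph self-supervised pretraining
- Direct application at the read level
- Distillation of the ESM-2 model for faster inference
- Uncertainty quantification for predictions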
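8 Conclusion
This work uses a heterogeneous assembly graph to unify multiple relation types, fuses four feature streams, and combines a trainable null-ORF embedding with an imbalance-aware loss. It achieves state-of-the-art performance on transposable element classification and shows that graph topology is critical for identifying fragmented TEs.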
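A Pseudocode: Heterogeneous Message Passing
Algorithm 1 Single-layer heterogeneous message passing in TranspoScan
Require: graph $G^H$, node embeddings, parameters
Ensure: updated embeddings
1: for r ∈ {olp, mp, cov} do
2:     compute attention coefficients and the relation-specific embedding for every node
3: end for
4: fuse the three relation-specific representations with semantic attention
5: return updated embeddings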
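The semantic-attention fusion step of Algorithm 1 can be illustrated in plain Python. This is a simplified sketch, not the model's implementation: each relation is scored by the node average of $\mathbf{q}^{\top}\tanh(z_i^{r})$, omitting the learned linear map and bias that a full HAN layer would apply before the tanh. The function names and the scoring vector `q` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def semantic_fusion(rel_embeddings, q):
    """Fuse per-relation node embeddings with one weight per relation.

    rel_embeddings: dict mapping relation name -> list of node vectors,
                    with the same node order and dimension for every relation.
    q: scoring vector with the same dimension as the node vectors.
    Returns (relation weights, fused node vectors).
    """
    relations = list(rel_embeddings)
    scores = []
    for r in relations:
        nodes = rel_embeddings[r]
        per_node = [sum(qk * math.tanh(zk) for qk, zk in zip(q, z)) for z in nodes]
        scores.append(sum(per_node) / len(nodes))  # average over nodes
    beta = softmax(scores)

    n_nodes = len(rel_embeddings[relations[0]])
    fused = [
        [
            sum(b * rel_embeddings[r][i][k] for b, r in zip(beta, relations))
            for k in range(len(q))
        ]
        for i in range(n_nodes)
    ]
    return dict(zip(relations, beta)), fused


if __name__ == "__main__":
    z = {
        "olp": [[1.0, 1.0], [0.5, 0.2]],
        "mp": [[0.0, 0.0], [0.1, 0.0]],
        "cov": [[0.2, -0.1], [0.0, 0.3]],
    }
    weights, fused = semantic_fusion(z, q=[1.0, 1.0])
    print(weights)
```

The relation weights sum to 1 and are shared by all nodes, so a relation whose embeddings score higher under `q` dominates the fused representation.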
Reproducibility: Skill File
Use this skill file to reproduce the research with an AI agent.
# TranspoScan Reproducibility Skill File
# Purpose: Ensure full reproducibility of TranspoScan model training & inference
# Version: 2026.04.13
# Framework: PyTorch + PyTorch Geometric + ESM-2
# --------------------------
# 1. Environment & Dependencies
# --------------------------
python==3.10
torch==2.1.0
torch_geometric>=2.4.0
torch_scatter>=2.1.1
torch_sparse>=0.6.18
numpy>=1.24.0
pandas>=2.0.0
scipy>=1.10.0
scikit-learn>=1.2.0
biopython>=1.81
prodigal-binary>=2.6.3
bwa-mem2>=2.2.1
metaspades==3.15.5
faiss-cpu>=1.7.4
# --------------------------
# 2. Hardware Specification
# --------------------------
GPU: NVIDIA A100 40GB
CUDA: 12.1
CPU: 16+ cores
RAM: 64GB+
Storage: 200GB+ for datasets & embeddings
# --------------------------
# 3. Dataset Construction
# --------------------------
BenchTE-2026:
- Simulated: 6 biomes (soil/gut/marine/plant/freshwater/sediment)
- Real: HMP SRR2726667, TARA_038, JGI IMG Rhizosphere
- Total: 4.22 Gbp, 1,353,221 labeled contigs
- Label set: {LTR/Copia, LTR/Gypsy, LINE/L1, SINE/Alu, DNA/TIR, Helitron, Non-TE}
# --------------------------
# 4. Assembly Graph Construction
# --------------------------
Assembler: metaSPAdes v3.15.5
Input: Illumina 150bp paired-end reads
Graph extraction: GFA from metaSPAdes
Edge types:
- olp: k-mer overlap edges (de Bruijn graph)
- mp: paired-end read-pair edges (insert size ≤1000bp)
- cov: coverage-similarity edges (τ_cov=0.15)
Cov edge optimization: k-d tree on log(depth), O(N log N)
# --------------------------
# 5. Feature Streams (Fixed)
# --------------------------
1. Trinucleotide Frequency (TF): 64 trinucleotides (3-mers) → 32-dim (rev-comp merged)
2. Domain Embedding (DE): ESM-2 650M → 1280-dim; no ORF → learnable e_ø
3. Coverage Profile (CP): [mean_depth, cv_depth, GC] → 3-dim
4. Graph Structure (GS): 2-layer R-GCN pretrained via link prediction → 128-dim
Fusion: LayerNorm(MLP([TF||DE||CP||GS])) → 256-dim node embedding
# --------------------------
# 6. Model Architecture (Fixed)
# --------------------------
Backbone: 3-layer Heterogeneous Graph Attention Network (HAN)
Relations: olp / mp / cov
Semantic attention: fused per-relation embeddings
Head: 2-layer MLP (256→128→7) + softmax
Activation: GELU
Dropout: 0.1 (only MLP)
# --------------------------
# 7. Training Recipe (Fixed)
# --------------------------
Loss: inverse frequency weighted cross-entropy
Optimizer: AdamW
Learning rate: 3e-4
Weight decay: 1e-2
Scheduler: cosine annealing
Epochs: 200
Early stop patience: 20
Batch neighbor sampling: [15,10,5]
Label smoothing: 0.05
Batch size: dynamic (graph-based)
# --------------------------
# 8. Evaluation Metrics
# --------------------------
Macro-averaged: Precision / Recall / F1
Speed: min/Gbp
Ablation: feature-stream removal
Zero-shot: withheld Gypsy-56 clade recall
# --------------------------
# 9. Reproducibility Commands
# --------------------------
# Assemble
metaspades.py -1 reads_R1.fq -2 reads_R2.fq -o asm --meta
# Extract graph
python scripts/extract_gfa.py asm/assembly_graph_with_scaffolds.gfa
# Build features
python build_features.py --graph asm/graph.pkl --out features.pkl
# Train
python train.py --config configs/transposcan.yaml --data BenchTE-2026
# Evaluate
python evaluate.py --ckpt best.pth --test BenchTE-2026_test
# --------------------------
# 10. Fixed Random Seeds
# --------------------------
SEED=42
PYTHONHASHSEED=42
TORCH_SEED=42
NUMPY_SEED=42
# --------------------------
# 11. Expected Output
# --------------------------
Macro F1: 0.891 ±0.005
Speed: 30 min/Gbp