scMultiome: Single-Cell Multimodal Integration Pipeline for scRNA-seq and scATAC-seq with Gene Regulatory Network Inference
Max | GitHub: junior1p | https://github.com/junior1p/scMultiome
Abstract
scMultiome is an end-to-end Python pipeline for integrating paired single-cell RNA sequencing (scRNA-seq) and assay for transposase-accessible chromatin sequencing (scATAC-seq) data from multiome platforms (10x Multiome, SHARE-seq, SNARE-seq). The pipeline combines scGLUE (graph-linked unified embedding) and MOFA+ (multi-omics factor analysis) for multimodal dimensionality reduction, adds marker-based cell type annotation validated across both modalities, and infers cis-regulatory gene regulatory networks (GRNs) from the cosine similarity of GLUE embeddings.
1. Introduction
Single-cell multi-omics enables simultaneous profiling of gene expression and chromatin accessibility from the same individual cell, eliminating computational cell-matching problems inherent in unpaired datasets. 10x Genomics Multiome, SHARE-seq, and SNARE-seq generate such paired data at scale.
Two state-of-the-art integration approaches take different routes to a joint embedding: scGLUE uses genomic coordinate proximity (peaks near genes) as a biological knowledge graph prior, while MOFA+ learns latent factors without external prior information.
2. Pipeline Architecture
2.1 Data Loading
Accepts: (A) automatic PBMC 10k demo download, (B) a user-provided 10x .h5 or .h5mu file, or (C) separate RNA and ATAC .h5ad files. Uses the MuData framework throughout.
2.2 Quality Control
- RNA: gene count (200–7500), mitochondrial fraction <20%
- ATAC: peak count (1000–30000), total counts (2000–100000)
- Intersection: only cells present in both modalities proceed
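The thresholds above can be sketched as a small filter. This is an illustrative sketch, not the pipeline's actual code: the function names, the per-cell tuple layout, and the choice of inclusive bounds are assumptions, and the barcodes are made up.

```python
# Thresholds taken from the QC list above; inclusive bounds are an assumption.
RNA_GENES = (200, 7500)
RNA_MAX_MITO_PCT = 20.0
ATAC_PEAKS = (1000, 30000)
ATAC_COUNTS = (2000, 100000)


def rna_pass(n_genes, pct_mito):
    """True if a cell passes the RNA gene-count and mitochondrial filters."""
    return RNA_GENES[0] <= n_genes <= RNA_GENES[1] and pct_mito < RNA_MAX_MITO_PCT


def atac_pass(n_peaks, total_counts):
    """True if a cell passes the ATAC peak-count and total-count filters."""
    return (ATAC_PEAKS[0] <= n_peaks <= ATAC_PEAKS[1]
            and ATAC_COUNTS[0] <= total_counts <= ATAC_COUNTS[1])


def cells_passing_qc(rna_qc, atac_qc):
    """Return barcodes that pass QC in both modalities (set intersection)."""
    rna_ok = {bc for bc, (genes, mito) in rna_qc.items() if rna_pass(genes, mito)}
    atac_ok = {bc for bc, (peaks, counts) in atac_qc.items() if atac_pass(peaks, counts)}
    return sorted(rna_ok & atac_ok)


# Hypothetical per-cell QC metrics keyed by barcode.
rna_qc = {"AAAC": (1500, 5.0), "AAAG": (150, 3.0),
          "AACT": (2200, 25.0), "AAGG": (3000, 8.0)}
atac_qc = {"AAAC": (5000, 12000), "AAAG": (4000, 9000),
           "AAGG": (800, 1500), "AATT": (6000, 15000)}

kept = cells_passing_qc(rna_qc, atac_qc)
print(kept)  # ['AAAC']
```

Only AAAC survives: AAAG and AACT fail the RNA filters, AAGG fails the ATAC filters, and AATT has no RNA measurement.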
2.3 Preprocessing
- RNA: normalize → log1p → HVG selection (3000) → scale → PCA (30)
- ATAC: TF-IDF → LSI (truncated SVD on the 30k most variable peaks, HVPs) → LSI components used in place of PCA
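For readers unfamiliar with the ATAC branch, the TF-IDF step can be sketched as follows. Several TF-IDF variants are in use for scATAC-seq and the text does not say which one the pipeline applies; the log-scaled form below, the `scale` factor of 1e4, and the function name `tfidf` are assumptions. The subsequent LSI step (truncated SVD) is omitted.

```python
import math


def tfidf(counts, scale=1e4):
    """Log-scaled TF-IDF on a cells x peaks count matrix (list of rows).

    TF  = count / total counts in the cell
    IDF = number of cells / number of cells in which the peak is open
    """
    n_cells = len(counts)
    n_peaks = len(counts[0])
    cell_totals = [sum(row) for row in counts]
    cells_with_peak = [sum(1 for row in counts if row[j] > 0)
                       for j in range(n_peaks)]

    out = []
    for row, total in zip(counts, cell_totals):
        new_row = []
        for j, x in enumerate(row):
            if x == 0 or total == 0:
                new_row.append(0.0)  # closed peaks stay at zero
            else:
                tf = x / total
                idf = n_cells / cells_with_peak[j]
                new_row.append(math.log1p(tf * idf * scale))
        out.append(new_row)
    return out


# Toy matrix: peak 0 is open in every cell, peaks 1 and 2 in one cell each.
counts = [[1, 1, 0],
          [1, 0, 1],
          [1, 0, 0]]
weights = tfidf(counts)
```

In this toy matrix the ubiquitous peak 0 receives a lower weight in cell 0 than the rare peak 1, which is the purpose of the IDF term.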
2.4 Multimodal Integration
Two complementary methods:
- scGLUE: genomic proximity knowledge graph (1 Mb window) → shared latent embedding
- MOFA+: Bayesian factor model for shared + modality-specific factors
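The genomic proximity graph used as the scGLUE prior can be illustrated with a plain-Python sketch. This does not call scGLUE itself. It assumes that "1 Mb window" means a peak lies within 1 Mb of the gene's transcription start site (TSS) on the same chromosome; the `Gene` and `Peak` types, the function names, and all coordinates are hypothetical.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    chrom: str
    tss: int  # transcription start site


class Peak(NamedTuple):
    chrom: str
    start: int
    end: int


WINDOW = 1_000_000  # 1 Mb, applied on each side of the TSS (assumption)


def distance_to_tss(peak, gene):
    """0 if the TSS falls inside the peak, else distance to the nearest edge."""
    if peak.start <= gene.tss <= peak.end:
        return 0
    return min(abs(gene.tss - peak.start), abs(gene.tss - peak.end))


def proximity_edges(genes, peaks, window=WINDOW):
    """Link each gene to every same-chromosome peak within the window."""
    edges = []
    for gene in genes:
        for peak in peaks:
            if peak.chrom == gene.chrom and distance_to_tss(peak, gene) <= window:
                edges.append((gene.name, f"{peak.chrom}:{peak.start}-{peak.end}"))
    return edges


# Hypothetical gene and peaks; coordinates are made up for illustration.
genes = [Gene("GENE_A", "chr1", 5_000_000)]
peaks = [
    Peak("chr1", 4_200_000, 4_200_500),  # 799,500 bp away: linked
    Peak("chr1", 6_500_000, 6_500_400),  # 1.5 Mb away: not linked
    Peak("chr2", 5_000_000, 5_000_300),  # other chromosome: not linked
]
edges = proximity_edges(genes, peaks)
print(edges)  # [('GENE_A', 'chr1:4200000-4200500')]
```

The nested loop is quadratic and only suitable for illustration; a real implementation would use an interval index over sorted coordinates.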
2.5 Cell Type Annotation
Canonical PBMC markers are scored per cell, and each cell receives the highest-scoring label. The adjusted Rand index (ARI) checks that labels are consistent across modalities.
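A minimal sketch of marker scoring and the ARI check follows. The marker panel is a small illustrative subset of well-known PBMC markers, not the pipeline's actual list, and the score (mean marker value) is an assumption. It is also assumed here that the ARI compares labels derived from RNA expression with labels derived from ATAC gene-activity scores; all values are made up.

```python
from collections import Counter
from math import comb

# Illustrative subset of canonical PBMC markers (not the pipeline's own panel).
MARKERS = {
    "T cell": ["CD3D", "CD3E"],
    "B cell": ["MS4A1", "CD79A"],
    "Monocyte": ["CD14", "LYZ"],
}


def annotate(cell):
    """Assign the label whose markers have the highest mean value in the cell."""
    scores = {label: sum(cell.get(g, 0.0) for g in genes) / len(genes)
              for label, genes in MARKERS.items()}
    return max(scores, key=scores.get)


def adjusted_rand_index(a, b):
    """ARI between two labelings of the same cells (pair-counting form)."""
    n = len(a)
    if n < 2:
        return 1.0
    sum_ij = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


# Hypothetical normalized values for four cells in each modality.
rna_cells = [{"CD3D": 2.1, "CD3E": 1.8, "LYZ": 0.1},
             {"MS4A1": 1.9, "CD79A": 2.4},
             {"CD14": 2.2, "LYZ": 3.0},
             {"CD3D": 1.5, "CD3E": 1.1}]
atac_cells = [{"CD3D": 0.9, "CD3E": 0.7},
              {"MS4A1": 0.8, "CD79A": 0.6},
              {"CD14": 0.5, "LYZ": 0.9},
              {"LYZ": 0.8, "CD3D": 0.2}]

rna_labels = [annotate(c) for c in rna_cells]
atac_labels = [annotate(c) for c in atac_cells]
ari = adjusted_rand_index(rna_labels, atac_labels)
```

An ARI of 1 means the two modalities give identical partitions of the cells; values near 0 indicate agreement no better than chance. In this toy example the fourth cell is labeled differently by the two modalities, so the ARI falls below 1.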
2.6 GRN Inference
GLUE embeddings place genes and peaks in shared vector space → cosine similarity >0.5 identifies cis-regulatory peak–gene pairs → optional TF motif scanning via JASPAR.
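The thresholding step can be sketched as below. The embeddings are assumed to be available as plain vectors keyed by feature name, and candidate pairs are assumed to be restricted to peak and gene pairs already linked by genomic proximity. The function names, feature names, and two-dimensional vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def link_peaks_to_genes(gene_emb, peak_emb, candidates, threshold=0.5):
    """Keep candidate (gene, peak) pairs whose cosine similarity exceeds the threshold."""
    links = []
    for gene, peak in candidates:
        score = cosine(gene_emb[gene], peak_emb[peak])
        if score > threshold:
            links.append((peak, gene, score))
    # Strongest links first.
    return sorted(links, key=lambda link: link[2], reverse=True)


# Hypothetical 2-D embeddings; real GLUE embeddings have more dimensions.
gene_emb = {"GENE_A": [1.0, 0.0]}
peak_emb = {
    "peak_1": [1.0, 1.0],   # cosine ~0.71: kept
    "peak_2": [0.0, 1.0],   # cosine 0.0: dropped
    "peak_3": [-1.0, 0.2],  # negative cosine: dropped
}
candidates = [("GENE_A", "peak_1"), ("GENE_A", "peak_2"), ("GENE_A", "peak_3")]

links = link_peaks_to_genes(gene_emb, peak_emb, candidates)
```

Each retained tuple is ordered as (peak, gene, score), mirroring the peak → gene direction of the links described above.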
3. Installation
pip install muon scanpy scglue anndata mofapy2 leidenalg python-igraph matplotlib seaborn pandas numpy scipy --break-system-packages -q
4. Usage
from multiome import run_multiome_skill
mdata, metrics, grn = run_multiome_skill(
    input_path=None,  # Downloads PBMC 10k automatically
    out_dir="results",
    run_scglue=True,
    run_mofa=True,
    run_grn=True,
    max_epochs=200
)
5. Output Files
| File | Description |
|---|---|
| multiome_integrated.h5mu | Complete MuData with all embeddings |
| cell_metadata.csv | Cell × cluster labels |
| peak_gene_links.csv | GLUE-scored peak → gene pairs |
| joint_umap_clusters.png | Main UMAP visualization |
6. Dependencies
muon≥0.1.6, scanpy≥1.9.6, scglue≥0.3.3, anndata≥0.10.0, mofapy2≥0.7.1, leidenalg≥0.10.1, python-igraph≥0.11.0, matplotlib≥3.7, seaborn≥0.12, pandas≥1.5, numpy≥1.24, scipy≥1.10, scikit-learn≥1.3, requests≥2.28
Python 3.9+. GPU (CUDA) optional but recommended for scGLUE.
References
- Bredikhin, D. et al. (2022). MUON: multimodal omics analysis framework. Genome Biology.
- Cao, Z.-J. & Gao, G. (2022). Multi-omics single-cell data integration and regulatory inference with graph-linked embedding. Nature Biotechnology.
- Hao, Y. et al. (2021). Integrated analysis of multimodal single-cell data. Cell.
Reproducibility: Skill File
Use this skill file to reproduce the research with an AI agent.
# SKILL scMultiome
## Trigger
Use this skill when the user wants to:
- Integrate paired scRNA-seq and scATAC-seq data (10x Multiome, SHARE-seq, SNARE-seq)
- Perform joint dimensionality reduction across RNA and chromatin accessibility
- Annotate cell types using both transcriptomic and epigenomic evidence
- Infer cis-regulatory peak–gene links and TF regulatory networks (GRNs)
- Benchmark integration quality across modalities
## Example triggers
- "Run scMultiome on my 10x Multiome data"
- "Integrate scRNA + scATAC and infer the gene regulatory network"
- "Jointly cluster my multiome data and annotate cell types"
## Step 0: Environment Setup
```bash
pip install muon scanpy scglue anndata mofapy2 leidenalg python-igraph matplotlib seaborn pandas numpy scipy --break-system-packages -q
```
## Step 1: Run Pipeline
```python
from multiome import run_multiome_skill
mdata, metrics, grn = run_multiome_skill(
input_path="your_multiome.h5mu",
out_dir="results",
run_scglue=True,
run_mofa=True,
run_grn=True,
max_epochs=200
)
```
## Output
- `multiome_integrated.h5mu`: full MuData object
- `cell_metadata.csv`: cluster labels per cell
- `peak_gene_links.csv`: GRN peak→gene pairs
- `joint_umap_clusters.png`: UMAP visualization