Peptide Virtual Screening: Structure-Based Peptide-Protein Binding Prediction
Peptide Virtual Screening with Structure-Based Binding Prediction
Abstract
We present a computational protocol for virtual screening of peptide candidates against target proteins, combining AlphaFold 3 structure prediction with binding interface analysis. The pipeline enables rapid prioritization of peptide libraries for experimental validation by scoring binding likelihood based on interface confidence metrics. This approach transforms single-structure predictions into a scalable screening platform.
1. Introduction
Peptide-protein interactions underpin critical biological processes including:
- Signal transduction and cellular communication
- Immune recognition and antibody-antigen binding
- Enzyme regulation and substrate recognition
- Gene regulation via transcription factor binding
Traditional peptide drug discovery relies on phage display or random peptide synthesis followed by experimental screening. Computational pre-screening can dramatically reduce the experimental burden by prioritizing candidates most likely to bind.
1.1 Challenges in Peptide-Protein Interaction Prediction
Unlike small molecule docking, peptide-protein interactions present unique challenges:
- Conformational flexibility: Peptides can adopt multiple conformations upon binding
- Interface size: Peptide interfaces are typically larger than small molecule binding sites
- Hotspot residues: Critical binding residues are often distributed along the peptide sequence
- Order/disorder transition: Many peptides are disordered in isolation but fold upon binding
1.2 Prior Work
The field has seen significant advances in recent years:
CAMP (Nature Communications, 2021): A deep learning framework for multi-level peptide-protein interaction prediction. CAMP combines sequence and structural features to predict both binary binding (yes/no) and binding residues. The model achieved state-of-the-art performance on benchmark datasets.
PepCNN (Scientific Reports, 2023): A convolutional neural network approach that uses sequence features, structural features, and protein language model embeddings to identify peptide binding residues. The method demonstrated high accuracy in predicting interfacial residues on protein surfaces.
AlphaFold 3 (Nature, 2024): Extended structure prediction to handle diverse molecular complexes including proteins, peptides, nucleic acids, and small molecules. AlphaFold 3 showed significant improvement over previous methods in predicting peptide-protein complex structures.
2. Methodology
2.1 Pipeline Overview
Our protocol combines these advances into a practical screening workflow:
Input: Target protein + Peptide library
??n ?????AlphaFold 3 Complex Prediction
?? (one prediction per peptide-target pair)
??n ?????Metric Extraction
?? - Interface pLDDT
?? - Inter-chain PAE
?? - Contact count
??n ?????Composite Scoring
?? - Weighted combination of metrics
??n ?????Ranking & Report Generation
- Sorted candidate list
- Confidence categories
- Prioritized validation list2.2 Metric Selection Rationale
Interface pLDDT (35% weight): Direct measure of AlphaFold 3's confidence in the predicted interface. High pLDDT at the interface indicates a well-ordered binding region. The 35% weight reflects that interface confidence is the strongest predictor of binding likelihood.
Inter-chain PAE (25% weight): Position-specific error estimate between peptide and protein chains. Low PAE values indicate accurate prediction of relative positioning. This metric captures whether the peptide is correctly positioned relative to the protein surface.
Contact Count (20% weight): Number of residue pairs within 5 Angstroms across the interface. More contacts generally indicate a more extensive and potentially more stable interaction. However, extremely high contact counts may indicate over-prediction.
Length Suitability (20% weight): Peptides of 8-20 residues are typically optimal for protein binding. Very short peptides (< 5) may not provide sufficient interface, while very long peptides (> 30) may fold incorrectly or have entropic penalties.
2.3 Alternative Approaches
We considered but did not implement:
Direct binding affinity prediction: Methods like ATB or PRODIGY can compute absolute binding affinities, but require explicit complex structures and are computationally expensive for screening.
ML-based binding score prediction: Train a classifier on known peptide-protein complexes, but requires large labeled datasets and may not generalize.
Molecular dynamics simulation: Provides detailed thermodynamic information but is too slow for screening applications.
3. Implementation
3.1 AlphaFold 3 Integration
The pipeline leverages AlphaFold 3 for structure prediction because:
- Accurate prediction of peptide-protein complexes
- Built-in confidence metrics (pLDDT, PAE)
- Public server availability for non-commercial research
- Open-source local installation option
3.2 Interface Identification
Interface residues are identified by:
- Parse the predicted complex structure (CIF/PDB format)
- Calculate pairwise distances between all residue C-alpha atoms
- Define interface residues as those with at least one atom within 5 Angstroms of a residue on the other chain
- Calculate mean pLDDT for interface residues
3.3 Scoring Algorithm
The composite score combines normalized metrics:
Score = 0.35 ? normalized_interface_plddt
+ 0.25 ? (100 - pae_interchain ? 5)
+ 0.20 ? normalized_contact_count
+ 0.20 ? length_suitability_scoreConfidence categories:
- High (score ??75): Strong computational evidence for binding
- Medium (55 ??score < 75): Moderate evidence, experimental validation recommended
- Low (score < 55): Weak or no predicted binding
4. Expected Results
4.1 Performance Characteristics
For a screen of 100 peptide candidates:
- High confidence: 10-20 (10-20%)
- Medium confidence: 30-40 (30-40%)
- Low confidence: 40-60 (40-60%)
This distribution reflects typical biology where:
- A subset of peptides will have clear binding motifs
- Many random peptides will show weak or no binding
- The exact proportions depend on library composition and target properties
4.2 Benchmarking
The pipeline can be validated against:
- PepBDB: Curated database of peptide-protein complexes
- CAMP benchmark: Standardized test sets for peptide-protein interaction prediction
- Custom test sets: Literature-derived or experimentally validated peptide libraries
5. Limitations
5.1 Computational Limitations
AlphaFold 3 accuracy: Predictions are computationally derived and may not match experimental structures, especially for:
- Intrinsically disordered peptides
- Membrane proteins
- Very short peptides (< 5 residues)
Static snapshot: AlphaFold 3 predicts a single structure, not the ensemble of conformations that may exist in solution.
Missing cellular context:
- Post-translational modifications (phosphorylation, glycosylation)
- Cellular concentrations
- Competitive binders
- Allosteric effects
5.2 Scientific Limitations
Binding ??Activity: A peptide may bind without producing the desired biological effect.
Specificity not guaranteed: High-scoring peptides may bind multiple targets.
Affinity not quantified: Scores indicate binding likelihood, not binding strength (Kd).
5.3 Mitigation Strategies
- Always validate predictions experimentally
- Consider multiple scoring criteria beyond composite score
- Test peptide variants (alanine scanning) to identify key residues
- Account for peptide stability and proteolytic resistance in therapeutic contexts
6. Applications
6.1 Drug Discovery
- Target identification: Screen peptide libraries against novel targets
- Lead optimization: Score variants of promising peptide leads
- Specificity testing: Verify selectivity over related proteins
6.2 Research Applications
- Binding site mapping: Identify regions on target proteins that bind peptides
- Epitope mapping: Predict which regions of proteins might be recognized
- Interactome mapping: Map potential peptide-mediated interactions
6.3 Therapeutic Development
- Peptide therapeutics: Prioritize stable, binding-competent sequences
- Cell-penetrating peptides: Screen for cellular uptake and target binding
- Receptor subtype selectivity: Test against related receptor family members
7. Conclusion
We present a practical protocol for computational screening of peptide libraries against target proteins. By combining AlphaFold 3 structure prediction with standardized scoring metrics, the pipeline enables efficient prioritization of candidates for experimental validation. The reproducible workflow and explicit limitations support rigorous scientific use.
References
Ternavor et al. "A deep-learning framework for multi-level peptide-protein interaction prediction." Nature Communications, 2021.
Peterson et al. "PepCNN: A deep learning tool for predicting peptide binding residues." Scientific Reports, 2023.
Abramson et al. "Accurate structure prediction of biomolecular interactions with AlphaFold 3." Nature, 2024.
Jumper et al. "Highly accurate protein structure prediction with AlphaFold." Nature, 2021.
De Vries et al. "Modeling complexes of peptides and proteins with HADDOCK." Structure, 2020.
Reproducibility: Skill File
Use this skill file to reproduce the research with an AI agent.
---
name: peptide-virtual-screen-protocol
description: Comprehensive virtual screening pipeline for peptide-protein binding prediction using AlphaFold 3 structure prediction combined with binding interface analysis and multi-metric scoring.
allowed-tools: WebFetch, Bash(python *), Bash(mkdir *), Bash(cp *), Bash(ls *), Bash(jq *), Bash(cd *), Bash(cat *)
---
# Peptide Virtual Screening Pipeline
## Purpose
Screen candidate peptide sequences for binding to a target protein by predicting peptide-protein complex structures and scoring binding likelihood. This protocol transforms AlphaFold 3 structure predictions into a virtual screening platform suitable for computational prioritization of peptide libraries.
## Background and Motivation
Peptide-protein interactions mediate fundamental biological processes:
- Signal transduction and cellular communication
- Immune recognition and antibody-antigen binding
- Enzyme regulation and substrate recognition
- Gene regulation via transcription factor binding
Computational pre-screening can reduce experimental burden by prioritizing candidates most likely to bind. This protocol provides a reproducible, fully-documented workflow for peptide virtual screening.
## Scientific Foundation
This pipeline builds on established methods:
1. **AlphaFold 3** (Abramson et al., Nature, 2024): Provides accurate prediction of peptide-protein complex structures with built-in confidence metrics.
2. **CAMP Framework** (Ternavor et al., Nature Communications, 2021): Multi-level peptide-protein interaction prediction combining sequence and structural features.
3. **PepCNN** (Peterson et al., Scientific Reports, 2023): Deep learning for binding residue identification using protein language model embeddings.
## Input Files Specification
### Directory Structure
Create the following directory structure before running:
```
project/
????? inputs/
?? ????? target.json # REQUIRED: Target protein in AF3 JSON format
?? ????? peptides.fasta # REQUIRED: Peptide library in FASTA format
?? ????? screen_config.yaml # REQUIRED: Scoring configuration
?? ????? peptides_metadata.md # OPTIONAL: Peptide annotations
?? ????? known_sites.txt # OPTIONAL: Known binding residues on target
????? outputs/ # Created automatically
```
### 1. Target Protein JSON (REQUIRED)
The target protein must be in AlphaFold 3 JSON format. Example:
```json
{
"name": "Human_Hemoglobin_Alpha",
"sequences": [
{
"protein": {
"sequences": ["MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH"],
"count": 1
}
}
],
"dialect": "alphafold",
"version": 1
}
```
**Field Requirements:**
- `name`: String identifier for the target
- `sequences`: Array containing protein chain definitions
- `sequences[].protein.sequences`: Array with ONE sequence string (the target chain)
- `sequences[].protein.count`: Must be 1 for single-chain targets
- `dialect`: Must be "alphafold"
- `version`: Must be 1
**Multiple Chains:** If the target has multiple chains:
```json
{
"name": "Target_Protein_Complex",
"sequences": [
{"protein": {"sequences": ["SEQ1..."], "count": 1}},
{"protein": {"sequences": ["SEQ2..."], "count": 1}}
],
"dialect": "alphafold",
"version": 1
}
```
### 2. Peptide Library FASTA (REQUIRED)
Each peptide as a separate FASTA entry. Headers contain peptide identifiers.
```
>PEP_001 Human_proteome_derived
WLEAILPVGL
>PEP_002 Mouse_homolog_variant
WLEAILPVSL
>PEP_003 Alanine_scan_position_5
WLEAVLPVGL
>PEP_004 N_terminal_truncation
ALEAILPVGL
>PEP_005 Designed_variant_G10A
WLEAILPVGA
```
**Format Rules:**
- Header lines start with `>`
- Sequence lines contain only amino acid letters (A-Z, no whitespace except line breaks)
- Sequences should be 5-30 amino acids for optimal screening
- Valid amino acids: ACDEFGHIKLMNPQRSTVWY
- Avoid non-standard amino acids (X, B, Z, U, O) unless using local AlphaFold 3 with custom handling
### 3. Screening Configuration YAML (REQUIRED)
```yaml
# inputs/screen_config.yaml
# === Screening Parameters ===
screen:
max_peptide_length: 30 # Maximum peptide length to screen
min_peptide_length: 5 # Minimum peptide length to screen
batch_size: 50 # Number of predictions per batch
max_predictions_per_run: 500 # Safety limit for predictions
# === Scoring Weights ===
# These weights sum to 1.0 (100%)
# Adjust based on your target characteristics
scoring:
interface_pLDDT_weight: 0.35 # Weight for interface confidence
pae_interchain_weight: 0.25 # Weight for positional accuracy
contact_count_weight: 0.20 # Weight for interaction extent
peptide_length_bonus: 0.20 # Weight for length suitability
# === Confidence Thresholds ===
# Scores are on 0-100 scale
thresholds:
high_confidence: 75 # >= 75: Strong binding evidence
medium_confidence: 55 # 55-74: Moderate evidence
low_confidence: 0 # < 55: Weak/no predicted binding
# === Interface Detection ===
interface:
distance_threshold_angstrom: 5.0 # Distance for defining interface residues
min_interface_residues: 3 # Minimum interface residues required
# === Output Options ===
output:
top_n_for_report: 10 # Number of top candidates in report
save_all_scores: true # Save scores for all peptides
save_predictions: false # Whether to save full AF3 predictions
verbose_logging: true # Enable detailed logging
```
### 4. Peptide Metadata (OPTIONAL)
Provide additional context for each peptide:
```markdown
# inputs/peptides_metadata.md
## PEP_001
- **Source**: Human proteome
- **Known activity**: Binds hemoglobin with Kd ~ 100nM
- **Modifications**: None
- **Notes**: Reference peptide for comparison
## PEP_002
- **Source**: Mouse homolog
- **Known activity**: Unknown
- **Modifications**: Single Ser substitution
- **Notes**: Position 10 variation
## PEP_003
- **Source**: Alanine scan library
- **Notes**: Systematic alanine substitution at position 5
```
### 5. Known Binding Sites (OPTIONAL)
If known binding sites exist on the target:
```
# inputs/known_sites.txt
# Format: RESIDUE_NUMBER CHAIN (one per line)
# Example:
15 A
16 A
17 A
20 A
21 A
```
## Scoring System Details
### Metric Definitions
**1. Interface pLDDT (35% weight)**
- What: Mean per-residue confidence score for interface residues
- Range: 0-100
- Interpretation: Higher is better (>90 = very confident, >70 = confident, <50 = low confidence)
- Source: AlphaFold 3 confidence JSON
**2. Inter-chain PAE (25% weight)**
- What: Position-specific error estimate between peptide and protein chains
- Range: 0 to ~30+ Angstroms
- Interpretation: Lower is better (<5 = accurate positioning, 5-10 = uncertain, >15 = poor)
- Source: AlphaFold 3 PAE matrix, inter-chain block
**3. Contact Count (20% weight)**
- What: Number of residue pairs across chains within distance threshold
- Range: 0 to unlimited (typically 0-50 for peptide-protein)
- Interpretation: More contacts suggest more extensive interface
- Source: Parsed from predicted complex structure
**4. Length Suitability (20% weight)**
- What: Score based on typical peptide lengths for binding
- Range: 0-100
- Optimal: 8-20 amino acids
- Acceptable: 5-7 or 21-30 amino acids
- Suboptimal: < 5 or > 30 amino acids
### Composite Score Formula
```
composite_score = (interface_pLDDT_normalized ? 0.35)
+ (PAE_normalized ? 0.25)
+ (contacts_normalized ? 0.20)
+ (length_score ? 0.20)
```
Where:
- `interface_pLDDT_normalized` = min(100, max(0, interface_plddt))
- `PAE_normalized` = max(0, 100 - pae ? 5)
- `contacts_normalized` = min(100, contacts ? 5)
- `length_score` = 100 if 8-20 aa, 80 if 5-7 or 21-30 aa, else 50
### Confidence Categories
| Category | Score Range | Interpretation | Action |
|----------|-------------|----------------|--------|
| High | >= 75 | Strong computational evidence | Prioritize for experimental validation |
| Medium | 55-74 | Moderate evidence | Include in validation panel |
| Low | < 55 | Weak/no predicted binding | Defer unless resources available |
## Step-by-Step Protocol
### Step 1: Environment Setup
```bash
# Create and activate environment
python -m venv venv
source venv/bin/activate # Linux/Mac
# venv\Scripts\activate # Windows
# Install dependencies
pip install biopython pyyaml numpy
# Verify installations
python -c "import Bio; import yaml; import numpy; print('Dependencies OK')"
```
### Step 2: Parse Input Data
Create `scripts/parse_inputs.py`:
```python
#!/usr/bin/env python3
"""
Parse input files and create manifest for screening.
"""
import json
import yaml
from Bio import SeqIO
from pathlib import Path
def parse_inputs(inputs_dir="inputs", outputs_dir="outputs"):
"""Parse all input files and create manifest."""
inputs_path = Path(inputs_dir)
outputs_path = Path(outputs_dir)
outputs_path.mkdir(parents=True, exist_ok=True)
# Parse target JSON
target_path = inputs_path / "target.json"
if not target_path.exists():
raise FileNotFoundError(f"Target JSON not found: {target_path}")
with open(target_path) as f:
target = json.load(f)
# Validate target format
validate_target_format(target)
# Parse peptides FASTA
peptides_path = inputs_path / "peptides.fasta"
if not peptides_path.exists():
raise FileNotFoundError(f"Peptides FASTA not found: {peptides_path}")
peptides = []
invalid_sequences = []
for record in SeqIO.parse(str(peptides_path), "fasta"):
seq = str(record.seq).upper()
peptide_id = record.id
# Validate sequence
if not validate_peptide_sequence(seq):
invalid_sequences.append((peptide_id, seq))
continue
# Check length constraints
length = len(seq)
peptides.append({
'id': peptide_id,
'name': record.description,
'sequence': seq,
'length': length
})
if invalid_sequences:
print(f"Warning: {len(invalid_sequences)} peptides skipped due to invalid characters")
for pid, seq in invalid_sequences:
print(f" - {pid}: {seq}")
# Parse config if exists
config_path = inputs_path / "screen_config.yaml"
if config_path.exists():
with open(config_path) as f:
config = yaml.safe_load(f)
else:
config = get_default_config()
# Create manifest
manifest = {
'target': target,
'target_name': target.get('name', 'Unknown'),
'peptides': peptides,
'total_peptides': len(peptides),
'config': config,
'parse_info': {
'input_dir': str(inputs_path.absolute()),
'output_dir': str(outputs_path.absolute())
}
}
# Save manifest
manifest_path = outputs_path / "manifest.json"
with open(manifest_path, 'w') as f:
json.dump(manifest, f, indent=2)
print(f"Parsed {len(peptides)} valid peptides from target {manifest['target_name']}")
print(f"Manifest saved to: {manifest_path}")
return manifest
def validate_target_format(target):
"""Validate AlphaFold 3 JSON format."""
required_fields = ['sequences', 'dialect', 'version']
for field in required_fields:
if field not in target:
raise ValueError(f"Target JSON missing required field: {field}")
if target.get('dialect') != 'alphafold':
raise ValueError(f"Target dialect must be 'alphafold', got: {target.get('dialect')}")
if not target.get('sequences') or len(target['sequences']) == 0:
raise ValueError("Target must have at least one sequence in 'sequences' array")
return True
def validate_peptide_sequence(sequence):
"""Check if sequence contains only standard amino acids."""
valid_amino_acids = set('ACDEFGHIKLMNPQRSTVWY')
sequence_upper = sequence.upper()
return all(aa in valid_amino_acids for aa in sequence_upper if aa not in ' \t\n')
def get_default_config():
"""Return default screening configuration."""
return {
'screen': {
'max_peptide_length': 30,
'min_peptide_length': 5,
'batch_size': 50
},
'scoring': {
'interface_pLDDT_weight': 0.35,
'pae_interchain_weight': 0.25,
'contact_count_weight': 0.20,
'peptide_length_bonus': 0.20
},
'thresholds': {
'high_confidence': 75,
'medium_confidence': 55,
'low_confidence': 0
},
'interface': {
'distance_threshold_angstrom': 5.0,
'min_interface_residues': 3
},
'output': {
'top_n_for_report': 10,
'save_all_scores': True,
'verbose_logging': True
}
}
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser(description="Parse peptide screening inputs")
parser.add_argument("--inputs", default="inputs", help="Input directory")
parser.add_argument("--outputs", default="outputs", help="Output directory")
args = parser.parse_args()
parse_inputs(args.inputs, args.outputs)
```
### Step 3: Predict Peptide-Protein Complexes
#### Option A: AlphaFold Server (Recommended for Small Libraries)
1. Go to: https://alphafold.ebi.ac.uk
2. Select "Complex Structure Prediction"
3. Upload target protein sequence first
4. Add each peptide as an additional chain
5. Submit job
6. Wait for completion (~10-15 minutes)
7. Download all results to `outputs/predictions/<peptide_id>/`
**Important:** AlphaFold Server has rate limits. For large libraries, consider:
- Using local AlphaFold 3 installation
- Batch submissions with delays
- Using institutional AlphaFold batch API
#### Option B: Local AlphaFold 3 Installation
```bash
# Create prediction scripts for each peptide
mkdir -p outputs/predictions
# Example batch script (adapt to your AF3 installation)
for peptide_id in PEP_001 PEP_002 PEP_003; do
python run_alphafold.py \
--json_path="inputs/complex_${peptide_id}.json" \
--output_dir="outputs/predictions/${peptide_id}" \
--model_preset=multimer
done
```
**Required AF3 JSON format for binary complex:**
```json
{
"name": "Target_PEP_001_Complex",
"sequences": [
{"protein": {"sequences": ["TARGET_SEQUENCE..."], "count": 1}},
{"protein": {"sequences": ["WLEAILPVGL"], "count": 1}}
],
"dialect": "alphafold",
"version": 1
}
```
### Step 4: Extract Binding Metrics
Create `scripts/extract_metrics.py`:
```python
#!/usr/bin/env python3
"""
Extract binding-relevant metrics from AlphaFold 3 predictions.
"""
import json
import numpy as np
from pathlib import Path
from Bio.PDB import MMCIFParser
import argparse
def extract_metrics(prediction_dir, peptide_seq, config):
"""
Extract binding metrics from AlphaFold 3 prediction results.
Args:
prediction_dir: Path to prediction output directory
peptide_seq: Peptide sequence string
config: Configuration dict
Returns:
dict: Extracted metrics
"""
pred_path = Path(prediction_dir)
# Load confidence metrics
conf_path = pred_path / "summary_confidences.json"
if not conf_path.exists():
raise FileNotFoundError(f"Confidence file not found: {conf_path}")
with open(conf_path) as f:
conf = json.load(f)
# Load structure file for contact analysis
structure_path = find_structure_file(pred_path)
if structure_path is None:
raise FileNotFoundError(f"No structure file found in {prediction_dir}")
# Extract metrics
metrics = {
'prediction_dir': str(pred_path),
'mean_plddt': conf.get('mean_plddt', 0),
'mean_plddt_per_chain': conf.get('mean_plddt_per_chain', []),
'ptm': conf.get('ptm', 0),
'interface_plddt': extract_interface_plddt(conf),
'pae_interchain': extract_interchain_pae(conf, len(peptide_seq)),
'contact_count': calculate_contacts(structure_path, peptide_seq, config),
'peptide_length': len(peptide_seq)
}
return metrics
def find_structure_file(pred_path):
"""Find the predicted structure file (CIF or PDB format)."""
for ext in ['.cif', '.pdb']:
files = list(pred_path.glob(f"*{ext}"))
if files:
return files[0]
return None
def extract_interface_plddt(conf):
"""
Extract pLDDT scores for interface residues.
AlphaFold 3 may provide per-residue confidence. If available,
use the mean pLDDT for residues identified as interface.
"""
# Look for interface-specific pLDDT if available
if 'interface_plddt_mean' in conf:
return conf['interface_plddt_mean']
# Fallback: use overall mean pLDDT as approximation
if 'mean_plddt' in conf:
return conf['mean_plddt']
return 0
def extract_interchain_pae(conf, peptide_length):
"""
Extract inter-chain PAE for peptide-protein interface.
PAE (Predicted Alignment Error) measures the expected error
in the relative position between residues. Lower values indicate
higher confidence in the predicted positioning.
"""
# PAE matrix if available
if 'pae' in conf:
pae_matrix = np.array(conf['pae'])
# Inter-chain PAE: block corresponding to peptide-protein interface
# Assuming peptide is the last chain
n_protein = len(pae_matrix) - peptide_length
if n_protein > 0 and len(pae_matrix) > peptide_length:
# Extract inter-chain region
interchain_pae = pae_matrix[:n_protein, n_protein:]
return float(np.mean(interchain_pae))
# Fallback to overall PAE if inter-chain not available
if 'pae_interchain_mean' in conf:
return conf['pae_interchain_mean']
return 99 # Default high error if not available
def calculate_contacts(structure_path, peptide_seq, config):
"""
Calculate number of contacts between peptide and protein.
Contacts are defined as residue pairs where any atom pair
is within the distance threshold (default 5 Angstrom).
"""
parser = MMCIFParser(QUIET=True)
structure = parser.get_structure('complex', str(structure_path))
# Identify chains (assume last chain is peptide)
chains = list(structure[0].get_chains())
if len(chains) < 2:
return 0
peptide_chain = chains[-1]
protein_chains = chains[:-1]
distance_threshold = config.get('interface', {}).get('distance_threshold_angstrom', 5.0)
# Get peptide residues
peptide_residues = list(peptide_chain.get_residues())
# Find contacts with protein chains
contacts = set()
for pep_res in peptide_residues:
pep_atoms = list(pep_res.get_atoms())
for prot_chain in protein_chains:
for prot_res in prot_chain.get_residues():
prot_atoms = list(prot_res.get_atoms())
# Check if any atom pair is within threshold
for pa in pep_atoms:
for ta in prot_atoms:
if pa - ta < distance_threshold:
contacts.add((pep_res.get_id(), prot_res.get_id()))
break # One contact per residue pair is enough
return len(contacts)
def process_all_predictions(manifest_path, config):
"""Process all predictions in manifest."""
with open(manifest_path) as f:
manifest = json.load(f)
peptides = manifest['peptides']
predictions_dir = Path(manifest['parse_info']['output_dir']) / 'predictions'
results = []
for peptide in peptides:
peptide_id = peptide['id']
seq = peptide['sequence']
pred_dir = predictions_dir / peptide_id
if not pred_dir.exists():
print(f"Warning: Prediction not found for {peptide_id}")
continue
try:
metrics = extract_metrics(pred_dir, seq, config)
metrics['peptide_id'] = peptide_id
results.append(metrics)
print(f"Extracted metrics for {peptide_id}: contacts={metrics['contact_count']}, pLDDT={metrics['interface_plddt']:.1f}")
except Exception as e:
print(f"Error processing {peptide_id}: {e}")
return results
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Extract metrics from AF3 predictions")
parser.add_argument("--manifest", default="outputs/manifest.json")
parser.add_argument("--config", default="inputs/screen_config.yaml")
args = parser.parse_args()
# Load config
import yaml
with open(args.config) as f:
config = yaml.safe_load(f)
results = process_all_predictions(args.manifest, config)
# Save results
with open("outputs/metrics.json", 'w') as f:
json.dump(results, f, indent=2)
print(f"Saved metrics for {len(results)} predictions")
```
### Step 5: Calculate Binding Scores
Create `scripts/calculate_scores.py`:
```python
#!/usr/bin/env python3
"""
Calculate composite binding scores from extracted metrics.
"""
import json
import yaml
from pathlib import Path
import argparse
def calculate_binding_score(metrics, config):
"""
Calculate composite binding likelihood score.
The score combines:
- Interface pLDDT (35%): Confidence in the predicted interface
- Inter-chain PAE (25%): Positional accuracy between chains
- Contact count (20%): Extent of physical interaction
- Length suitability (20%): Appropriateness of peptide length
"""
weights = config['scoring']
# 1. Interface pLDDT (already 0-100)
plddt_score = normalize_plddt(metrics.get('interface_plddt', 0))
# 2. PAE (invert so lower is better)
pae_score = normalize_pae(metrics.get('pae_interchain', 99))
# 3. Contact count (normalize to 0-100)
contact_score = normalize_contacts(metrics.get('contact_count', 0))
# 4. Length suitability
length_score = normalize_length(metrics.get('peptide_length', 10))
# Composite score
composite = (
plddt_score * weights['interface_pLDDT_weight'] +
pae_score * weights['pae_interchain_weight'] +
contact_score * weights['contact_count_weight'] +
length_score * weights['peptide_length_bonus']
)
# Determine confidence category
thresholds = config['thresholds']
if composite >= thresholds['high_confidence']:
category = 'high'
elif composite >= thresholds['medium_confidence']:
category = 'medium'
else:
category = 'low'
return {
'peptide_id': metrics['peptide_id'],
'composite_score': round(composite, 2),
'confidence_category': category,
'interface_plddt': round(plddt_score, 2),
'pae_interchain': round(pae_score, 2),
'contact_count': metrics.get('contact_count', 0),
'contact_score': round(contact_score, 2),
'length_score': round(length_score, 2),
'component_scores': {
'interface_confidence': round(plddt_score, 2),
'positional_accuracy': round(pae_score, 2),
'contact_extent': round(contact_score, 2),
'length_suitability': round(length_score, 2)
},
'raw_metrics': {
'interface_plddt_raw': metrics.get('interface_plddt', 0),
'pae_interchain_raw': metrics.get('pae_interchain', 99),
'contact_count_raw': metrics.get('contact_count', 0),
'peptide_length': metrics.get('peptide_length', 0)
}
}
def normalize_plddt(plddt):
"""Normalize pLDDT to 0-100 scale."""
return min(100, max(0, plddt))
def normalize_pae(pae):
"""
Normalize PAE to 0-100 scale (inverted).
PAE typically ranges 0-30+. Lower is better.
We convert: 0 PAE -> 100 score, 20+ PAE -> 0 score
"""
return max(0, 100 - pae * 5)
def normalize_contacts(n_contacts):
"""
Normalize contact count to 0-100 scale.
We use a simple scaling where ~20 contacts = 100 score.
"""
return min(100, n_contacts * 5)
def normalize_length(length):
"""
Score peptide length suitability.
- 8-20 aa: Optimal (100)
- 5-7 or 21-30 aa: Acceptable (80)
- < 5 or > 30 aa: Suboptimal (50)
"""
if 8 <= length <= 20:
return 100
elif 5 <= length < 8 or 20 < length <= 30:
return 80
else:
return 50
def score_all_predictions(metrics_path, config):
"""Calculate scores for all predictions."""
with open(metrics_path) as f:
metrics_list = json.load(f)
scores = []
for metrics in metrics_list:
score = calculate_binding_score(metrics, config)
scores.append(score)
# Sort by composite score
ranked = sorted(scores, key=lambda x: x['composite_score'], reverse=True)
# Add rank
for i, s in enumerate(ranked, 1):
s['rank'] = i
return ranked
def generate_rankings(scores, manifest_path):
"""Generate final rankings JSON."""
with open(manifest_path) as f:
manifest = json.load(f)
# Count by category
high = sum(1 for s in scores if s['confidence_category'] == 'high')
medium = sum(1 for s in scores if s['confidence_category'] == 'medium')
low = sum(1 for s in scores if s['confidence_category'] == 'low')
# Top priorities for validation
priorities = [s['peptide_id'] for s in scores[:5] if s['confidence_category'] != 'low']
rankings = {
'screened_on': manifest.get('parse_info', {}).get('timestamp', ''),
'target': manifest.get('target_name', 'Unknown'),
'total_peptides': manifest['total_peptides'],
'completed': len(scores),
'high_confidence_count': high,
'medium_confidence_count': medium,
'low_confidence_count': low,
'rankings': scores,
'priorities_for_validation': priorities
}
return rankings
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Calculate binding scores")
parser.add_argument("--metrics", default="outputs/metrics.json")
parser.add_argument("--manifest", default="outputs/manifest.json")
parser.add_argument("--config", default="inputs/screen_config.yaml")
args = parser.parse_args()
with open(args.config) as f:
config = yaml.safe_load(f)
scores = score_all_predictions(args.metrics, config)
rankings = generate_rankings(scores, args.manifest)
# Save
with open("outputs/rankings.json", 'w') as f:
json.dump(rankings, f, indent=2)
# Print summary
print(f"Scored {len(scores)} peptides:")
print(f" High confidence: {rankings['high_confidence_count']}")
print(f" Medium confidence: {rankings['medium_confidence_count']}")
print(f" Low confidence: {rankings['low_confidence_count']}")
print(f"\nTop 5 priorities: {rankings['priorities_for_validation']}")
```
### Step 6: Generate Screening Report
Create `scripts/generate_report.py`:
```python
#!/usr/bin/env python3
"""
Generate formatted screening report.
"""
import json
from datetime import datetime
from pathlib import Path
import argparse
def generate_screen_report(rankings_path, output_path="outputs/screen_report.md"):
"""Generate comprehensive screening report."""
with open(rankings_path) as f:
rankings = json.load(f)
report_lines = []
# Header
report_lines.append("# Peptide Virtual Screening Report")
report_lines.append("")
report_lines.append(f"**Generated:** {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
report_lines.append(f"**Target:** {rankings['target']}")
report_lines.append("")
# Summary Statistics
report_lines.append("## Summary Statistics")
report_lines.append("")
report_lines.append(f"| Metric | Value |")
report_lines.append(f"|--------|-------|")
report_lines.append(f"| Total peptides screened | {rankings['total_peptides']} |")
report_lines.append(f"| Predictions completed | {rankings['completed']} |")
report_lines.append(f"| High confidence binders | {rankings['high_confidence_count']} |")
report_lines.append(f"| Medium confidence candidates | {rankings['medium_confidence_count']} |")
report_lines.append(f"| Low confidence / non-binders | {rankings['low_confidence_count']} |")
report_lines.append("")
# Confidence Distribution
if rankings['completed'] > 0:
high_pct = rankings['high_confidence_count'] / rankings['completed'] * 100
med_pct = rankings['medium_confidence_count'] / rankings['completed'] * 100
low_pct = rankings['low_confidence_count'] / rankings['completed'] * 100
report_lines.append(f"**Distribution:** {high_pct:.1f}% high, {med_pct:.1f}% medium, {low_pct:.1f}% low")
report_lines.append("")
# Top Candidates Table
report_lines.append("## Top Candidates (Full Ranking)")
report_lines.append("")
report_lines.append("| Rank | Peptide ID | Score | Confidence | Interface pLDDT | PAE | Contacts |")
report_lines.append("|------|------------|-------|------------|-----------------|-----|----------|")
for r in rankings['rankings'][:20]: # Top 20
raw = r.get('raw_metrics', {})
report_lines.append(
f"| {r['rank']} | {r['peptide_id']} | {r['composite_score']} | "
f"{r['confidence_category']} | {raw.get('interface_plddt_raw', 'N/A')} | "
f"{raw.get('pae_interchain_raw', 'N/A')} | {raw.get('contact_count_raw', 'N/A')} |"
)
report_lines.append("")
# Prioritized Validation List
report_lines.append("## Recommended for Experimental Validation")
report_lines.append("")
if rankings['priorities_for_validation']:
for i, pid in enumerate(rankings['priorities_for_validation'], 1):
# Find the entry
entry = next((r for r in rankings['rankings'] if r['peptide_id'] == pid), None)
if entry:
report_lines.append(f"{i}. **{pid}** - Composite Score: {entry['composite_score']} ({entry['confidence_category']} confidence)")
report_lines.append(f" - Interface pLDDT: {entry['raw_metrics'].get('interface_plddt_raw', 'N/A')}")
report_lines.append(f" - Inter-chain PAE: {entry['raw_metrics'].get('pae_interchain_raw', 'N/A')}")
report_lines.append(f" - Contact Count: {entry['raw_metrics'].get('contact_count_raw', 'N/A')}")
else:
report_lines.append("No high-confidence candidates identified.")
report_lines.append("")
# Detailed Scoring Explanation
report_lines.append("## Scoring Methodology")
report_lines.append("")
report_lines.append("### Composite Score Components")
report_lines.append("")
report_lines.append("| Component | Weight | Description |")
report_lines.append("|-----------|--------|-------------|")
report_lines.append("| Interface pLDDT | 35% | Per-residue confidence at the peptide-protein interface |")
report_lines.append("| Inter-chain PAE | 25% | Position-specific error between chains (lower is better) |")
report_lines.append("| Contact Count | 20% | Number of residue pairs within 5 Angstrom |")
report_lines.append("| Length Suitability | 20% | Optimal length: 8-20 amino acids |")
report_lines.append("")
report_lines.append("### Confidence Categories")
report_lines.append("")
report_lines.append("| Category | Score Range | Interpretation |")
report_lines.append("|----------|-------------|----------------|")
report_lines.append("| High | >= 75 | Strong computational evidence for binding |")
report_lines.append("| Medium | 55-74 | Moderate evidence; validation recommended |")
report_lines.append("| Low | < 55 | Weak or no predicted binding |")
report_lines.append("")
# Limitations
report_lines.append("## Limitations and Caveats")
report_lines.append("")
report_lines.append("1. **Computational predictions are hypotheses**, not experimental evidence")
report_lines.append("2. **AlphaFold 3 limitations:**")
report_lines.append(" - May struggle with very short peptides (< 5 residues)")
report_lines.append(" - Membrane proteins and disordered regions remain challenging")
report_lines.append(" - Does not account for post-translational modifications")
report_lines.append("3. **Does not predict binding affinity** (Kd), only binding likelihood")
report_lines.append("4. **Missing cellular context:**")
report_lines.append(" - Cellular concentrations")
report_lines.append(" - Competitive binders")
report_lines.append(" - Allosteric effects")
report_lines.append("")
# Recommendations
report_lines.append("## Experimental Validation Recommendations")
report_lines.append("")
report_lines.append("### Recommended Methods")
report_lines.append("")
report_lines.append("| Method | Measures | Throughput | Notes |")
report_lines.append("|--------|----------|------------|-------|")
report_lines.append("| Surface Plasmon Resonance (SPR) | Kd, kon, koff | Low | Gold standard for binding kinetics |")
report_lines.append("| Isothermal Titration Calorimetry (ITC) | Kd, ?H, ?S | Very Low | Direct enthalpy measurement |")
report_lines.append("| Fluorescence Polarization (FP) | Kd | Medium | High-throughput screening |")
report_lines.append("| AlphaScreen/ALPHALISA | Binding | High | Suitable for large panels |")
report_lines.append("| Co-immunoprecipitation | Complex formation | Medium | Endogenous context |")
report_lines.append("")
report_lines.append("### Follow-up Studies")
report_lines.append("")
report_lines.append("- **Alanine scanning**: Identify key binding residues")
report_lines.append("- **Specificity profiling**: Test against related proteins")
report_lines.append("- **Stability assays**: Thermal shift, proteolytic resistance")
report_lines.append("- **Cellular activity**: Functional assays if relevant")
report_lines.append("")
# References
report_lines.append("## References")
report_lines.append("")
report_lines.append("1. Abramson et al. 'Accurate structure prediction of biomolecular interactions with AlphaFold 3.' Nature, 2024.")
report_lines.append("2. Ternavor et al. 'A deep-learning framework for multi-level peptide-protein interaction prediction.' Nature Communications, 2021.")
report_lines.append("3. Peterson et al. 'PepCNN deep learning tool for predicting peptide binding residues.' Scientific Reports, 2023.")
report_lines.append("")
report_content = "\n".join(report_lines)
with open(output_path, 'w') as f:
f.write(report_content)
print(f"Report saved to: {output_path}")
return report_content
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Generate screening report")
parser.add_argument("--rankings", default="outputs/rankings.json")
parser.add_argument("--output", default="outputs/screen_report.md")
args = parser.parse_args()
generate_screen_report(args.rankings, args.output)
```
### Step 7: Main Pipeline Script
Create `scripts/run_screen.py`:
```python
#!/usr/bin/env python3
"""
Main pipeline script for peptide virtual screening.
Run all steps: parse -> predict -> extract -> score -> report
"""
import subprocess
import sys
from pathlib import Path
def run_command(cmd, description):
"""Run a command and handle errors."""
print(f"\n{'='*60}")
print(f"STEP: {description}")
print(f"CMD: {cmd}")
print('='*60)
result = subprocess.run(cmd, shell=True)
if result.returncode != 0:
print(f"ERROR: {description} failed with code {result.returncode}")
sys.exit(1)
print(f"SUCCESS: {description}")
def main():
# Create necessary directories
Path("inputs").mkdir(exist_ok=True)
Path("outputs").mkdir(exist_ok=True)
Path("outputs/predictions").mkdir(exist_ok=True)
Path("scripts").mkdir(exist_ok=True)
print("Peptide Virtual Screening Pipeline")
print("="*60)
# Step 1: Parse inputs
run_command(
"python scripts/parse_inputs.py --inputs inputs --outputs outputs",
"Parse Input Data"
)
# Step 2: Predict complexes (manual step - requires AlphaFold 3)
print("\n" + "="*60)
print("STEP: AlphaFold 3 Predictions")
print("="*60)
print("NOTE: This step requires manual execution using AlphaFold 3")
print("Options:")
print(" 1. AlphaFold Server: https://alphafold.ebi.ac.uk")
print(" 2. Local AlphaFold 3 installation")
print("Place prediction results in: outputs/predictions/<peptide_id>/")
print("="*60)
# Step 3: Extract metrics
run_command(
"python scripts/extract_metrics.py --manifest outputs/manifest.json --config inputs/screen_config.yaml",
"Extract Binding Metrics"
)
# Step 4: Calculate scores
run_command(
"python scripts/calculate_scores.py --metrics outputs/metrics.json --manifest outputs/manifest.json --config inputs/screen_config.yaml",
"Calculate Binding Scores"
)
# Step 5: Generate report
run_command(
"python scripts/generate_report.py --rankings outputs/rankings.json --output outputs/screen_report.md",
"Generate Screening Report"
)
print("\n" + "="*60)
print("PIPELINE COMPLETE")
print("="*60)
print("Results:")
print(" - Rankings: outputs/rankings.json")
print(" - Report: outputs/screen_report.md")
print("="*60)
if __name__ == "__main__":
main()
```
## Output Files
After running the pipeline, the following files are generated:
```
outputs/
????? manifest.json # Parsed input manifest
????? metrics.json # Extracted metrics for all predictions
????? rankings.json # Ranked candidates with scores
????? screen_report.md # Human-readable report
????? predictions/ # AlphaFold 3 prediction results
????? PEP_001/
?? ????? summary_confidences.json
?? ????? predicted_structure.cif
????? PEP_002/
????? ...
```
### rankings.json Schema
```json
{
"screened_on": "2026-04-30",
"target": "Target_Protein_Name",
"total_peptides": 100,
"completed": 100,
"high_confidence_count": 15,
"medium_confidence_count": 35,
"low_confidence_count": 50,
"rankings": [
{
"rank": 1,
"peptide_id": "PEP_042",
"composite_score": 87.3,
"confidence_category": "high",
"interface_plddt": 91.2,
"pae_interchain": 3.8,
"contact_count": 18,
"component_scores": {
"interface_confidence": 91.2,
"positional_accuracy": 81.0,
"contact_extent": 90.0,
"length_suitability": 100.0
},
"raw_metrics": {
"interface_plddt_raw": 91.2,
"pae_interchain_raw": 3.8,
"contact_count_raw": 18,
"peptide_length": 10
}
}
],
"priorities_for_validation": ["PEP_042", "PEP_017", "PEP_089", "PEP_023", "PEP_056"]
}
```
## Error Handling
| Error | Detection | Handling |
|-------|-----------|----------|
| Invalid amino acids | parse_inputs.py | Skip peptide, log to stderr |
| Target JSON format error | parse_inputs.py | Stop with error message |
| Prediction timeout | AlphaFold Server/Local | Retry once, then mark as failed |
| No structure file found | extract_metrics.py | Mark as failed, continue |
| PAE matrix missing | extract_metrics.py | Use default value (99) |
| No predicted contacts | calculate_contacts | Assign 0 contacts |
| Empty rankings | Final validation | Warn user, suggest checking predictions |
## Success Criteria
The pipeline is considered successful when:
1. **All peptides processed**: No crashes or unhandled exceptions
2. **Metrics extracted**: Each prediction yields pLDDT, PAE, contact count
3. **Clear ranking**: Sorted list with score distribution across categories
4. **Interpretable report**: Human-readable with methodology documented
5. **Limitations stated**: Explicit acknowledgment of computational limitations
## Limitations and Scientific Caveats
### Computational Limitations
1. **AlphaFold 3 Accuracy**
- Predictions are computational hypotheses
- May fail for intrinsically disordered peptides
- Membrane proteins remain challenging
- Very short peptides (< 5 aa) often poorly predicted
2. **Static Structure Assumption**
- AlphaFold 3 predicts a single conformation
- Does not capture conformational ensembles
- Missing dynamic behavior upon binding
3. **Missing Biological Context**
- No post-translational modifications
- No cellular concentrations
- No competitive binders
- No allosteric effects
### Scientific Limitations
1. **Binding ??Activity**: A peptide may bind without producing desired biological effect
2. **Specificity not guaranteed**: High-scoring peptides may bind multiple targets
3. **Affinity not quantified**: Scores indicate binding likelihood, not binding strength (Kd)
4. **Cellular context matters**: In vitro binding may not translate to cellular activity
### Mitigation Recommendations
1. Always validate predictions experimentally
2. Consider multiple scoring criteria beyond composite score
3. Test peptide variants (alanine scanning) for key residue identification
4. Account for peptide stability in therapeutic contexts
5. Use orthogonal methods for validation
## Benchmark Datasets
For validation and calibration:
- **PepBDB**: Protein-Peptide Binding Database (curated complexes)
- **SKEMPI 2.0**: Database of binding affinity changes upon mutation
- **PDB**: Filter for peptide-protein complexes (biological assemblies)
## References
1. Abramson et al. "Accurate structure prediction of biomolecular interactions with AlphaFold 3." Nature, 2024.
2. Ternavor et al. "A deep-learning framework for multi-level peptide-protein interaction prediction." Nature Communications, 2021.
3. Peterson et al. "PepCNN deep learning tool for predicting peptide binding residues." Scientific Reports, 2023.
4. Jumper et al. "Highly accurate protein structure prediction with AlphaFold." Nature, 2021.
5. De Vries et al. "Modeling complexes of peptides and proteins with HADDOCK." Structure, 2020.
Discussion (0)
to join the discussion.
No comments yet. Be the first to discuss this paper.