PhasonFold: Multi-Scale Geometric Certificates for Auditable Protein-Folding Dynamics
Introduction
Protein structure prediction has been revolutionized by deep-learning methods such as AlphaFold2 and ESMFold, yet these models function as black-box oracles: they output coordinates without an auditable trace explaining why a particular fold is favored. We ask a complementary question: can one construct a geometric certificate — a compact, verifiable summary — that detects native-like backbone order without training on sequence data?
PhasonFold answers affirmatively by exploiting the mathematical structure of icosahedral quasicrystals. Real protein backbones, when lifted to a 6-dimensional integer lattice via oracle direction quantization, exhibit measurably lower symbolic entropy than correlation-destroying null controls. This signal is captured by two independent binary readouts whose near-zero cross-correlation demonstrates that the detected order is multi-faceted rather than an artifact of a single projection.
The entire pipeline is packaged as an executable skill (SKILL.md) reproducible by AI agents, satisfying the Claw4S requirement of full auditability from input PDB files to final effect-size tables.
Method
6D Embedding via Oracle Direction Quantization
Given a protein backbone as a sequence of C-alpha coordinates r_1, …, r_N ∈ ℝ³, we compute displacement vectors d_i = r_{i+1} − r_i and quantize each into one of the 32 directions of the icosahedral triple232 alphabet. Each quantized direction maps to a step in ℤ⁶ via the standard icosahedral projection framework, yielding a 6D integer walk.
The perpendicular-space component (the projection onto the 3D orthogonal complement of physical space in 6D) encodes phason deviations — departures from perfect quasicrystalline order.
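The lift can be sketched as follows, under a simplifying assumption: instead of the 32-direction triple232 alphabet (whose definition lives in the repository), this sketch quantizes each displacement to the nearest of the 12 signed fivefold icosahedral axes, so that every step is plus or minus a unit vector of ℤ⁶. The function names `icosahedral_axes`, `lift_to_6d`, and `perp_projector` are illustrative and are not the repository's API.

```python
import numpy as np

TAU = (1 + 5 ** 0.5) / 2  # golden ratio


def icosahedral_axes():
    """Six fivefold axes of the icosahedron as unit vectors, shape (6, 3)."""
    axes = np.array(
        [[0, 1, TAU], [0, -1, TAU], [1, TAU, 0],
         [-1, TAU, 0], [TAU, 0, 1], [TAU, 0, -1]],
        dtype=float,
    )
    return axes / np.linalg.norm(axes, axis=1, keepdims=True)


def lift_to_6d(ca):
    """Quantize C-alpha displacements to signed axes; return (steps, walk).

    ASSUMPTION: nearest signed fivefold axis stands in for triple232.
    """
    axes = icosahedral_axes()
    disp = np.diff(np.asarray(ca, dtype=float), axis=0)   # (N-1, 3)
    scores = disp @ axes.T                                # (N-1, 6)
    idx = np.argmax(np.abs(scores), axis=1)               # best-aligned axis
    rows = np.arange(len(disp))
    steps = np.zeros((len(disp), 6), dtype=int)
    steps[rows, idx] = np.sign(scores[rows, idx]).astype(int)
    # The walk starts at the origin of Z^6 and accumulates the steps.
    walk = np.vstack([np.zeros((1, 6), dtype=int), np.cumsum(steps, axis=0)])
    return steps, walk


def perp_projector():
    """6x6 orthogonal projector onto the complement of physical space."""
    E = icosahedral_axes().T                              # (3, 6)
    P_par = E.T @ np.linalg.inv(E @ E.T) @ E
    return np.eye(6) - P_par
```

Applying `perp_projector()` to each row of the walk gives the perpendicular-space component; its RMS norm along the chain is one possible phason-deviation summary.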
Multi-Scale Symbolic Certificates
From the 6D walk we extract two binary readouts:
- ρ_A (PCA-geometry): Project perpendicular-space coordinates onto their first principal component; threshold at the median to obtain a binary sequence.
- ρ_B (parity-dynamics): Compute a parity function on the 6D step stream to obtain a second binary sequence.
For each binary sequence and window length m, we compute:
- Type entropy H(type): Shannon entropy of the empirical distribution over the 2^m possible m-grams.
- Sliding-mean-binary entropy rate ĥ_SMB: an entropy-rate proxy capturing sequential predictability.
Lower entropy indicates more structured (less random) symbolic sequences.
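As a rough illustration of the PCA-geometry readout and the type entropy, the sketch below binarizes perpendicular-space coordinates at the median of their first principal component and computes the Shannon entropy of the observed m-grams. The names `pca_median_readout` and `type_entropy` are hypothetical. The parity function and the exact definition of ĥ_SMB are not given in the text, so they are left out rather than guessed.

```python
from collections import Counter

import numpy as np


def pca_median_readout(perp_coords):
    """Binary sequence: 1 where the first-PC score exceeds its median."""
    X = np.asarray(perp_coords, dtype=float)
    X = X - X.mean(axis=0)                       # center before PCA
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    scores = X @ vt[0]                           # projection on first PC
    return (scores > np.median(scores)).astype(int)


def type_entropy(bits, m):
    """Shannon entropy (in bits) of the empirical m-gram distribution."""
    grams = [tuple(bits[i:i + m]) for i in range(len(bits) - m + 1)]
    counts = np.array(list(Counter(grams).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())
```

A constant sequence has type entropy 0, and the maximum for window length m is m bits, reached when all 2^m m-grams are equally frequent.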
Null Controls
We compare each real chain against four null models that progressively destroy geometric correlations:
- Random: Uniform random walks in ℤ⁶.
- Perturbed: Real walks with added Gaussian noise.
- Shuffle: Random permutation of real step vectors (destroys sequential order, preserves marginals).
- Block-shuffle (k ∈ {4, 8, 16}): Permute blocks of k consecutive steps (preserves short-range correlations up to scale k).
Cliff's delta δ quantifies effect size between real and null entropy distributions; δ < 0 means real chains are more ordered.
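The shuffle and block-shuffle nulls and Cliff's delta can be sketched as below. This is a minimal illustration, not the repository's implementation: the function names are hypothetical, and the handling of a final block shorter than k (here it is permuted along with the full blocks) is an assumption.

```python
import numpy as np


def shuffle_null(steps, rng):
    """Randomly permute step vectors: keeps marginals, destroys order."""
    return rng.permutation(np.asarray(steps), axis=0)


def block_shuffle_null(steps, k, rng):
    """Permute blocks of k consecutive steps, keeping order inside a block.

    ASSUMPTION: a trailing block shorter than k is shuffled like the others.
    """
    steps = np.asarray(steps)
    blocks = [steps[i:i + k] for i in range(0, len(steps), k)]
    order = rng.permutation(len(blocks))
    return np.concatenate([blocks[j] for j in order], axis=0)


def cliffs_delta(x, y):
    """Cliff's delta: P(x > y) - P(x < y) over all pairs (x_i, y_j)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sign(x[:, None] - y[None, :]).mean())
```

With `x` holding real-chain entropies and `y` holding null entropies, a value of −1 means every real value lies below every null value, and 0 means the two samples overlap symmetrically.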
Projector Repair and Physical Realizability
Every 6D walk is verified for physical realizability: bond lengths must fall within a fixed tolerance window (in Angstroms) and the perpendicular-space RMS deviation must remain bounded. Chains failing these checks are flagged and excluded from aggregate statistics.
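A realizability check of this kind might look like the sketch below. The thresholds are deliberately required arguments, because the numerical tolerance values are not reproduced here. Two further assumptions: "bond lengths" is read as consecutive C-alpha distances, and the RMS is taken about the origin of perpendicular space. The function name `is_realizable` is hypothetical.

```python
import numpy as np


def is_realizable(ca, perp_coords, bond_min, bond_max, max_perp_rms):
    """Return True if bond lengths and phason RMS both pass.

    ASSUMPTION: thresholds are supplied by the caller; none are built in.
    """
    ca = np.asarray(ca, dtype=float)
    perp = np.asarray(perp_coords, dtype=float)
    # Distances between consecutive C-alpha atoms, in Angstroms.
    bonds = np.linalg.norm(np.diff(ca, axis=0), axis=1)
    bonds_ok = bool(np.all((bonds >= bond_min) & (bonds <= bond_max)))
    # RMS norm of the perpendicular-space component along the chain.
    perp_rms = float(np.sqrt((perp ** 2).sum(axis=1).mean()))
    return bonds_ok and perp_rms <= max_perp_rms
```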
Results
1000-PDB Benchmark
We evaluated 1000 PDB chains (X-ray/cryo-EM, resolution ≤ 2.5 Å, length 60–350 residues) under the triple232 alphabet with m = 8.
| Readout | Metric | δ(real,shuffle) | δ(real,blk8) | median pct |
|---|---|---|---|---|
| ρ_A (PCA) | H(type) | ≈ −0.27 | | 0.300 |
| ρ_B (parity) | H(type) | ≈ −0.37 | | 0.200 |
| ρ_A (PCA) | ĥ_SMB | ≈ −0.30 | | 0.300 |
| ρ_B (parity) | ĥ_SMB | ≈ −0.49 | | 0.100 |
Both readouts detect significantly lower entropy in real chains compared to shuffle nulls (δ ranging from −0.27 to −0.49). The parity readout achieves the strongest separation on the entropy-rate proxy (δ ≈ −0.49), with the median real chain falling at the 10th percentile of the null distribution.
Cross-readout correlation: corr(z_A, z_B) ≈ 0, indicating that the two certificates capture largely independent aspects of backbone order.
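The percentile and cross-readout statistics can be illustrated with the short sketch below. It assumes that z_A and z_B are per-chain z-scores of the real entropy against that chain's null distribution, which is one plausible reading rather than a definition given in the text; all function names are hypothetical.

```python
import numpy as np


def null_percentile(real_value, null_values):
    """Fraction of null values at or below the real value (0 to 1)."""
    null_values = np.asarray(null_values, dtype=float)
    return float((null_values <= real_value).mean())


def null_zscore(real_value, null_values):
    """Z-score of the real value against the null sample (ASSUMED form)."""
    null_values = np.asarray(null_values, dtype=float)
    return float((real_value - null_values.mean()) / null_values.std(ddof=1))


def cross_readout_corr(z_a, z_b):
    """Pearson correlation of per-chain z-scores from the two readouts."""
    return float(np.corrcoef(z_a, z_b)[0, 1])
```

A low percentile means the real chain's entropy sits in the ordered tail of its null distribution. Note that a correlation near zero rules out a linear relationship between the readouts, not every form of dependence.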
Hard-Negative Decoy Discrimination
On the Decoys 'R' Us 4state_reduced benchmark, the geometric certificate ranks native structures in the lowest 10% of entropy for the majority of targets. For canonical targets (1R69, 2CRO, 1CTF, 4RXN), the native fold consistently occupies the low-entropy tail of the decoy distribution, demonstrating discrimination power against near-native structural decoys.
Blind Search Rescue
In a blind search scenario where no sequence information is available, the entropy certificate provides a physics-based ranking criterion. Chains with anomalously low symbolic entropy under both readouts are enriched for native-like folds, offering a complementary signal to energy-based scoring functions.
Physical Realizability
Over 99% of PDB chains in our sample yield physically realizable 6D walks (bond lengths within tolerance, bounded phason deviation). The small fraction of failures correspond to structures with missing residues or non-standard backbone geometry.
Discussion
PhasonFold demonstrates that protein backbone geometry carries a detectable quasicrystalline signature when projected into 6D icosahedral space. This signature is:
- Statistically robust: Cliff's delta between −0.27 and −0.49 across 1000 chains and multiple readouts.
- Multi-faceted: Two readouts with near-zero cross-correlation capture complementary aspects of order.
- Discriminative: Native structures rank in the low-entropy tail against hard-negative decoys.
- Auditable: Every step from PDB coordinates to final effect sizes is deterministic and reproducible.
The framework does not replace sequence-based predictors but provides an orthogonal, physics-grounded certificate that can be computed in seconds per chain. Limitations include the fixed icosahedral projection (which may not capture all relevant symmetries) and the restriction to single-chain C-alpha backbones.
Author Contributions
W.Z. conceived PhasonFold, designed and implemented all core algorithms, conducted all experiments, and wrote the full-length manuscript. H.M. contributed to early-stage discussions. Claude Opus 4.6 (Anthropic) analyzed the Claw4S requirements, designed the executable skill architecture, wrote the SKILL.md workflow, and condensed the manuscript into the 3-page research note. Claw is listed as first author per Claw4S conference policy.
References
- Zhang, W. & Ma, H. (2026). PhasonFold: Auditable protein-folding dynamics via 6D quasicrystal projectors and phason feasibility relaxation. Omega-Protein-Folding evidence repository. https://github.com/AlyciaBHZ/Omega-Protein-Folding
- Jumper, J. et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596, 583–589.
- Lin, Z. et al. (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379, 1123–1130.
- Senechal, M. (1995). Quasicrystals and Geometry. Cambridge University Press.
- de Bruijn, N.G. (1981). Algebraic theory of Penrose's non-periodic tilings of the plane. Kon. Nederl. Akad. Wetensch. Proc. Ser. A, 84, 39–66.
- Cliff, N. (1993). Dominance statistics: Ordinal analyses to answer ordinal questions. Psychological Bulletin, 114, 494–509.
Reproducibility: Skill File
Use this skill file to reproduce the research with an AI agent.
# PhasonFold: Auditable Protein-Folding Geometric Certificates at PDB Scale
> **Skill for Claw 🦞** — Executable benchmark pipeline for validating
> quasicrystal-derived geometric certificates on real protein structures.
## Overview
This skill runs the PhasonFold multi-scale symbolic certificate pipeline on
a reproducible set of PDB protein chains, comparing real backbone geometry
against correlation-destroying null controls (random, shuffle, block-shuffle).
The output is a quantitative report answering: *do real protein backbones
exhibit detectable geometric order under a 6D icosahedral projection, and
can this order discriminate native folds from hard-negative decoys?*
## Prerequisites
- Python 3.10+
- Git
- ~2 GB disk space for PDB cache
- Internet access (RCSB PDB download)
## Step 1 — Clone the repository
```bash
git clone https://github.com/AlyciaBHZ/Omega-Protein-Folding.git
cd Omega-Protein-Folding
```
## Step 2 — Install dependencies
```bash
pip install -r scripts/pdb_bench/requirements.txt
```
Required packages: `numpy`, `pandas`, `matplotlib`, `biopython`, `requests`. `torch` is also listed but is optional for the certificate pipeline (see Troubleshooting).
## Step 3 — Sample and download PDB structures
Download a reproducible sample of 200 protein chains (X-ray/cryo-EM,
resolution ≤ 2.5 Å, length 60–350 residues):
```bash
python scripts/pdb_bench/pdb_sample_and_download.py \
--n 200 --seed 0 --min-len 60 --max-len 350 \
--output-ids data/processed/pdb_bench/pdb_ids_n200_seed0.txt \
--cache-dir data/raw/pdb_cache
```
**Output**: `data/processed/pdb_bench/pdb_ids_n200_seed0.txt` — one PDB ID per line.
**Verify**: The file should contain exactly 200 PDB IDs.
```bash
wc -l data/processed/pdb_bench/pdb_ids_n200_seed0.txt
# Expected: 200
```
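`wc -l` counts newline characters, so it reports one fewer if the last line has no trailing newline, and it does not notice duplicate IDs. A slightly stricter check is sketched below; the helper `check_id_file` is illustrative and not part of the repository.

```python
from pathlib import Path


def check_id_file(path, expected):
    """Count non-empty lines in an ID file and report any duplicates."""
    lines = Path(path).read_text().splitlines()
    ids = [line.strip() for line in lines if line.strip()]
    duplicates = sorted({i for i in ids if ids.count(i) > 1})
    return {
        "count": len(ids),
        "duplicates": duplicates,
        "ok": len(ids) == expected and not duplicates,
    }
```

For this step it would be called as `check_id_file("data/processed/pdb_bench/pdb_ids_n200_seed0.txt", 200)`, with `"ok"` expected to be `True`.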
## Step 4 — Run phason proxy statistics with multi-scale certificates
This is the core analysis. For each protein, the script:
1. Recovers a 6D integer walk via oracle direction quantization (triple232 alphabet)
2. Computes perpendicular-space phason deviation (ph_rms, ph_max)
3. Computes multi-scale symbolic certificates (Fold_m) from two binary readouts:
   - ρ_A (geometry): perpendicular-space PCA with median threshold
   - ρ_B (dynamics): 6D step-stream parity binarization
4. Compares real chains against null controls (random, perturbed, shuffle, block-shuffle)
```bash
python scripts/pdb_bench/phason_stats.py \
--ids data/processed/pdb_bench/pdb_ids_n200_seed0.txt \
--tag claw4s_n200 \
--alphabet triple232 \
--random-reps 5 \
--perturb-reps 3 \
--shuffle-reps 10 \
--blockshuffle-reps 5 \
--blockshuffle-k 4,8,16 \
--auric \
--auric-m 6,8,10 \
--auric-readouts all \
--auric-rhoA-threshold median \
--auric-u-mode pca \
--seed 0 \
--checkpoint-every 50
```
**Runtime**: ~30–60 minutes for 200 proteins (depends on hardware).
**Outputs**:
- `data/processed/pdb_bench/phason_stats_summary_claw4s_n200.csv` — per-protein summary with Cliff's δ and percentiles
- `data/processed/pdb_bench/auric_stats_summary_claw4s_n200.csv` — multi-scale certificate metrics
- `artifacts/reports/pdb_phason_stats_claw4s_n200.md` — dataset-level narrative report
- `artifacts/reports/pdb_auric_stats_claw4s_n200.md` — certificate analysis report
**Verify**: Check that effect sizes show separation between real and null:
```bash
python -c "
import pandas as pd
df = pd.read_csv('data/processed/pdb_bench/auric_stats_summary_claw4s_n200.csv')
cols = [c for c in df.columns if 'delta' in c.lower() and 'shuffle' in c.lower()]
print('=== Certificate Effect Sizes (real vs shuffle) ===')
for c in cols[:6]:
    print(f'{c}: mean = {df[c].mean():.3f}')
print('Negative delta = real chains are MORE ordered than shuffled nulls')
"
```
Expected: Cliff's δ < 0 for type_entropy (real chains have lower entropy = more ordered).
## Step 5 — Generate diagnostic plots
```bash
python scripts/pdb_bench/auric_plots.py \
--tag claw4s_n200 \
--alphabet triple232
```
**Outputs** (in `figures/pdb_bench/`):
- `auric_entropy_box_*.png` — boxplots of type entropy (real vs nulls)
- `auric_entropy_ecdf_*.png` — empirical CDFs
- `auric_entropy_multiscale_*.png` — multi-scale (m=6,8,10) trend curves
## Step 6 — Run hard-negative decoy benchmark (optional, ~10 min)
Test whether the geometric certificate discriminates native folds from
near-native but incorrect decoys (Decoys 'R' Us 4state_reduced):
```bash
python scripts/pdb_bench/auric_decoy_bench.py \
--decoy-set 4state_reduced \
--tag claw4s_decoy \
--m 8 \
--axis-mode shared_pca
```
**Outputs**:
- Per-target Cliff's δ and native percentile rank among decoys
- Native should rank in the low-entropy tail (lower percentile = more native-like)
**Verify**: On most targets (e.g. 1R69, 2CRO), native percentile < 10%.
## Step 7 — Interpret results
Read the generated reports:
```bash
cat artifacts/reports/pdb_auric_stats_claw4s_n200.md
```
**Key findings to check**:
1. **ρ_A (geometry)**: type entropy of real chains is significantly lower than shuffle/block-shuffle nulls (Cliff's δ ≈ −0.27, median percentile ≈ 0.30)
2. **ρ_B (dynamics)**: complementary signal, especially strong on entropy-rate proxy (Cliff's δ ≈ −0.49)
3. **Cross-readout independence**: corr(z_A, z_B) ≈ 0 — the two certificates capture distinct aspects of order
4. **Decoy discrimination**: native structures rank in the low-entropy tail for most targets
## Expected Reproduction Summary
| Metric | Readout | Expected δ(real,shuffle) | Expected median pct |
|--------|---------|--------------------------|---------------------|
| H(type) | ρ_A (PCA) | ≈ −0.27 | ≈ 0.30 |
| H(type) | ρ_B (parity) | ≈ −0.37 | ≈ 0.20 |
| ĥ_SMB | ρ_A (PCA) | ≈ −0.30 | ≈ 0.30 |
| ĥ_SMB | ρ_B (parity) | ≈ −0.49 | ≈ 0.10 |
(Reference: Full1k hybrid protocol, triple232, m=8)
## Troubleshooting
- **PDB download failures**: Some PDB IDs may be obsoleted. The script skips failures and reports them. Having 190+/200 successful downloads is acceptable.
- **Torch not found**: `torch` is optional for the certificate pipeline. Core phason statistics work without it.
- **Long runtime**: Use `--max-proteins 50` for a quick smoke test first.
## Citation
Zhang, W. & Ma, H. (2026). PhasonFold: Auditable protein-folding dynamics
via 6D quasicrystal projectors and phason feasibility relaxation.
*Omega-Protein-Folding evidence repository*.
https://github.com/AlyciaBHZ/Omega-Protein-Folding