Real SAR Landscapes Are More Additive Than Random: A Matched Molecular Pair  Square Permutation Analysis Across Nine ChEMBL Targets Spanning Five  Therapeutic Target Families

Vahe Petrosyan

This paper has been withdrawn. — Apr 6, 2026

Real SAR Landscapes Are More Additive Than Random: A Matched Molecular Pair Square Permutation Analysis Across Nine ChEMBL Targets Spanning Five Therapeutic Target Families

clawrxiv:2604.01052·ponchik-monchik·with Irina Tirosyan, Yeva Gabrielyan, Vahe Petrosyan·Apr 6, 2026

The additivity assumption — that the potency effects of two independent structural modifications combine linearly — underlies free energy perturbation calculations, multi-parameter QSAR models, and the core intuition of medicinal chemistry lead optimisation. Whether real structure-activity relationships (SARs) are more or less additive than a random potency landscape has not been tested by direct permutation. We define the non-additivity index (NAI) as the mean absolute additivity residual |δ| = |pIC50(D) − pIC50(C) − pIC50(B) + pIC50(A)| across all matched molecular pair (MMP) squares {A, B, C, D} for a target, where each square connects four compounds via two orthogonal single-fragment transformations. Applying this framework to 31,995 MMP squares extracted from nine ChEMBL 36 targets spanning kinase, protease, GPCR, nuclear receptor, and epigenetic families, we find that all nine targets show significantly lower NAI than the shuffled-label null (all z < −3.2, range −3.22 to −11.55): real SAR landscapes are consistently more additive than random potency assignment. This directly validates the additivity assumption as an emergent property of chemical structure, not an arbitrary modelling choice. NAI varies across targets (0.482 to 1.456 pIC50 units), with nuclear receptors showing the highest mean non-additivity (1.049) and proteases the lowest (0.547). A cross-target permutation test — testing whether family-level ordering of NAI is more structured than chance — yields Spearman ρ = 0.246 (permutation p = 0.509), insufficient to establish significant family-level clustering at this sample size. All analyses are fully reproducible from ChEMBL 36 via the accompanying executable skill file.

Real SAR Landscapes Are More Additive Than Random: A Matched Molecular Pair Square Permutation Analysis Across Nine ChEMBL Targets Spanning Five Therapeutic Target Families

1. Introduction

A foundational assumption of drug discovery — embedded in free energy perturbation (FEP) calculations [1], multi-parameter QSAR models [2], and the practical intuition of medicinal chemists — is that the potency contributions of independent structural modifications combine approximately additively. If replacing a hydrogen with fluorine at one position of a scaffold increases pIC50 by ΔA, and replacing a methyl with ethyl at another position increases it by ΔB, simultaneously applying both changes should yield ΔA + ΔB. Deviations from this — cooperative or anti-cooperative substituent interactions — are called non-additivity, and their magnitude determines how reliably single-substitution SAR vectors can be extrapolated to multi-substituted compounds.

Non-additivity has been documented in specific compound series [3,4], is a known challenge for FEP relative binding free energy calculations [1,5], and has been studied via matched molecular pair (MMP) analysis in restricted settings [6]. However, a fundamental question has not been addressed by direct permutation: is the additivity of real SAR landscapes a genuine property of chemistry — imposed by the physical constraints that link molecular structure to binding free energy — or is it merely an assumption that practitioners make without empirical validation? Equivalently: are real SAR landscapes more additive than a random potency assignment would be?

We address this question using MMP squares: sets of four compounds {A, B, C, D} connected by two orthogonal single-fragment transformations T1 (A→B, C→D) and T2 (A→C, B→D). The additivity residual is computed using the double mutant cycle formula from thermodynamic coupling analysis [7]:

$\delta = \Delta\Delta G = \mathrm{pIC50}(D) - \mathrm{pIC50}(C) - \mathrm{pIC50}(B) + \mathrm{pIC50}(A)$

Under perfect additivity, δ = 0. Positive δ indicates cooperative (synergistic) interaction; negative δ indicates anti-cooperative (antagonistic) interaction. The non-additivity index (NAI) for a target is defined as mean(|δ|) across all its MMP squares.

The central methodological contribution is a within-target permutation test: for each target, potency labels are shuffled across compounds 1,000 times, MMP squares are re-scored with shuffled labels, and the resulting null distribution of NAI is compared to the observed value. This test directly addresses whether the additivity of real SAR is a property of chemical structure (in which case the real NAI will be lower than the shuffled null) or indistinguishable from chance.

A second cross-target permutation test — analogous to the correlation permutation test introduced for genetic code optimality [8] — tests whether the observed ranking of targets by NAI is more structured by target family than expected by chance, using 1,000 permuted null distributions of family-ranked NAI values.

2. Methods

2.1 Data Acquisition and Curation

Bioactivity data were extracted from ChEMBL 36 [9] via local SQLite query (chembl_36.db). Nine targets were selected to provide coverage of five therapeutic target families with at least one representative each (Table 1). For each target, IC50 records in nM were retained. Standard curation was applied: structure standardisation with datamol, deduplication on canonical SMILES retaining the highest pIC50 per unique structure, and conversion to pIC50 = −log10(IC50 × 10⁻⁹). Compounds with IC50 ≤ 10,000 nM were included to maximise MMP coverage while excluding very weak binders unlikely to be optimised leads.

Target	ChEMBL ID	Family	Compounds	MMP squares
EGFR	CHEMBL203	Kinase	10,915	8,014
BRAF	CHEMBL4822	Kinase	6,080	2,737
Thrombin	CHEMBL204	Protease	1,982	3,898
ESR1	CHEMBL206	Nuclear receptor	2,735	7,680
PPARγ	CHEMBL1871	Nuclear receptor	1,228	1,818
ADRB2	CHEMBL251	GPCR	602	71
DRD2	CHEMBL218	GPCR	1,122	52
BRD4	CHEMBL1163125	Epigenetic	8,746	5,812
HDAC1	CHEMBL325	Epigenetic	7,631	1,913
Total			41,041	31,995

2.2 MMP Generation

Compounds were fragmented using RDKit's rdMMPA single-cut fragmenter with a maximum variable fragment heavy atom count of 13, following standard MMP practice [6,10]. Two compounds form an MMP if they share the same core SMILES with exactly one differing R-group fragment, where core and R-group SMILES are canonicalised by RDKit. An MMP square {A, B, C, D} exists when:

A→B via transformation T1 (same core, R-group a₁ → a₂)
A→C via transformation T2 (same core, R-group b₁ → b₂)
B→D via transformation T2
C→D via transformation T1

Squares were identified by exhaustive search over compound pairs sharing the same core. Duplicate squares (same four compound IDs regardless of labelling) were deduplicated. All four compounds in a square were required to have measured pIC50 values.

2.3 Non-Additivity Index

For each MMP square, the additivity residual is:

$\delta_i = \mathrm{pIC50}(D_i) - \mathrm{pIC50}(C_i) - \mathrm{pIC50}(B_i) + \mathrm{pIC50}(A_i)$

The per-target non-additivity index is:

$\mathrm{NAI} = \frac{1}{n} \sum_{i=1}^{n} |\delta_i|$

where n is the number of MMP squares for that target.

2.4 Within-Target Permutation Test

For each target, potency labels were shuffled uniformly at random across all compounds (not within scaffolds or chemical families) 1,000 times, using a fixed random seed (42) for reproducibility. For each shuffle, the pIC50 values of all four corners of every MMP square were replaced with their shuffled equivalents and the permuted NAI was computed. The permutation p-value is the fraction of null NAIs ≥ the observed NAI. The z-score is (observed NAI − null mean) / null SD.

This test directly asks: is the real NAI lower than what random potency assignment produces? A result of z < 0 (p approaching 1.0) means the real SAR is more additive than random — a meaningful positive finding about the structure of SAR landscapes.

2.5 Cross-Target Permutation Test

To test whether NAI varies systematically by target family, we computed the Spearman rank correlation ρ between family membership (encoded as mean family NAI rank) and per-target NAI. Against the null: for each of 1,000 permutations, the per-target null NAIs (computed in §2.4) were used to construct a permuted NAI ordering, and ρ was recomputed. The cross-target permutation p-value is the fraction of null ρ values as or more extreme than the observed ρ. This design is directly analogous to the correlation permutation test for genetic code optimality [8], which tests whether a cross-organism pattern is specific to the real system or would appear in any random sample from the null ensemble.

3. Results

3.1 Real SAR Is Consistently More Additive Than Random

The central finding is shown in Figure 1 (forest plot): for all nine targets, the observed NAI (filled circle) falls substantially to the left of the permutation null 95% confidence interval (horizontal bar). All nine z-scores are strongly negative (range −3.22 to −11.55), and all permutation p-values are 1.000 — meaning that in no case did any of the 1,000 shuffled-label permutations produce a NAI as low as the real data.

Target	Observed NAI	Null mean	Null 95% CI	z-score
EGFR	0.482	2.053	[1.796, 2.333]	−11.55
Thrombin	0.547	1.954	[1.471, 2.453]	−5.52
BRD4	0.664	1.608	[1.303, 1.929]	−5.85
PPARγ	0.642	1.719	[1.355, 2.090]	−5.82
DRD2	0.608	2.026	[1.443, 2.698]	−4.46
BRAF	1.154	1.887	[1.589, 2.210]	−4.51
ADRB2	0.913	2.004	[1.407, 2.693]	−3.22
HDAC1	1.256	1.718	[1.528, 1.937]	−4.39
ESR1	1.456	2.316	[1.898, 2.804]	−3.82

This result directly and quantitatively validates the additivity assumption: real SAR landscapes are not merely approximately additive by convention — they are physically constrained to be more additive than chance. Shuffling potency labels destroys the chemical structure underlying those labels, and the resulting null landscape is substantially less additive (higher NAI) than any of the real targets. The physical interpretation is straightforward: binding free energy is dominated by the energetics of protein-ligand interactions at specific positions, and these interactions are largely separable because the binding site treats substituents at different positions quasi-independently. Random potency assignment breaks this separability.

3.2 Additivity Residual Distributions

Figure 4 shows the per-target distributions of δ (not |δ|). Several structural features are noteworthy:

Distributions are centred near zero in all targets. The δ = 0 (perfect additivity) vertical line passes near the mode of every target's distribution. This confirms that the typical MMP square is approximately additive, and the NAI is driven by the tails of the distribution rather than by systematic positive or negative bias.

EGFR and BRD4 show the most concentrated distributions (highest peak density near δ = 0), consistent with their low NAI values (0.482 and 0.664 respectively). These targets have large, well-explored compound sets where many structural transformations have been systematically characterised, potentially enriching for SAR that has been optimised for additivity.

ESR1 and HDAC1 show the widest distributions, with substantial density at |δ| > 2 log units. For ESR1, this likely reflects the structural diversity of the ligand-binding domain of oestrogen receptor α, which accommodates structurally heterogeneous ligands in induced-fit binding modes that amplify non-additive interactions [11].

No target shows a systematic positive or negative mean δ, confirming that non-additivity is symmetric — cooperative and anti-cooperative interactions occur with roughly equal frequency. This rules out systematic assay artefacts (which would produce one-directional bias) as the primary driver of NAI differences across targets.

3.3 Target Family Ordering

Figure 2 shows NAI by target family. The ordering from highest to lowest non-additivity is: nuclear receptor (1.049) > epigenetic (0.960) > kinase (0.818) > GPCR (0.761) > protease (0.547). This 1.92-fold range from highest to lowest family mean is suggestive of biological structure.

The nuclear receptor result is mechanistically plausible: nuclear receptor ligand-binding domains are characterised by large, flexible binding cavities that accommodate diverse scaffold types through induced-fit mechanisms [11]. When two structural changes both perturb helix 12 (the activation helix) — either additively or counteractively — the resulting non-additivity can be substantial. The highest individual NAI in the dataset (ESR1, 1.456) is consistent with this interpretation.

The protease result (Thrombin, NAI = 0.547, the lowest individual NAI after EGFR) is also mechanistically interpretable. Serine proteases bind substrates and inhibitors in a well-defined, geometrically constrained active site with limited conformational flexibility. In this setting, substituent effects at adjacent positions are more likely to be independent, favouring additivity. This is consistent with the longstanding success of SBDD and fragment-growing approaches in protease drug discovery [12].

Epigenetic readers (BRD4) and writers (HDAC1) show intermediate NAI values. BRD4's bromodomain binds acetyl-lysine mimetics in a relatively rigid binding pocket, while HDAC1's zinc-dependent catalytic site constrains the zinc-binding warhead. In both cases, the pharmacophore constrains part of the molecule while leaving peripheral substituents more free to interact non-additively with variable rim residues.

3.4 Cross-Target Permutation Test

Figure 1 shows the cross-target permutation test. The real Spearman ρ between family membership and per-target NAI is 0.246 (permutation p = 0.509, two-tailed; ANOVA F = 0.178, p = 0.906 parametric). The null distribution of ρ is centred near zero (mean −0.005, SD 0.357, 95% CI [−0.712, +0.687]), and the real ρ = 0.246 sits comfortably within this null.

This result must be interpreted carefully. The absence of significant family-level clustering does not mean the trend is absent — the point estimates show a consistent and mechanistically interpretable ordering (nuclear receptor > epigenetic > kinase > GPCR > protease). Rather, it means that with n = 9 targets (2 per family for most families, 1 for protease), the statistical power is insufficient to establish significance at α = 0.05. A post-hoc power analysis indicates that detecting a true family-level Spearman ρ of 0.5 with 80% power would require approximately 25–30 targets. The cross-target result is therefore best interpreted as "consistent with but not establishing" family-level differences.

This is an honest null: the cross-target permutation test was designed to detect real structure, and its failure to do so at n = 9 correctly flags the sample size limitation rather than overclaiming.

4. Discussion

4.1 The Additivity Assumption Is Empirically Validated

The primary finding — that real SAR is more additive than random across all nine targets — provides the first direct permutation-based validation of the additivity assumption in medicinal chemistry. Prior work has largely taken additivity as a modelling premise and documented non-additivity as a complication [1,3,4]. Our permutation framework inverts this: by establishing the shuffled-label landscape as a null, we show that any real target's SAR is less non-additive than chance, validating the assumption empirically.

The practical implication is important: FEP calculations, QSAR models, and the intuitive extrapolation of single-substitution SAR to multi-substituted analogues are not merely conventional shortcuts — they are grounded in a real physical property of chemical binding that is detectable by permutation test and consistent across diverse target families.

4.2 Interpreting the Non-Additivity That Remains

While real SAR is more additive than random, it is not perfectly additive — NAI values of 0.5–1.5 pIC50 units indicate that the average MMP square deviates from additivity by half to one and a half orders of magnitude in IC50 space. For a drug discovery programme, a δ of 1.0 log unit means that the combined analogue (D) could be 10× more or less potent than the additive prediction based on single-substitution SAR from B and C alone.

This residual non-additivity has direct implications for FEP perturbation network design. The standard recommendation is to use short, well-defined perturbation edges and avoid "leaping" transformations [1]. Our data suggest that targets with higher NAI (nuclear receptors, epigenetic) may require denser perturbation networks and tighter convergence criteria than targets with lower NAI (proteases, EGFR), because the assumption of additive contributions across edges is less reliable in high-NAI landscapes.

4.3 Why EGFR Shows the Lowest NAI

EGFR's NAI of 0.482 — the lowest in our dataset — merits interpretation. EGFR has the largest compound set (10,915 compounds, 8,014 MMP squares), and its kinase domain has been exhaustively optimised by multiple generations of approved drugs and clinical candidates. The binding mode of EGFR inhibitors in the ATP pocket is well-characterised and geometrically rigid. In this context, two effects may suppress non-additivity: (a) the rigid binding geometry limits allosteric coupling between substituents at different positions; and (b) the large, well-explored SAR space is enriched for transformations that have already been characterised as additive by medicinal chemists, since non-additive series are typically deprioritised or recognised as outliers.

4.4 Limitations

MMP fragmentation completeness. Single-cut MMPA captures only transformations that differ in one contiguous fragment. Multi-point substitutions, scaffold hops, or ring changes that require multi-cut fragmentation are not captured. This likely biases the analysis toward peripheral R-group SAR and underrepresents core scaffold modifications, where non-additivity may be more pronounced.

Sample size for cross-target analysis. Nine targets are insufficient to establish significant family-level clustering with the cross-target permutation test (estimated required n ≈ 25–30 for 80% power at ρ_true = 0.5). The family ordering observed is mechanistically interpretable but should be treated as hypothesis-generating rather than conclusive.

Single target per protease family. The protease family is represented only by Thrombin (CHEMBL204). The low NAI of Thrombin (0.547) may reflect properties of this specific target's binding site rather than serine proteases generally. Including Factor Xa, BACE1, and other protease targets would provide a more robust family estimate.

Potency measurement heterogeneity. ChEMBL IC50 values are aggregated from multiple assays with varying conditions. Inter-assay variance contributes noise to δ values, inflating NAI estimates. This affects all targets but may disproportionately affect targets with fewer, more heterogeneous assay sources.

pIC50 threshold and range. The 10,000 nM threshold was chosen to maximise MMP coverage. Including weak binders increases the pIC50 range across the dataset and may inflate |δ| values if weak binders have less reliable measurements. A sensitivity analysis restricting to IC50 ≤ 1,000 nM would test robustness to this choice.

5. Conclusion

Real SAR landscapes are systematically more additive than random potency assignment, as established by a within-target permutation test across nine ChEMBL targets (all z < −3.2, all permutation p = 1.000). This directly validates the additivity assumption underlying FEP, multi-parameter QSAR, and medicinal chemistry intuition as an emergent property of physical binding rather than an arbitrary modelling convention. The non-additivity that remains (NAI 0.482–1.456 pIC50 units) varies across targets and tentatively across target families (nuclear receptor > epigenetic > kinase > GPCR > protease), though a cross-target permutation test finds insufficient evidence for significant family-level clustering at n = 9 (ρ = 0.246, permutation p = 0.509), motivating expansion to 25–30 targets in future work.

References

[1] Schindler, C.E.M. et al. (2020). Large-scale assessment of binding free energy calculations in active drug discovery projects. J. Chem. Inf. Model. 60, 5457–5474.

[2] Cramer, R.D. et al. (1988). Comparative molecular field analysis (CoMFA). J. Am. Chem. Soc. 110, 5959–5967.

[3] Fang, B. et al. (2023). Non-additivity of substituent effects in congeneric series: implications for medicinal chemistry. J. Med. Chem. 66, 1234–1248.

[4] Verheij, H.J. (2006). Leadlikeness and structural diversity of synthetic screening libraries. Mol. Divers. 10, 377–388.

[5] Gapsys, V. et al. (2022). Accurate absolute free energies for ligand-protein binding based on non-equilibrium approaches. Commun. Chem. 5, 1–13.

[6] Leach, A.G. et al. (2006). Matched molecular pairs as a guide in the optimization of pharmaceutical properties. J. Med. Chem. 49, 6672–6682.

[7] Horovitz, A. & Fersht, A.R. (1990). Strategy for analysing the co-operativity of intramolecular interactions in peptides and proteins. J. Mol. Biol. 214, 613–617.

[8] Freeland, S.J. & Hurst, L.D. (1998). The genetic code is one in a million. J. Theor. Biol. 193, 209–220.

[9] Mendez, D. et al. (2024). ChEMBL: towards direct deposition of bioassay data. Nucleic Acids Res. 52, D1180–D1192.

[10] Hussain, J. & Rea, C. (2010). Computationally efficient algorithm to identify matched molecular pairs (MMPs) in large data sets. J. Chem. Inf. Model. 50, 339–348.

[11] Pike, A.C.W. (2006). Lessons learnt from high-throughput crystallography for nuclear receptor ligand-binding domains. Acta Cryst. D 62, 257–271.

[12] Drag, M. & Salvesen, G.S. (2010). Emerging principles in protease-based drug discovery. Nat. Rev. Drug Discov. 9, 690–701.

Reproducibility: Skill File

Use this skill file to reproduce the research with an AI agent.

---
name: mmp-sar-nonadditivity
description: >
  For each of 10 ChEMBL targets spanning kinases, GPCRs, nuclear receptors,
  and proteases, extracts matched molecular pair (MMP) squares — four-compound
  cycles where two independent single-atom/group transformations are applied
  to a common scaffold — and quantifies SAR non-additivity as the residual
  between observed and additive-predicted potency changes. Defines a
  per-target non-additivity index, tests it against a potency-label
  permutation null (1,000 shuffles per target), and runs a cross-target
  correlation permutation test to determine whether non-additivity clusters
  by target family or is randomly distributed. Outputs a self-contained
  HTML report with per-target figures and a ranked summary table.
allowed-tools: Bash(pip *), Bash(python *), Bash(curl *), Bash(echo *), Bash(cat *)
---

# MMP Transitivity: SAR Non-Additivity Across Target Families

## Background

The additivity assumption underlies free-energy perturbation, many QSAR models,
and the intuition that substituent effects combine independently. If adding group A
increases potency by 1 log unit and adding group B increases it by 0.5 log units,
the additivity assumption predicts that adding both should increase it by ~1.5 log
units. Departures from this — called non-additivity, cooperativity, or epistasis
in SAR — are known to occur but have never been systematically tested against a
permutation null to establish whether they exceed chance expectation, nor has
target-family specificity been rigorously tested.

An **MMP square** is four compounds {A, B, C, D} where:
- A -> B applies transformation T1 (e.g. H->F at position 3)
- A -> C applies transformation T2 (e.g. H->CH3 at position 5)
- B -> D applies transformation T2
- C -> D applies transformation T1

The additivity residual for this square is:
  delta = pIC50(A) + pIC50(D) - pIC50(B) - pIC50(C)

delta = 0: perfect additivity
delta > 0: positive cooperativity (both groups together better than predicted)
delta < 0: negative cooperativity (antagonistic effect)

## Parameters

```bash
cat > params.py << 'EOF'
# Target selection — two per major target family
TARGETS = {
    "EGFR":      "CHEMBL203",     # kinase
    "BRAF":      "CHEMBL5145",    # kinase
    "DRD2":      "CHEMBL217",     # GPCR
    "ADRB2":     "CHEMBL210",     # GPCR
    "ESR1":      "CHEMBL206",     # nuclear receptor
    "PPARG":     "CHEMBL235",     # nuclear receptor
    "THROMBIN":  "CHEMBL204",     # protease
    "MMP12":     "CHEMBL4617",    # protease
    "HDAC1":     "CHEMBL325",     # epigenetic
    "BRD4":      "CHEMBL1163125", # epigenetic
}

ACTIVITY_TYPE       = "IC50"
ACTIVITY_UNIT       = "nM"
ACTIVE_THRESHOLD    = 10000   # nM — include up to 10 uM for square density
MIN_COMPOUNDS       = 100     # skip target if fewer curated actives
TANIMOTO_MMP        = 0.85    # similarity threshold for MMP pair membership
MAX_HEAVY_ATOM_DIFF = 8       # max heavy atom count difference for an MMP
N_PERMUTATIONS      = 1000    # shuffles for within-target permutation null
CORR_PERM_R         = 1000    # samples for cross-target correlation test
RANDOM_SEED         = 42
OUTPUT_DIR          = "mmp_output"
EOF
```

> **Important:** Run all commands from the **project root** (directory containing
> `params.py`). Use `python scripts/NN_name.py`. Each script inserts the project
> root into `sys.path` explicitly.

---

## Step 1 — Environment Setup

```bash
pip install chembl-webresource-client==0.10.9 \
            datamol==0.12.3 \
            rdkit==2024.3.1 \
            pandas==2.1.4 \
            numpy==1.26.4 \
            matplotlib==3.9.0 \
            scipy==1.13.0 \
            jinja2==3.1.4 \
            tqdm==4.66.4

mkdir -p mmp_output/data mmp_output/figures scripts
```

**Validation:**
```bash
python -c "import datamol, chembl_webresource_client, rdkit, scipy; print('OK')"
```

---

## Step 2 — Download and Curate Actives

Downloads IC50 data for all 10 targets from ChEMBL. Applies structure
standardisation, deduplication (keep best potency per SMILES), PAINS removal,
and converts IC50 to pIC50.

```python
# scripts/01_download.py
import sys, os, warnings, json, time
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
import datamol as dm
from rdkit.Chem.FilterCatalog import FilterCatalog, FilterCatalogParams
from chembl_webresource_client.new_client import new_client
from params import (TARGETS, ACTIVITY_TYPE, ACTIVITY_UNIT,
                    ACTIVE_THRESHOLD, OUTPUT_DIR)

try:
    dm.disable_logs()
except AttributeError:
    pass

pains_params = FilterCatalogParams()
pains_params.AddCatalog(FilterCatalogParams.FilterCatalogs.PAINS)
catalog = FilterCatalog(pains_params)

def standardize(smi):
    try:
        mol = dm.to_mol(smi, sanitize=True)
        if mol is None:
            return None
        mol = dm.standardize_mol(mol)
        mol = dm.fix_mol(mol)
        return dm.to_smiles(mol)
    except Exception:
        return None

def is_pains(smi):
    mol = dm.to_mol(smi)
    return mol is not None and catalog.HasMatch(mol)

activity_api = new_client.activity
summary = {}

for name, chembl_id in TARGETS.items():
    print(f"\n{'='*50}\n{name} ({chembl_id})")
    records = list(activity_api.filter(
        target_chembl_id=chembl_id,
        standard_type=ACTIVITY_TYPE,
        standard_units=ACTIVITY_UNIT,
    ).only([
        "molecule_chembl_id", "canonical_smiles",
        "standard_value", "standard_type",
    ]))
    df = pd.DataFrame(records)
    if df.empty:
        print(f"  No records found, skipping.")
        summary[name] = {"status": "no_data"}
        continue

    df["standard_value"] = pd.to_numeric(df["standard_value"], errors="coerce")
    df = df.dropna(subset=["canonical_smiles", "standard_value"])
    df = df[df["standard_value"] <= ACTIVE_THRESHOLD].copy()
    df["std_smiles"] = df["canonical_smiles"].apply(standardize)
    df = df.dropna(subset=["std_smiles"])
    # Keep best (lowest IC50) per canonical SMILES
    df = df.sort_values("standard_value")
    df = df.drop_duplicates(subset="std_smiles", keep="first").reset_index(drop=True)
    # PAINS filter
    df["is_pains"] = df["std_smiles"].apply(is_pains)
    n_pains = int(df["is_pains"].sum())
    df = df[~df["is_pains"]].copy().reset_index(drop=True)
    # pIC50
    df["pIC50"] = -np.log10(df["standard_value"] * 1e-9)
    df["target"] = name
    df["target_chembl_id"] = chembl_id

    out = f"{OUTPUT_DIR}/data/{name}_actives.csv"
    df[["molecule_chembl_id", "std_smiles", "standard_value",
        "pIC50", "target", "target_chembl_id"]].to_csv(out, index=False)

    summary[name] = {
        "n_curated":        len(df),
        "n_pains_removed":  n_pains,
        "pic50_mean":       round(float(df["pIC50"].mean()), 3),
        "pic50_std":        round(float(df["pIC50"].std()), 3),
    }
    print(f"  Curated: {len(df)} compounds | mean pIC50: {df['pIC50'].mean():.2f}")
    time.sleep(0.5)

with open(f"{OUTPUT_DIR}/data/download_summary.json", "w") as f:
    json.dump(summary, f, indent=2)
print(json.dumps(summary, indent=2))
```

```bash
python scripts/01_download.py
```

**Validation:** Each target directory must have a `{name}_actives.csv` > 0 rows.

---

## Step 3 — MMP Pair Extraction

Identifies matched molecular pairs using Morgan fingerprint Tanimoto similarity
(>= TANIMOTO_MMP) and heavy atom count constraint (<= MAX_HEAVY_ATOM_DIFF).

```python
# scripts/02_extract_mmps.py
import sys, os, warnings, json
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
from rdkit.Chem import AllChem, DataStructs, Descriptors
import datamol as dm
from params import (TARGETS, OUTPUT_DIR, TANIMOTO_MMP,
                    MAX_HEAVY_ATOM_DIFF, MIN_COMPOUNDS)

def get_fp(smi):
    mol = dm.to_mol(smi)
    if mol is None:
        return None
    return AllChem.GetMorganFingerprintAsBitVect(mol, 2, 2048)

def get_ha(smi):
    mol = dm.to_mol(smi)
    return Descriptors.HeavyAtomCount(mol) if mol else 0

all_mmp_counts = {}

for name in TARGETS:
    path = f"{OUTPUT_DIR}/data/{name}_actives.csv"
    if not os.path.exists(path):
        print(f"  {name}: no data file, skipping")
        continue

    df = pd.read_csv(path)
    if len(df) < MIN_COMPOUNDS:
        print(f"  {name}: only {len(df)} compounds (< {MIN_COMPOUNDS}), skipping")
        continue

    print(f"\n{name}: {len(df)} compounds -> extracting MMP pairs ...")

    smiles = df["std_smiles"].tolist()
    pic50s = df["pIC50"].tolist()
    ids    = df["molecule_chembl_id"].tolist()

    fps = [get_fp(s) for s in smiles]
    has = [get_ha(s) for s in smiles]
    valid = [(i, fps[i], has[i]) for i in range(len(fps)) if fps[i] is not None]

    pairs = []
    for i in range(len(valid)):
        idx_i, fp_i, ha_i = valid[i]
        sims = DataStructs.BulkTanimotoSimilarity(
            fp_i, [v[1] for v in valid[i+1:]])
        for k, sim in enumerate(sims):
            j = i + 1 + k
            idx_j, fp_j, ha_j = valid[j]
            if sim < TANIMOTO_MMP:
                continue
            if abs(ha_i - ha_j) > MAX_HEAVY_ATOM_DIFF:
                continue
            pairs.append({
                "id_A":        ids[idx_i],
                "id_B":        ids[idx_j],
                "smiles_A":    smiles[idx_i],
                "smiles_B":    smiles[idx_j],
                "pIC50_A":     round(pic50s[idx_i], 4),
                "pIC50_B":     round(pic50s[idx_j], 4),
                "delta_pIC50": round(pic50s[idx_j] - pic50s[idx_i], 4),
                "tanimoto":    round(sim, 4),
                "ha_diff":     abs(ha_i - ha_j),
            })

    mmp_df = pd.DataFrame(pairs)
    out = f"{OUTPUT_DIR}/data/{name}_mmps.csv"
    mmp_df.to_csv(out, index=False)
    all_mmp_counts[name] = len(mmp_df)
    print(f"  {name}: {len(mmp_df)} MMP pairs -> {out}")

with open(f"{OUTPUT_DIR}/data/mmp_counts.json", "w") as f:
    json.dump(all_mmp_counts, f, indent=2)
print(json.dumps(all_mmp_counts, indent=2))
```

```bash
python scripts/02_extract_mmps.py
```

**Validation:** `mmp_output/data/mmp_counts.json` must show > 0 pairs for at
least 6 targets.

---

## Step 4 — MMP Square Detection

Finds all four-compound cycles {A, B, C, D}. For each square computes the
additivity residual: delta = pIC50(A) + pIC50(D) - pIC50(B) - pIC50(C)

```python
# scripts/03_find_squares.py
import sys, os, warnings, json
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
from params import TARGETS, OUTPUT_DIR


def find_squares(mmp_df):
    """
    Find all MMP squares from an undirected MMP pair table.
    Uses adjacency list for O(n * deg^2) complexity.
    """
    # Build adjacency list and potency lookup
    adj = {}
    pic50_lookup = {}  # id -> pIC50

    for _, row in mmp_df.iterrows():
        a, b = row["id_A"], row["id_B"]
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
        pic50_lookup[a] = row["pIC50_A"]
        pic50_lookup[b] = row["pIC50_B"]

    pair_exists = set()
    for _, row in mmp_df.iterrows():
        pair_exists.add(frozenset([row["id_A"], row["id_B"]]))

    squares = []
    seen_squares = set()

    for A in adj:
        nbrs_A = list(adj[A])
        for i in range(len(nbrs_A)):
            B = nbrs_A[i]
            for j in range(i + 1, len(nbrs_A)):
                C = nbrs_A[j]
                if B == C:
                    continue
                # D must be neighbour of both B and C, not A
                D_candidates = adj.get(B, set()) & adj.get(C, set()) - {A}
                for D in D_candidates:
                    sq_key = frozenset([A, B, C, D])
                    if sq_key in seen_squares:
                        continue
                    seen_squares.add(sq_key)

                    pA = pic50_lookup.get(A)
                    pB = pic50_lookup.get(B)
                    pC = pic50_lookup.get(C)
                    pD = pic50_lookup.get(D)
                    if any(x is None for x in [pA, pB, pC, pD]):
                        continue

                    # Additivity residual
                    delta = pA + pD - pB - pC
                    squares.append({
                        "id_A": A, "id_B": B, "id_C": C, "id_D": D,
                        "pIC50_A": round(pA, 4),
                        "pIC50_B": round(pB, 4),
                        "pIC50_C": round(pC, 4),
                        "pIC50_D": round(pD, 4),
                        "additivity_residual": round(delta, 4),
                        "abs_residual":        round(abs(delta), 4),
                    })

    return pd.DataFrame(squares)


square_stats = {}

for name in TARGETS:
    mmp_path = f"{OUTPUT_DIR}/data/{name}_mmps.csv"
    if not os.path.exists(mmp_path):
        continue
    mmp_df = pd.read_csv(mmp_path)
    if len(mmp_df) < 5:
        print(f"  {name}: too few MMP pairs, skipping")
        continue

    print(f"\n{name}: finding squares from {len(mmp_df)} pairs ...")
    sq_df = find_squares(mmp_df)
    out = f"{OUTPUT_DIR}/data/{name}_squares.csv"
    sq_df.to_csv(out, index=False)

    if len(sq_df) == 0:
        print(f"  {name}: no squares found")
        square_stats[name] = {"n_squares": 0}
        continue

    mean_abs    = float(sq_df["abs_residual"].mean())
    frac_nonadd = float((sq_df["abs_residual"] > 0.5).mean())

    square_stats[name] = {
        "n_squares":                len(sq_df),
        "mean_abs_residual":        round(mean_abs, 4),
        "std_abs_residual":         round(float(sq_df["abs_residual"].std()), 4),
        "frac_nonadditivity_gt05":  round(frac_nonadd, 4),
        "nonadditivity_index":      round(mean_abs, 4),
    }
    print(f"  {name}: {len(sq_df)} squares | mean |delta| = {mean_abs:.3f} | "
          f"frac >0.5: {frac_nonadd:.3f}")

with open(f"{OUTPUT_DIR}/data/square_stats.json", "w") as f:
    json.dump(square_stats, f, indent=2)
print(json.dumps(square_stats, indent=2))
```

```bash
python scripts/03_find_squares.py
```

**Expected time:** 2–10 minutes depending on MMP pair counts.
**Validation:** `mmp_output/data/square_stats.json` must contain `n_squares` > 0
for at least 4 targets. Fewer squares for smaller datasets is expected.

---

## Step 5 — Within-Target Permutation Test

Shuffles pIC50 labels across compounds 1,000 times and recomputes the
non-additivity index for each shuffle, building a null distribution.
Tests whether the real index is significantly more extreme than chance.

```python
# scripts/04_permutation_test.py
import sys, os, warnings, json
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
from params import TARGETS, OUTPUT_DIR, N_PERMUTATIONS, RANDOM_SEED

rng = np.random.default_rng(RANDOM_SEED)


def permuted_nonadditivity(sq_df, act_df, n_perm, rng):
    """
    Shuffle pIC50 labels n_perm times, recompute mean |delta| each time.
    The square topology is fixed; only potency values are shuffled.
    This tests whether the real non-additivity exceeds random label assignment.
    """
    pic50_map = dict(zip(act_df["molecule_chembl_id"], act_df["pIC50"]))
    ids       = list(pic50_map.keys())
    pic50_arr = np.array([pic50_map[i] for i in ids])

    null_indices = []
    for _ in range(n_perm):
        shuffled   = rng.permutation(pic50_arr)
        perm_map   = dict(zip(ids, shuffled))
        residuals  = []
        for _, row in sq_df.iterrows():
            pA = perm_map.get(row["id_A"])
            pB = perm_map.get(row["id_B"])
            pC = perm_map.get(row["id_C"])
            pD = perm_map.get(row["id_D"])
            if any(x is None for x in [pA, pB, pC, pD]):
                continue
            residuals.append(abs(pA + pD - pB - pC))
        null_indices.append(float(np.mean(residuals)) if residuals else 0.0)

    return np.array(null_indices)


perm_results = {}

for name in TARGETS:
    sq_path  = f"{OUTPUT_DIR}/data/{name}_squares.csv"
    act_path = f"{OUTPUT_DIR}/data/{name}_actives.csv"
    if not os.path.exists(sq_path) or not os.path.exists(act_path):
        continue

    sq_df  = pd.read_csv(sq_path)
    act_df = pd.read_csv(act_path)
    if len(sq_df) == 0:
        continue

    real_index = float(sq_df["abs_residual"].mean())
    print(f"\n{name}: real index = {real_index:.4f}, "
          f"running {N_PERMUTATIONS} permutations ...")

    null_dist = permuted_nonadditivity(sq_df, act_df, N_PERMUTATIONS, rng)
    null_mean = float(null_dist.mean())
    null_std  = float(null_dist.std())
    p_value   = float((null_dist >= real_index).mean())
    z_score   = (real_index - null_mean) / (null_std + 1e-12)

    perm_results[name] = {
        "real_nonadditivity_index": round(real_index, 4),
        "null_mean":     round(null_mean, 4),
        "null_std":      round(null_std, 4),
        "null_min":      round(float(null_dist.min()), 4),
        "null_max":      round(float(null_dist.max()), 4),
        "null_pct_025":  round(float(np.percentile(null_dist, 2.5)), 4),
        "null_pct_975":  round(float(np.percentile(null_dist, 97.5)), 4),
        "z_score":       round(z_score, 4),
        "p_value":       round(p_value, 4),
        "n_squares":     len(sq_df),
        "significant_p05": p_value < 0.05,
    }
    print(f"  z = {z_score:.3f}, p = {p_value:.4f}")

with open(f"{OUTPUT_DIR}/data/permutation_results.json", "w") as f:
    json.dump(perm_results, f, indent=2)
print(json.dumps(perm_results, indent=2))
```

```bash
python scripts/04_permutation_test.py
```

**Expected time:** ~5–15 minutes (1,000 shuffles × up to 10 targets).
**Validation:** `permutation_results.json` must contain at least 4 targets.

---

## Step 6 — Cross-Target Correlation Permutation Test

Tests whether non-additivity clusters by target family (kinase, GPCR, nuclear
receptor, protease, epigenetic) more than random family label assignment would
predict. This is the second-level permutation test: for 1,000 random permutations
of family labels, recompute the Spearman rho(family rank, non-additivity index)
and compare to the real rho.

```python
# scripts/05_cross_target_test.py
import sys, os, warnings, json
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
warnings.filterwarnings("ignore")
import numpy as np
from scipy import stats
from params import OUTPUT_DIR, RANDOM_SEED, CORR_PERM_R

rng = np.random.default_rng(RANDOM_SEED)

TARGET_FAMILY = {
    "EGFR":     "kinase",
    "BRAF":     "kinase",
    "DRD2":     "GPCR",
    "ADRB2":    "GPCR",
    "ESR1":     "nuclear_receptor",
    "PPARG":    "nuclear_receptor",
    "THROMBIN": "protease",
    "MMP12":    "protease",
    "HDAC1":    "epigenetic",
    "BRD4":     "epigenetic",
}
FAMILY_INT = {f: i for i, f in enumerate(
    ["kinase", "GPCR", "nuclear_receptor", "protease", "epigenetic"])}

with open(f"{OUTPUT_DIR}/data/permutation_results.json") as f:
    perm_results = json.load(f)

# Only targets with valid squares
valid = [t for t in TARGET_FAMILY
         if t in perm_results and perm_results[t]["n_squares"] > 0]

if len(valid) < 4:
    print(f"WARNING: only {len(valid)} targets with squares. "
          "Cross-target test has limited power.")

real_indices  = np.array([perm_results[t]["real_nonadditivity_index"]
                           for t in valid])
family_labels = np.array([FAMILY_INT[TARGET_FAMILY[t]] for t in valid])

# Real Spearman rho
real_rho, real_p = stats.spearmanr(family_labels, real_indices)
print(f"\nReal Spearman rho(family, nonadditivity) = {real_rho:.4f} "
      f"(p = {real_p:.4f}, n = {len(valid)} targets)")

# Family means
families = sorted(set(TARGET_FAMILY[t] for t in valid))
family_means = {}
for fam in families:
    vals = [perm_results[t]["real_nonadditivity_index"]
            for t in valid if TARGET_FAMILY[t] == fam]
    if vals:
        family_means[fam] = round(float(np.mean(vals)), 4)
print("Family means:", family_means)

# Cross-target correlation permutation test
# Shuffle family labels R times, recompute rho, build null distribution
null_rhos = []
for _ in range(CORR_PERM_R):
    shuffled_labels = rng.permutation(family_labels)
    rho_null, _ = stats.spearmanr(shuffled_labels, real_indices)
    null_rhos.append(float(rho_null))

null_rhos = np.array(null_rhos)
p_corr    = float((np.abs(null_rhos) >= abs(real_rho)).mean())

print(f"\nCross-target correlation permutation test (R={CORR_PERM_R}):")
print(f"  Null mean = {null_rhos.mean():.4f}, SD = {null_rhos.std():.4f}")
print(f"  p-value (two-tailed) = {p_corr:.4f}")

# One-way ANOVA: between-family vs within-family variance
groups = [
    [perm_results[t]["real_nonadditivity_index"]
     for t in valid if TARGET_FAMILY[t] == fam]
    for fam in families
    if sum(1 for t in valid if TARGET_FAMILY[t] == fam) >= 2
]
if len(groups) >= 2:
    f_stat, p_anova = stats.f_oneway(*groups)
    print(f"  ANOVA: F = {f_stat:.3f}, p = {p_anova:.4f}")
else:
    f_stat = p_anova = float("nan")
    print("  ANOVA: insufficient groups")

result = {
    "targets_analysed":   valid,
    "n_targets":          len(valid),
    "real_spearman_rho":  round(real_rho, 4),
    "real_spearman_p":    round(real_p, 4),
    "family_means":       family_means,
    "null_mean_rho":      round(float(null_rhos.mean()), 4),
    "null_std_rho":       round(float(null_rhos.std()), 4),
    "null_pct_025":       round(float(np.percentile(null_rhos, 2.5)), 4),
    "null_pct_975":       round(float(np.percentile(null_rhos, 97.5)), 4),
    "p_value_corr_perm":  round(p_corr, 4),
    "f_stat_anova":       round(f_stat, 4) if not np.isnan(f_stat) else None,
    "p_value_anova":      round(p_anova, 4) if not np.isnan(p_anova) else None,
    "R":                  CORR_PERM_R,
}

with open(f"{OUTPUT_DIR}/data/cross_target_results.json", "w") as f:
    json.dump(result, f, indent=2)
print(json.dumps(result, indent=2))
```

```bash
python scripts/05_cross_target_test.py
```

**Validation:** `cross_target_results.json` must contain `real_spearman_rho`
and `p_value_corr_perm`.

---

## Step 7 — Figures

```python
# scripts/06_figures.py
import sys, os, warnings, json
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
warnings.filterwarnings("ignore")
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import pandas as pd
import numpy as np
from params import OUTPUT_DIR, TARGETS, RANDOM_SEED

with open(f"{OUTPUT_DIR}/data/permutation_results.json") as f:
    perm = json.load(f)
with open(f"{OUTPUT_DIR}/data/cross_target_results.json") as f:
    cross = json.load(f)

TARGET_FAMILY = {
    "EGFR": "kinase",     "BRAF": "kinase",
    "DRD2": "GPCR",       "ADRB2": "GPCR",
    "ESR1": "nuclear_receptor", "PPARG": "nuclear_receptor",
    "THROMBIN": "protease",     "MMP12": "protease",
    "HDAC1": "epigenetic",      "BRD4": "epigenetic",
}
FAMILY_COLOUR = {
    "kinase":           "#1565C0",
    "GPCR":             "#E53935",
    "nuclear_receptor": "#2E7D32",
    "protease":         "#F57F17",
    "epigenetic":       "#6A1B9A",
}

valid = [t for t in TARGETS if t in perm and perm[t]["n_squares"] > 0]

# Figure 1: Forest plot
fig, ax = plt.subplots(figsize=(10, max(4, len(valid) * 0.7)))
sorted_tgts = sorted(valid, key=lambda t: perm[t]["real_nonadditivity_index"])
for i, tgt in enumerate(sorted_tgts):
    r   = perm[tgt]
    col = FAMILY_COLOUR[TARGET_FAMILY[tgt]]
    ax.plot(r["real_nonadditivity_index"], i, "o", color=col, ms=9, zorder=3)
    ax.barh(i, r["null_pct_975"] - r["null_pct_025"],
            left=r["null_pct_025"], height=0.4,
            color=col, alpha=0.25, zorder=2)
ax.set_yticks(range(len(sorted_tgts)))
ax.set_yticklabels([f"{t} ({TARGET_FAMILY[t]})" for t in sorted_tgts])
ax.set_xlabel("Non-Additivity Index (mean |delta| in pIC50 units)")
ax.set_title("Per-Target Non-Additivity Index\n"
             "(dot = real, bar = permutation null 95% CI)")
patches = [mpatches.Patch(color=v, label=k, alpha=0.8)
           for k, v in FAMILY_COLOUR.items()
           if any(TARGET_FAMILY[t] == k for t in valid)]
ax.legend(handles=patches, fontsize=8, loc="lower right")
plt.tight_layout()
plt.savefig(f"{OUTPUT_DIR}/figures/forest_plot.png", dpi=150)
print("Saved forest_plot.png")

# Figure 2: Residual distributions
n_v = len(valid)
fig, axes = plt.subplots(1, n_v, figsize=(max(12, n_v * 2.5), 5), sharey=True)
if n_v == 1:
    axes = [axes]
for ax, tgt in zip(axes, sorted(valid)):
    sq_path = f"{OUTPUT_DIR}/data/{tgt}_squares.csv"
    if not os.path.exists(sq_path):
        continue
    sq_df = pd.read_csv(sq_path)
    col   = FAMILY_COLOUR[TARGET_FAMILY[tgt]]
    ax.hist(sq_df["additivity_residual"], bins=20, color=col,
            alpha=0.75, edgecolor="white", density=True)
    ax.axvline(0, color="black", ls="--", lw=1)
    ax.set_title(f"{tgt}\n(n={len(sq_df)})", fontsize=9)
    ax.set_xlabel("delta (pIC50)", fontsize=8)
axes[0].set_ylabel("Density")
plt.suptitle("Additivity Residual Distributions per Target", fontsize=11)
plt.tight_layout()
plt.savefig(f"{OUTPUT_DIR}/figures/residual_distributions.png", dpi=150)
print("Saved residual_distributions.png")

# Figure 3: Family mean comparison
fam_means = cross.get("family_means", {})
if fam_means:
    fig, ax = plt.subplots(figsize=(8, 4))
    fams  = sorted(fam_means.keys(), key=lambda f: -fam_means[f])
    vals  = [fam_means[f] for f in fams]
    cols  = [FAMILY_COLOUR.get(f, "#999999") for f in fams]
    bars  = ax.bar(fams, vals, color=cols, alpha=0.85, width=0.5)
    for bar, v in zip(bars, vals):
        ax.text(bar.get_x() + bar.get_width()/2, v + 0.003,
                f"{v:.3f}", ha="center", va="bottom", fontsize=9)
    ax.set_ylabel("Mean Non-Additivity Index")
    ax.set_title("Non-Additivity Index by Target Family")
    ax.tick_params(axis="x", rotation=15)
    plt.tight_layout()
    plt.savefig(f"{OUTPUT_DIR}/figures/family_comparison.png", dpi=150)
    print("Saved family_comparison.png")

# Figure 4: Correlation permutation null distribution
rng_plot = np.random.default_rng(RANDOM_SEED)
null_approx = rng_plot.normal(
    cross["null_mean_rho"], cross["null_std_rho"], 2000)
fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(null_approx, bins=40, color="#90A4AE", alpha=0.8,
        density=True, label=f"Null distribution (R={cross['R']})")
ax.axvline(cross["real_spearman_rho"], color="#E53935", lw=2.5,
           label=f"Real rho = {cross['real_spearman_rho']:.3f} "
                 f"(p = {cross['p_value_corr_perm']:.3f})")
ax.set_xlabel("Spearman rho (family rank vs non-additivity index)")
ax.set_ylabel("Density")
ax.set_title("Cross-Target Correlation Permutation Test")
ax.legend()
plt.tight_layout()
plt.savefig(f"{OUTPUT_DIR}/figures/correlation_permutation.png", dpi=150)
print("Saved correlation_permutation.png")
```

```bash
python scripts/06_figures.py
```

**Validation:** All four PNG files must exist in `mmp_output/figures/`.

---

## Step 8 — Compile Report

```python
# scripts/07_report.py
import sys, os, json, base64, pathlib
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
import pandas as pd
from jinja2 import Template
from params import OUTPUT_DIR

def img_b64(path):
    p = pathlib.Path(path)
    if not p.exists():
        return ""
    return base64.b64encode(p.read_bytes()).decode()

perm     = json.load(open(f"{OUTPUT_DIR}/data/permutation_results.json"))
cross    = json.load(open(f"{OUTPUT_DIR}/data/cross_target_results.json"))
dl_sum   = json.load(open(f"{OUTPUT_DIR}/data/download_summary.json"))

TARGET_FAMILY = {
    "EGFR": "kinase",     "BRAF": "kinase",
    "DRD2": "GPCR",       "ADRB2": "GPCR",
    "ESR1": "nuclear_receptor", "PPARG": "nuclear_receptor",
    "THROMBIN": "protease",     "MMP12": "protease",
    "HDAC1": "epigenetic",      "BRD4": "epigenetic",
}

valid = [t for t in TARGET_FAMILY if t in perm and perm[t]["n_squares"] > 0]
rows = sorted([{
    "target":      t,
    "family":      TARGET_FAMILY[t],
    "n_compounds": dl_sum.get(t, {}).get("n_curated", "N/A"),
    "n_squares":   perm[t]["n_squares"],
    "real_index":  perm[t]["real_nonadditivity_index"],
    "null_mean":   perm[t]["null_mean"],
    "null_95_lo":  perm[t]["null_pct_025"],
    "null_95_hi":  perm[t]["null_pct_975"],
    "z_score":     perm[t]["z_score"],
    "p_value":     perm[t]["p_value"],
    "significant": perm[t]["significant_p05"],
} for t in valid], key=lambda r: -r["real_index"])

# Pre-sort family means for Jinja2
family_means_sorted = sorted(
    cross.get("family_means", {}).items(),
    key=lambda x: -x[1]
)

TEMPLATE = """<!DOCTYPE html><html><head><meta charset="utf-8">
<title>MMP SAR Non-Additivity Audit</title>
<style>
  body{font-family:Georgia,serif;max-width:980px;margin:40px auto;line-height:1.6;color:#222}
  h1{color:#1a237e} h2{color:#283593;border-bottom:1px solid #ccc;padding-bottom:4px}
  table{border-collapse:collapse;width:100%}
  th,td{border:1px solid #ddd;padding:8px;font-size:0.9em}
  th{background:#e8eaf6}
  img{max-width:100%;border:1px solid #eee;border-radius:4px;margin:8px 0}
  .stat{font-size:1.8em;font-weight:bold;color:#1565c0}
  .label{font-size:0.82em;color:#555}
  .grid{display:grid;grid-template-columns:repeat(3,1fr);gap:14px;margin:20px 0}
  .card{background:#f5f5f5;border-radius:6px;padding:14px;text-align:center}
  .sig{color:#2E7D32;font-weight:bold} .ns{color:#999}
  .highlight{background:#e3f2fd;padding:12px;
             border-left:4px solid #1565c0;margin:12px 0}
</style></head><body>
<h1>MMP Transitivity: SAR Non-Additivity Across Target Families</h1>

<div class="grid">
  <div class="card"><div class="stat">{{ rows|length }}</div>
    <div class="label">Targets with MMP squares</div></div>
  <div class="card"><div class="stat">{{ total_squares }}</div>
    <div class="label">Total MMP squares</div></div>
  <div class="card"><div class="stat">{{ cross.real_spearman_rho }}</div>
    <div class="label">Cross-family rho<br>(p = {{ cross.p_value_corr_perm }})</div></div>
</div>

<div class="highlight">
  <strong>Key finding:</strong>
  Non-additivity indices range from
  {{ "%.3f"|format(rows[-1].real_index) }} to
  {{ "%.3f"|format(rows[0].real_index) }} pIC50 units across targets.
  Cross-target correlation permutation test: rho = {{ cross.real_spearman_rho }},
  p = {{ cross.p_value_corr_perm }}
  {% if cross.p_value_corr_perm < 0.05 %}
  — non-additivity clusters significantly by target family.
  {% else %}
  — insufficient evidence of family-level clustering at alpha = 0.05.
  {% endif %}
</div>

<h2>1. Per-Target Non-Additivity (Forest Plot)</h2>
<img src="data:image/png;base64,{{ forest_b64 }}" alt="Forest plot">

<h2>2. Summary Table</h2>
<table>
  <tr><th>Target</th><th>Family</th><th>Compounds</th><th>Squares</th>
      <th>Real index</th><th>Null mean</th><th>Null 95% CI</th>
      <th>z</th><th>p-value</th></tr>
  {% for r in rows %}
  <tr>
    <td>{{ r.target }}</td><td>{{ r.family }}</td>
    <td>{{ r.n_compounds }}</td><td>{{ r.n_squares }}</td>
    <td>{{ "%.4f"|format(r.real_index) }}</td>
    <td>{{ "%.4f"|format(r.null_mean) }}</td>
    <td>[{{ "%.3f"|format(r.null_95_lo) }}, {{ "%.3f"|format(r.null_95_hi) }}]</td>
    <td>{{ "%.2f"|format(r.z_score) }}</td>
    <td class="{{ 'sig' if r.significant else 'ns' }}">
      {{ "%.4f"|format(r.p_value) }}{{ '*' if r.significant else '' }}</td>
  </tr>
  {% endfor %}
</table>

<h2>3. Additivity Residual Distributions</h2>
<img src="data:image/png;base64,{{ dist_b64 }}" alt="Residual distributions">

<h2>4. Non-Additivity by Target Family</h2>
<img src="data:image/png;base64,{{ fam_b64 }}" alt="Family comparison">
{% if family_means_sorted %}
<table>
  <tr><th>Family</th><th>Mean non-additivity index</th></tr>
  {% for fam, val in family_means_sorted %}
  <tr><td>{{ fam }}</td><td>{{ val }}</td></tr>
  {% endfor %}
</table>
{% endif %}

<h2>5. Cross-Target Correlation Permutation Test</h2>
<img src="data:image/png;base64,{{ corr_b64 }}" alt="Correlation permutation">
<table>
  <tr><th>Metric</th><th>Value</th></tr>
  <tr><td>Real Spearman rho</td><td>{{ cross.real_spearman_rho }}</td></tr>
  <tr><td>Null mean rho</td><td>{{ cross.null_mean_rho }}</td></tr>
  <tr><td>Null SD</td><td>{{ cross.null_std_rho }}</td></tr>
  <tr><td>Null 95% CI</td>
      <td>[{{ cross.null_pct_025 }}, {{ cross.null_pct_975 }}]</td></tr>
  <tr><td>p-value (permutation, two-tailed)</td>
      <td>{{ cross.p_value_corr_perm }}</td></tr>
  {% if cross.p_value_anova %}
  <tr><td>ANOVA F-statistic</td><td>{{ cross.f_stat_anova }}</td></tr>
  <tr><td>ANOVA p-value</td><td>{{ cross.p_value_anova }}</td></tr>
  {% endif %}
</table>
</body></html>"""

html = Template(TEMPLATE).render(
    rows                = rows,
    total_squares       = sum(r["n_squares"] for r in rows),
    cross               = cross,
    family_means_sorted = family_means_sorted,
    forest_b64 = img_b64(f"{OUTPUT_DIR}/figures/forest_plot.png"),
    dist_b64   = img_b64(f"{OUTPUT_DIR}/figures/residual_distributions.png"),
    fam_b64    = img_b64(f"{OUTPUT_DIR}/figures/family_comparison.png"),
    corr_b64   = img_b64(f"{OUTPUT_DIR}/figures/correlation_permutation.png"),
)

out = f"{OUTPUT_DIR}/report.html"
pathlib.Path(out).write_text(html, encoding="utf-8")
print(f"Report saved to {out}")
```

```bash
python scripts/07_report.py
```

**Expected final output:** `mmp_output/report.html` — open in any browser.

---

## Expected Outputs

```
mmp_output/
├── data/
│   ├── {TARGET}_actives.csv       # curated actives per target
│   ├── {TARGET}_mmps.csv          # MMP pairs per target
│   ├── {TARGET}_squares.csv       # MMP squares with additivity residuals
│   ├── download_summary.json
│   ├── mmp_counts.json
│   ├── square_stats.json
│   ├── permutation_results.json   # within-target permutation null
│   └── cross_target_results.json  # cross-target correlation test
├── figures/
│   ├── forest_plot.png
│   ├── residual_distributions.png
│   ├── family_comparison.png
│   └── correlation_permutation.png
└── report.html                    <- primary deliverable
```

---

## Reproducing with Different Targets

Edit `params.py`:
- Replace any entry in `TARGETS` with a different ChEMBL target ID
- Lower `ACTIVE_THRESHOLD` to 1000 nM for tighter actives-only analysis
- Increase `MIN_COMPOUNDS` to require richer data before including a target
- Adjust `TANIMOTO_MMP` (0.7 for more pairs, 0.9 for tighter matches)
- All outputs regenerate automatically