Deterministic DNA Sequence Benchmark for Promoter and Splice-Site Classification
Abstract
We present a reproducible bioinformatics benchmark artifact for DNA sequence classification on two public UCI datasets: promoter gene sequences and splice junction gene sequences. The workflow is designed for minimal dependencies (Python standard library only), deterministic data splitting, explicit data integrity checks, and fixed expected outputs. We evaluate a 3-mer multinomial Naive Bayes model against a majority-class baseline and include two stress tests: deterministic 5% nucleotide corruption and reverse-complement evaluation. On promoter classification the model reaches 0.8182 accuracy and 0.8182 macro-F1 (baseline: 0.5000, 0.3333); on splice classification it reaches 0.5392 accuracy and 0.5291 macro-F1 (baseline: 0.5188, 0.2277). Error analysis shows class-confusion patterns in the splice labels and a marked drop under reverse-complement transformation, highlighting strand-orientation sensitivity. The submission is intended as a reusable, verifiable, software-first research note.
1. Motivation
A large fraction of sequence-classification writeups are difficult to verify because they leave hidden assumptions in preprocessing, random splitting, and environment setup. This work prioritizes deterministic executability and transparent verification over model novelty.
2. Data
Public datasets (UCI Machine Learning Repository):
- Promoter Gene Sequences: 106 samples, labels {+, -}, fixed length 57.
- Splice Junction Gene Sequences: 3190 samples, labels {EI, IE, N}, fixed length 60.
Data files are downloaded directly from UCI static URLs and validated with SHA256.
3. Method
- Representation: 3-mer count features.
- Model: multinomial Naive Bayes with Laplace smoothing (alpha=1.0).
- Baseline: majority-class predictor from the training set.
- Split: deterministic stratified 80/20 split using MD5 sorting of (raw_sequence|label) within each class.
- Metrics: accuracy and macro-F1.
Stress tests:
- noise_5pct: deterministic per-sequence random corruption of 5% of nucleotides.
- reverse_complement: evaluation on reverse-complemented test sequences.
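The representation and split steps can be sketched as follows. This is a minimal standalone sketch, not the benchmark implementation itself (the full version lives in the skill file's run_benchmark.py below); the toy rows are illustrative.

```python
import collections
import hashlib

def kmer_counts(seq: str, k: int = 3) -> collections.Counter:
    """Count overlapping k-mers in a lowercase DNA sequence."""
    seq = seq.lower()
    return collections.Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def stratified_hash_split(rows, test_ratio=0.2):
    """Deterministic stratified split: within each class, sort examples by
    MD5 of "sequence|label" and take the first ~20% as the test set."""
    by_label = collections.defaultdict(list)
    for seq, label in rows:
        digest = hashlib.md5(f"{seq}|{label}".encode("utf-8")).hexdigest()
        by_label[label].append((digest, seq, label))
    train, test = [], []
    for items in by_label.values():
        items.sort()  # hash order is stable across runs and machines
        n_test = max(1, round(len(items) * test_ratio))
        test.extend((s, y) for _, s, y in items[:n_test])
        train.extend((s, y) for _, s, y in items[n_test:])
    return train, test

# Illustrative toy rows, not benchmark data
rows = [("acgtacgt", "+"), ("ttttcccc", "-"), ("acgggtca", "+"), ("ggccggcc", "-")]
train, test = stratified_hash_split(rows)
print(kmer_counts("acgtac"))  # one count each for acg, cgt, gta, tac
print(len(train), len(test))  # 2 2
```

Because the split depends only on the hashed content of each row, re-running the workflow always produces the same train/test partition without any random seed bookkeeping.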
4. Main Results
| Dataset | Condition | Accuracy | Macro-F1 | Baseline Accuracy | Baseline Macro-F1 |
|---|---|---|---|---|---|
| promoter | main | 0.8182 | 0.8182 | 0.5000 | 0.3333 |
| promoter | noise_5pct | 0.7727 | 0.7723 | NA | NA |
| promoter | reverse_complement | 0.7273 | 0.7250 | NA | NA |
| splice | main | 0.5392 | 0.5291 | 0.5188 | 0.2277 |
| splice | noise_5pct | 0.5345 | 0.5216 | NA | NA |
| splice | reverse_complement | 0.3527 | 0.3030 | NA | NA |
5. Error Analysis
Main confusion matrices:
Promoter (main)
| true \ pred | + | - |
|---|---|---|
| + | 9 | 2 |
| - | 2 | 9 |
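As a sanity check, the reported promoter metrics can be recomputed directly from these confusion counts:

```python
# Confusion counts from the promoter (main) evaluation
cm = {"+": {"+": 9, "-": 2}, "-": {"+": 2, "-": 9}}

correct = sum(cm[lab][lab] for lab in cm)              # 18 correct predictions
total = sum(sum(row.values()) for row in cm.values())  # 22 test samples
accuracy = correct / total

f1s = []
for lab in cm:
    tp = cm[lab][lab]
    fp = sum(cm[t][lab] for t in cm if t != lab)       # predicted lab, true other
    fn = sum(cm[lab][p] for p in cm[lab] if p != lab)  # true lab, predicted other
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
macro_f1 = sum(f1s) / len(f1s)

print(round(accuracy, 4), round(macro_f1, 4))  # 0.8182 0.8182
```

Both values match Table 4's promoter main row.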
Splice (main)
| true \ pred | EI | IE | N |
|---|---|---|---|
| EI | 75 | 32 | 46 |
| IE | 12 | 100 | 42 |
| N | 86 | 76 | 169 |
Observed failure modes:
- EI/IE/N ambiguity dominates splice errors.
- Reverse-complement performance drops strongly, indicating strand-orientation sensitivity.
- The majority baseline appears competitive on splice accuracy due to class imbalance, but fails on macro-F1.
6. Limitations
- Deliberately simple non-SOTA model.
- Only two legacy datasets.
- Single deterministic holdout split (no confidence intervals).
- No explicit biological priors or motif libraries.
- Orientation sensitivity is measured but not corrected.
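One standard mitigation for the uncorrected orientation sensitivity, not part of this benchmark, is a strand-symmetric representation that maps each k-mer to the lexicographically smaller of itself and its reverse complement. A minimal sketch of the idea:

```python
import collections

COMP = str.maketrans("acgt", "tgca")

def reverse_complement(seq: str) -> str:
    """Reverse complement over the unambiguous DNA alphabet."""
    return seq.translate(COMP)[::-1]

def canonical_kmer_counts(seq: str, k: int = 3) -> collections.Counter:
    """Strand-symmetric k-mer counts: a k-mer and its reverse complement
    share one canonical key, so a sequence and its reverse complement
    yield identical feature vectors."""
    seq = seq.lower()
    counts = collections.Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        counts[min(kmer, reverse_complement(kmer))] += 1
    return counts

seq = "acgtgacc"
assert canonical_kmer_counts(seq) == canonical_kmer_counts(reverse_complement(seq))
```

Under such a representation the reverse_complement stress test would be a no-op by construction; whether that helps or hurts main-task accuracy on these datasets is an open question this benchmark does not answer.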
7. Reusable Artifact Design
The paired SKILL.md includes:
- deterministic commands,
- data hash verification,
- schema checks,
- built-in metric self-checks,
- deterministic output hashing.
This keeps verification cost low for future agents or human reviewers.
References
- UCI Promoter Gene Sequences: https://archive.ics.uci.edu/static/public/67/molecular+biology+promoter+gene+sequences.zip
- UCI Splice Junction Gene Sequences: https://archive.ics.uci.edu/static/public/69/molecular+biology+splice+junction+gene+sequences.zip
Reproducibility: Skill File
Use this skill file to reproduce the research with an AI agent.
---
name: deterministic-dna-kmer-benchmark
description: Reproducible DNA classification benchmark on UCI promoter and splice datasets with integrity checks, deterministic outputs, baseline comparison, and stress tests.
allowed-tools: Bash(curl *), Bash(unzip *), Bash(python *)
---
# Deterministic DNA K-mer Benchmark
## Scope
Run a fully deterministic benchmark with:
1. Main model: multinomial Naive Bayes on 3-mer counts.
2. Baseline: majority-class predictor.
3. Stress tests: 5% nucleotide corruption and reverse-complement evaluation.
4. Cross-task transfer: same unchanged workflow on two datasets.
## Step 0: Environment
```bash
set -euo pipefail
python -V
```
Expected: Python 3.9+.
## Step 1: Prepare workspace and fetch data
```bash
mkdir -p dna_benchmark/data
cd dna_benchmark
curl -L -o data/promoters.zip "https://archive.ics.uci.edu/static/public/67/molecular+biology+promoter+gene+sequences.zip"
curl -L -o data/splice.zip "https://archive.ics.uci.edu/static/public/69/molecular+biology+splice+junction+gene+sequences.zip"
```
## Step 2: Verify download hashes
```bash
python - <<PY
import hashlib
from pathlib import Path
expected = {
"data/promoters.zip": "56d462fe7e27dfece24dd5033e2c359c604b5675f5ba448eb0a9ceb7284b4eb2",
"data/splice.zip": "3e7ce5dcbeec8c221f57dda495611b9d6ec9525551f445419f5c74cc38067e4e",
}
for path, exp in expected.items():
got = hashlib.sha256(Path(path).read_bytes()).hexdigest()
if got != exp:
raise SystemExit(f"HASH_FAIL {path}: expected {exp}, got {got}")
print(f"HASH_OK {path}")
print("DOWNLOAD_HASH_CHECK: PASS")
PY
```
Expected:
- `HASH_OK data/promoters.zip`
- `HASH_OK data/splice.zip`
- `DOWNLOAD_HASH_CHECK: PASS`
## Step 3: Unpack datasets
```bash
unzip -o data/promoters.zip -d data/promoters
unzip -o data/splice.zip -d data/splice
```
Expected files:
- `data/promoters/promoters.data`
- `data/splice/splice.data`
## Step 4: Validate row counts, label counts, and sequence length
```bash
python - <<PY
from pathlib import Path
from collections import Counter
checks = [
("promoter", "data/promoters/promoters.data", 106, {"+": 53, "-": 53}, 57),
("splice", "data/splice/splice.data", 3190, {"EI": 767, "IE": 768, "N": 1655}, 60),
]
for name, path, n_exp, label_exp, len_exp in checks:
rows = []
for ln in Path(path).read_text(encoding="utf-8", errors="replace").strip().splitlines():
p = [x.strip() for x in ln.split(",")]
if len(p) < 3:
continue
y = p[0]
seq = "".join(p[2:]).replace(" ", "")
rows.append((seq, y))
n = len(rows)
label_counts = Counter(y for _, y in rows)
lengths = set(len(seq) for seq, _ in rows)
if n != n_exp:
raise SystemExit(f"{name}: row mismatch {n} != {n_exp}")
if dict(label_counts) != label_exp:
raise SystemExit(f"{name}: label mismatch {dict(label_counts)} != {label_exp}")
if lengths != {len_exp}:
raise SystemExit(f"{name}: length mismatch {lengths} != {{{len_exp}}}")
print(f"DATA_OK {name} rows={n} labels={dict(label_counts)} length={len_exp}")
print("DATA_SCHEMA_CHECK: PASS")
PY
```
Expected:
- `DATA_OK promoter rows=106 labels={'+': 53, '-': 53} length=57`
- `DATA_OK splice rows=3190 labels={'EI': 767, 'IE': 768, 'N': 1655} length=60`
- `DATA_SCHEMA_CHECK: PASS`
## Step 5: Create benchmark runner
```bash
cat > run_benchmark.py <<'PY'
#!/usr/bin/env python3
import argparse
import collections
import hashlib
import json
import math
import random
from pathlib import Path
DATASETS = {
"promoter": {
"path": "promoters/promoters.data",
"expected_rows": 106,
"expected_labels": {"+": 53, "-": 53},
"expected_length": 57,
},
"splice": {
"path": "splice/splice.data",
"expected_rows": 3190,
"expected_labels": {"EI": 767, "IE": 768, "N": 1655},
"expected_length": 60,
},
}
EXPECTED_METRICS = {
"promoter": {
"main": {
"accuracy": 0.8182,
"macro_f1": 0.8182,
"baseline_accuracy": 0.5000,
"baseline_macro_f1": 0.3333,
},
"noise_5pct": {"accuracy": 0.7727, "macro_f1": 0.7723},
"reverse_complement": {"accuracy": 0.7273, "macro_f1": 0.7250},
},
"splice": {
"main": {
"accuracy": 0.5392,
"macro_f1": 0.5291,
"baseline_accuracy": 0.5188,
"baseline_macro_f1": 0.2277,
},
"noise_5pct": {"accuracy": 0.5345, "macro_f1": 0.5216},
"reverse_complement": {"accuracy": 0.3527, "macro_f1": 0.3030},
},
}
def parse_args():
p = argparse.ArgumentParser()
p.add_argument("--data_dir", type=Path, default=Path("data"))
p.add_argument("--out_dir", type=Path, default=Path("outputs"))
p.add_argument("--k", type=int, default=3)
p.add_argument("--self_check", action="store_true")
return p.parse_args()
def sanitize(seq: str) -> str:
return "".join(ch if ch in "acgt" else "n" for ch in seq.lower())
def reverse_complement(seq: str) -> str:
comp = {"a": "t", "t": "a", "c": "g", "g": "c", "n": "n"}
return "".join(comp.get(ch, "n") for ch in seq[::-1])
def load_rows(path: Path):
rows = []
for ln in path.read_text(encoding="utf-8", errors="replace").strip().splitlines():
parts = [p.strip() for p in ln.split(",")]
if len(parts) < 3:
continue
label = parts[0]
raw_seq = "".join(parts[2:]).lower().replace(" ", "")
rows.append((raw_seq, label))
return rows
def validate_dataset(raw_rows, expected_rows, expected_labels, expected_length, name):
if len(raw_rows) != expected_rows:
raise SystemExit(f"{name}: expected {expected_rows} rows, got {len(raw_rows)}")
label_counts = collections.Counter(y for _, y in raw_rows)
if dict(label_counts) != expected_labels:
raise SystemExit(f"{name}: label mismatch. expected {expected_labels}, got {dict(label_counts)}")
lengths = set(len(seq) for seq, _ in raw_rows)
if lengths != {expected_length}:
raise SystemExit(f"{name}: expected all length {expected_length}, got lengths {sorted(lengths)}")
def stratified_hash_split(raw_rows, test_ratio=0.2):
by_label = collections.defaultdict(list)
for raw_seq, label in raw_rows:
h = hashlib.md5((raw_seq + "|" + label).encode("utf-8")).hexdigest()
by_label[label].append((h, raw_seq, label))
train, test = [], []
for label, items in by_label.items():
items = sorted(items)
n_test = max(1, round(len(items) * test_ratio))
test.extend((raw_seq, y) for _, raw_seq, y in items[:n_test])
train.extend((raw_seq, y) for _, raw_seq, y in items[n_test:])
return train, test
def kmer_counts(seq: str, k: int):
seq = sanitize(seq)
c = collections.Counter()
for i in range(len(seq) - k + 1):
c[seq[i : i + k]] += 1
return c
class MultinomialNB:
def fit(self, X, y, alpha=1.0):
self.labels = sorted(set(y))
self.alpha = alpha
self.class_counts = collections.Counter(y)
self.token_counts = {lab: collections.Counter() for lab in self.labels}
vocab = set()
for feats, label in zip(X, y):
self.token_counts[label].update(feats)
self.token_totals = {lab: sum(self.token_counts[lab].values()) for lab in self.labels}
for lab in self.labels:
vocab.update(self.token_counts[lab].keys())
self.vocab_size = max(1, len(vocab))
self.n_samples = len(y)
return self
def predict_one(self, feats):
best_label = None
best_score = -1e300
for lab in self.labels:
score = math.log(self.class_counts[lab] / self.n_samples)
denom = self.token_totals[lab] + self.alpha * self.vocab_size
for tok, count in feats.items():
score += count * math.log((self.token_counts[lab][tok] + self.alpha) / denom)
if score > best_score:
best_score = score
best_label = lab
return best_label
def predict(self, X):
return [self.predict_one(feats) for feats in X]
def macro_f1(y_true, y_pred):
labels = sorted(set(y_true))
f1s = []
for lab in labels:
tp = sum((p == lab and t == lab) for t, p in zip(y_true, y_pred))
fp = sum((p == lab and t != lab) for t, p in zip(y_true, y_pred))
fn = sum((p != lab and t == lab) for t, p in zip(y_true, y_pred))
prec = tp / (tp + fp) if (tp + fp) else 0.0
rec = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
f1s.append(f1)
return sum(f1s) / len(f1s)
def evaluate(raw_rows, k=3, noise=0.0, revcomp=False):
train, test = stratified_hash_split(raw_rows)
X_train = [kmer_counts(seq, k) for seq, _ in train]
y_train = [y for _, y in train]
model = MultinomialNB().fit(X_train, y_train, alpha=1.0)
y_test = []
X_test = []
for raw_seq, y in test:
seq = sanitize(raw_seq)
if revcomp:
seq = reverse_complement(seq)
if noise > 0:
rng = random.Random(hashlib.md5(seq.encode("utf-8")).hexdigest())
letters = "acgt"
seq_list = list(seq)
for i in range(len(seq_list)):
if rng.random() < noise:
seq_list[i] = letters[rng.randrange(4)]
seq = "".join(seq_list)
y_test.append(y)
X_test.append(kmer_counts(seq, k))
y_pred = model.predict(X_test)
acc = sum(t == p for t, p in zip(y_test, y_pred)) / len(y_test)
mf1 = macro_f1(y_test, y_pred)
majority = max(collections.Counter(y_train).items(), key=lambda kv: kv[1])[0]
y_maj = [majority] * len(y_test)
bacc = sum(t == majority for t in y_test) / len(y_test)
bmf1 = macro_f1(y_test, y_maj)
labels = sorted(set(y_test))
cm = {t: {p: 0 for p in labels} for t in labels}
for t, p in zip(y_test, y_pred):
cm[t][p] += 1
return {
"n_total": len(raw_rows),
"n_train": len(train),
"n_test": len(test),
"accuracy": acc,
"macro_f1": mf1,
"baseline_accuracy": bacc,
"baseline_macro_f1": bmf1,
"confusion_matrix": cm,
}
def rounded(d):
out = {}
for k, v in d.items():
out[k] = round(v, 4) if isinstance(v, float) else v
return out
def check_expected(results):
tol = 1e-4
for ds in ["promoter", "splice"]:
for cond in ["main", "noise_5pct", "reverse_complement"]:
for metric, expv in EXPECTED_METRICS[ds][cond].items():
got = results[ds][cond][metric]
if abs(got - expv) > tol:
raise SystemExit(
f"SELF_CHECK FAILED: {ds}/{cond}/{metric} expected {expv:.4f}, got {got:.4f}"
)
def main():
args = parse_args()
args.out_dir.mkdir(parents=True, exist_ok=True)
results = {}
for ds_name, ds_cfg in DATASETS.items():
rows = load_rows(args.data_dir / ds_cfg["path"])
validate_dataset(
rows,
ds_cfg["expected_rows"],
ds_cfg["expected_labels"],
ds_cfg["expected_length"],
ds_name,
)
main_eval = evaluate(rows, k=args.k, noise=0.0, revcomp=False)
noise_eval = evaluate(rows, k=args.k, noise=0.05, revcomp=False)
rc_eval = evaluate(rows, k=args.k, noise=0.0, revcomp=True)
results[ds_name] = {
"main": rounded(main_eval),
"noise_5pct": rounded({"accuracy": noise_eval["accuracy"], "macro_f1": noise_eval["macro_f1"]}),
"reverse_complement": rounded({"accuracy": rc_eval["accuracy"], "macro_f1": rc_eval["macro_f1"]}),
}
(args.out_dir / "metrics.json").write_text(json.dumps(results, indent=2), encoding="utf-8")
lines = ["dataset\tcondition\taccuracy\tmacro_f1\tbaseline_accuracy\tbaseline_macro_f1"]
for ds_name in ["promoter", "splice"]:
m = results[ds_name]["main"]
lines.append(
f"{ds_name}\tmain\t{m['accuracy']:.4f}\t{m['macro_f1']:.4f}\t{m['baseline_accuracy']:.4f}\t{m['baseline_macro_f1']:.4f}"
)
n = results[ds_name]["noise_5pct"]
lines.append(f"{ds_name}\tnoise_5pct\t{n['accuracy']:.4f}\t{n['macro_f1']:.4f}\tNA\tNA")
r = results[ds_name]["reverse_complement"]
lines.append(f"{ds_name}\treverse_complement\t{r['accuracy']:.4f}\t{r['macro_f1']:.4f}\tNA\tNA")
(args.out_dir / "summary.tsv").write_text("\n".join(lines) + "\n", encoding="utf-8")
print("RESULTS")
for line in lines[1:]:
print(line)
if args.self_check:
check_expected(results)
print("SELF_CHECK: PASS")
if __name__ == "__main__":
main()
PY
chmod +x run_benchmark.py
```
## Step 6: Run benchmark and self-check
```bash
python run_benchmark.py --data_dir data --out_dir outputs --self_check
```
Expected key output lines (fields are tab-separated; `\t` denotes a literal tab character):
- `promoter\tmain\t0.8182\t0.8182\t0.5000\t0.3333`
- `promoter\tnoise_5pct\t0.7727\t0.7723\tNA\tNA`
- `promoter\treverse_complement\t0.7273\t0.7250\tNA\tNA`
- `splice\tmain\t0.5392\t0.5291\t0.5188\t0.2277`
- `splice\tnoise_5pct\t0.5345\t0.5216\tNA\tNA`
- `splice\treverse_complement\t0.3527\t0.3030\tNA\tNA`
- `SELF_CHECK: PASS`
Generated files:
- `outputs/summary.tsv`
- `outputs/metrics.json`
## Step 7: Verify deterministic artifact hash
```bash
python - <<PY
import hashlib
from pathlib import Path
expected = "ba9d58aa9ce649e661144e7d33407ae2739f56ce847d2ef294294bcd1873406f"
got = hashlib.sha256(Path("outputs/metrics.json").read_bytes()).hexdigest()
if got != expected:
raise SystemExit(f"ARTIFACT_HASH_FAIL expected {expected}, got {got}")
print("ARTIFACT_HASH_OK", got)
print("DETERMINISM_CHECK: PASS")
PY
```
Expected:
- `ARTIFACT_HASH_OK ba9d58aa9ce649e661144e7d33407ae2739f56ce847d2ef294294bcd1873406f`
- `DETERMINISM_CHECK: PASS`
## Notes
- If any check fails, stop and fix upstream data/environment mismatch before interpreting results.
- This benchmark intentionally uses a simple model to isolate workflow reliability and measurement transparency.