Do Closed-Source Language Models Get Worse After Release? A Longitudinal Study with LiveBench and Arena Signals
Introduction
People often say that a language model gets worse after release. This claim conflates two different ideas:
- objective benchmark change
- subjective user-facing change
We study these two ideas separately and focus on closed-source models as the main target. Open-weight models are used only as a clean objective control group.
Method
Objective change
We use official LiveBench public snapshot tables and track the same public model labels over time. Within each release, we standardize task scores to remove release-level scale differences, then estimate fixed-effects time-trend regressions by model age since release.
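The two steps above can be sketched in a few lines. This is a toy illustration, not the actual pipeline: the rows, model names, and column layout are hypothetical, and the real analysis also absorbs task fixed effects.

```python
import statistics

# Hypothetical toy panel: (release, model, age_months_since_release, raw_score).
rows = [
    ("2024_06", "m1", 0.0, 70.0), ("2024_06", "m1", 2.0, 66.0),
    ("2024_06", "m2", 0.0, 60.0), ("2024_06", "m2", 2.0, 58.0),
    ("2024_08", "m1", 2.0, 80.0), ("2024_08", "m2", 4.0, 71.0),
]

# Step 1: z-standardize scores within each benchmark release to remove
# release-level scale differences.
by_release = {}
for rel, model, age, score in rows:
    by_release.setdefault(rel, []).append(score)
stats = {rel: (statistics.fmean(v), statistics.pstdev(v)) for rel, v in by_release.items()}
z_rows = [
    (model, age, (score - stats[rel][0]) / stats[rel][1] if stats[rel][1] else 0.0)
    for rel, model, age, score in rows
]

# Step 2: absorb model fixed effects by demeaning within model, then regress
# the demeaned z-score on demeaned age; the slope is the within-model trend.
by_model = {}
for model, age, z in z_rows:
    by_model.setdefault(model, []).append((age, z))
x, y = [], []
for pairs in by_model.values():
    ma = statistics.fmean(a for a, _ in pairs)
    mz = statistics.fmean(z for _, z in pairs)
    for a, z in pairs:
        x.append(a - ma)
        y.append(z - mz)
beta = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
print(f"within-model trend beta = {beta:.3f}")
```

In this toy data the within-model slope comes out negative, mirroring the direction of the paper's main result.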
Subjective change
We use two Arena-based signals:
- monthly leaderboard rating history from `arena-catalog`
- monthly pairwise preference win rate from `lmarena-ai/arena-human-preference-140k`
The main subjective variable is the leaderboard monthly rating z-score. Pairwise preference is used only as a robustness check.
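The monthly rating z-score is computed within each monthly leaderboard snapshot, so ratings are comparable across months. A minimal sketch with a hypothetical snapshot (model names and ratings invented):

```python
import statistics

# Hypothetical monthly leaderboard snapshot: model -> arena rating.
snapshot = {"m1": 1290.0, "m2": 1250.0, "m3": 1210.0}

# Standardize each rating against the month's mean and population std dev.
ratings = list(snapshot.values())
mean = statistics.fmean(ratings)
std = statistics.pstdev(ratings)
z = {m: (r - mean) / std for m, r in snapshot.items()}
print(z)
```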
Joint comparison
For closed-source models, we align model-month observations and test whether the objective signal helps explain the subjective signal.
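The joint test can be sketched as a within-model regression of the subjective signal on the objective signal over aligned model-months. The panel below is hypothetical and only illustrates the alignment and demeaning step:

```python
import statistics

# Hypothetical aligned model-month panel for closed-source models:
# (model, month_index, objective_z, subjective_z).
panel = [
    ("m1", 0, 0.9, 1.1), ("m1", 1, 0.7, 1.0), ("m1", 2, 0.4, 0.8),
    ("m2", 0, -0.2, 0.1), ("m2", 1, -0.4, -0.1), ("m2", 2, -0.5, -0.3),
]

# Demean within model to absorb model fixed effects, then regress subjective
# on objective; the slope tests whether month-level objective change helps
# explain month-level subjective change.
by_model = {}
for model, _, obj, sub in panel:
    by_model.setdefault(model, []).append((obj, sub))
x, y = [], []
for pairs in by_model.values():
    mo = statistics.fmean(o for o, _ in pairs)
    ms = statistics.fmean(s for _, s in pairs)
    for o, s in pairs:
        x.append(o - mo)
        y.append(s - ms)
slope = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
print(f"objective -> subjective slope = {slope:.3f}")
```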
Data
- Objective source: LiveBench official snapshot tables
- Subjective source A:
lmarena-ai/arena-human-preference-140k - Subjective source B:
arena-cataloghistory through GitHub commits
Current scope:
- 10 LiveBench releases
- 281 objective models
- 21 closed-source main-analysis models
- 44 open-weight objective control models
For executable cold-start reproduction, the submission skill uses a stricter public subset with stable timestamps and reproduces the core direction of the closed-source objective and leaderboard results.
Results
Main regression results from the current run:
- Closed-source objective trend: `beta = -0.1063`, `p = 0.0000`
- Open-weight objective control: `beta = -0.1202`, `p = 0.0000`
- Closed-source subjective leaderboard trend: `beta = -0.0561`, `p = 0.0000`
- Closed-source pairwise trend: `beta = -0.0050`, `p = 0.4793`
- Joint closed-source regression: `objective_score` is not significant for the main subjective signal
Interpretation
The current evidence supports an objective post-release decline for closed-source models in this longitudinal setup. The main subjective leaderboard signal also declines. However, the pairwise preference signal is weaker, and the month-level link from objective to subjective change is not stable. The safest conclusion is therefore:
Closed-source models show objective decline, but subjective decline is not uniform across subjective measures.
Limits
- closed-source backends can still change without full visibility
- arena leaderboard history is sparse
- pairwise preference covers a shorter time window
- benchmark mix and difficulty can still shift over time
Reproducibility
This submission includes a runnable SKILL.md, fixed output paths, a reproducibility check script, and a LaTeX note. The skill is designed for Codex execution. The executable contract is intentionally bounded: it reproduces the core closed-source objective trend and the main leaderboard-based subjective trend from fully public sources, while pairwise preference remains a separate robustness result in the paper.
Reproducibility: Skill File
Use this skill file to reproduce the research with an AI agent.
---
name: codex-closed-llm-drift-core-repro
description: Reproduce the core closed-source post-release drift result with Codex using only public LiveBench snapshots and arena leaderboard history.
allowed-tools: Bash(python3 *), Bash(curl *), Bash(mkdir *), WebFetch
---
# Goal
Reproduce the core public-data claim of this submission:
1. closed-source models show declining objective performance after release
2. the main subjective leaderboard signal also declines
This executable contract is intentionally bounded. It reproduces the paper's core direction on a strict, timestamp-stable closed-source subset using fully public sources. Pairwise preference remains a robustness result in the paper, not part of the cold-start executable contract.
The intended execution environment is `Codex`.
## Inputs
- public LiveBench snapshot tables from `livebench.github.io`
- public `arena-catalog` leaderboard history from GitHub commits
## Execution
Create a fresh workspace and run the Python source below. The script writes `output/results.json` and exits with failure if the core result is not reproduced.
```python
#!/usr/bin/env python3
from __future__ import annotations

import csv
import hashlib
import json
import math
import re
import statistics
import urllib.parse
import urllib.request
from collections import defaultdict
from datetime import datetime
from pathlib import Path

LIVEBENCH_DATES = [
    "2024_06_24",
    "2024_07_26",
    "2024_08_31",
    "2024_11_25",
    "2025_04_02",
    "2025_04_25",
    "2025_05_30",
    "2025_11_25",
    "2025_12_23",
    "2026_01_08",
]
ARENA_REPO = "lmarena/arena-catalog"
ARENA_PATH = "data/leaderboard-text.json"
ARENA_CATEGORY = "full"
RELEASE_DATE_OVERRIDES = {
    "command-r-08-2024": "2024-08-30",
    "command-r-plus-08-2024": "2024-08-30",
    "grok-4-0709": "2025-07-09",
    "claude-3-7-sonnet-20250219-base": "2025-02-19",
    "chatgpt-4o-latest-2025-03-27": "2025-03-27",
    "gpt-3.5-turbo-0125": "2024-01-25",
    "gpt-4-0125-preview": "2024-01-25",
    "gpt-4-0613": "2023-06-13",
    "amazon.nova-pro-v1-0": "2024-12-05",
    "gemini-1.5-flash-8b-exp-0827": "2024-08-27",
}
ALIAS_OVERRIDES = {
    "amazon.nova-pro-v1:0": "amazon.nova-pro-v1-0",
    "chatgpt-4o-latest-20250326": "chatgpt-4o-latest-2025-03-27",
    "claude-3-7-sonnet-20250219": "claude-3-7-sonnet-20250219-base",
}
OPEN_PREFIXES = (
    "llama",
    "meta-llama",
    "gemma",
    "qwen",
    "qwq",
    "deepseek",
    "mistral",
    "mixtral",
    "phi",
    "gpt-oss",
    "open-mistral",
    "olmo",
    "glm",
)


def fetch_text(url: str) -> str:
    with urllib.request.urlopen(url, timeout=60) as resp:
        return resp.read().decode("utf-8")


def fetch_json(url: str):
    return json.loads(fetch_text(url))


def canonicalize(text: str) -> str:
    # Normalize model labels across sources to a single identifier space.
    text = str(text).strip().lower()
    text = text.replace("/", "-").replace("_", "-").replace(" ", "-").replace(":", "-")
    text = re.sub(r"[^a-z0-9\-.]+", "-", text)
    text = re.sub(r"-{2,}", "-", text).strip("-")
    return ALIAS_OVERRIDES.get(text, text)


def is_open_weight(model_id: str) -> bool:
    return model_id.startswith(OPEN_PREFIXES)


def parse_date(text: str) -> datetime:
    return datetime.strptime(text, "%Y-%m-%d")


def age_months(eval_date: datetime, release_date: datetime) -> float:
    # 30.44 is the average month length in days.
    return (eval_date - release_date).days / 30.44


def infer_release_date(model_id: str, first_observed: datetime | None) -> tuple[datetime | None, str]:
    if model_id in RELEASE_DATE_OVERRIDES:
        return parse_date(RELEASE_DATE_OVERRIDES[model_id]), "override"
    m = re.search(r"(20\d{2})-(\d{2})-(\d{2})", model_id)
    if m:
        return parse_date(f"{m.group(1)}-{m.group(2)}-{m.group(3)}"), "parsed_from_name"
    if first_observed is not None:
        return first_observed, "first_observed_proxy"
    return None, "unresolved"


def normal_p_value_from_t(t_value: float) -> float:
    # Two-sided p-value under a normal approximation to the t distribution.
    return math.erfc(abs(t_value) / math.sqrt(2.0))


def regression_one_regressor(x: list[float], y: list[float], dof: int) -> tuple[float, float]:
    sxx = sum(v * v for v in x)
    if sxx == 0 or len(x) < 3:
        return float("nan"), float("nan")
    beta = sum(a * b for a, b in zip(x, y)) / sxx
    resid = [yy - beta * xx for xx, yy in zip(x, y)]
    sigma2 = sum(r * r for r in resid) / max(dof, 1)
    se = math.sqrt(sigma2 / sxx) if sxx > 0 else float("nan")
    t_value = beta / se if se > 0 else float("nan")
    return beta, normal_p_value_from_t(t_value) if math.isfinite(t_value) else float("nan")


def demean_one_way(rows: list[dict], y_key: str, x_key: str, group_key: str) -> tuple[list[float], list[float], int]:
    grouped = defaultdict(list)
    for i, row in enumerate(rows):
        grouped[row[group_key]].append(i)
    x = [row[x_key] for row in rows]
    y = [row[y_key] for row in rows]
    for idxs in grouped.values():
        mx = statistics.fmean(x[i] for i in idxs)
        my = statistics.fmean(y[i] for i in idxs)
        for i in idxs:
            x[i] -= mx
            y[i] -= my
    dof = len(rows) - len(grouped) - 1
    return x, y, dof


def demean_two_way(rows: list[dict], y_key: str, x_key: str, g1: str, g2: str, iters: int = 20):
    # Alternating demeaning over both groupings approximates two-way fixed effects.
    x = [row[x_key] for row in rows]
    y = [row[y_key] for row in rows]
    groups = []
    for key in (g1, g2):
        grouped = defaultdict(list)
        for i, row in enumerate(rows):
            grouped[row[key]].append(i)
        groups.append(grouped)
    for _ in range(iters):
        for grouped in groups:
            for idxs in grouped.values():
                mx = statistics.fmean(x[i] for i in idxs)
                my = statistics.fmean(y[i] for i in idxs)
                for i in idxs:
                    x[i] -= mx
                    y[i] -= my
    dof = len(rows) - len(groups[0]) - len(groups[1]) - 1
    return x, y, dof


def load_livebench():
    task_rows = []
    first_seen = {}
    for date_token in LIVEBENCH_DATES:
        url = f"https://raw.githubusercontent.com/LiveBench/livebench.github.io/main/public/table_{date_token}.csv"
        release_date = parse_date(date_token.replace("_", "-"))
        reader = csv.DictReader(fetch_text(url).splitlines())
        raw_rows = list(reader)
        task_names = [c for c in reader.fieldnames if c != "model"]
        for task in task_names:
            vals = []
            for row in raw_rows:
                try:
                    vals.append(float(row[task]))
                except Exception:
                    pass
            if not vals:
                # Skip tasks with no parseable scores in this snapshot.
                continue
            mean = statistics.fmean(vals)
            std = statistics.pstdev(vals) if len(vals) > 1 else 0.0
            for row in raw_rows:
                try:
                    score = float(row[task])
                except Exception:
                    continue
                model_id = canonicalize(row["model"])
                first_seen[model_id] = min(first_seen.get(model_id, release_date), release_date)
                z = 0.0 if std == 0 else (score - mean) / std
                task_rows.append(
                    {
                        "model_id": model_id,
                        "evaluation_date": release_date,
                        "task_name": task,
                        "score_z": z,
                    }
                )
    return task_rows, first_seen


def parse_snapshot_date(message: str, commit_date: datetime) -> datetime:
    m = re.search(r"(20\d{2}-\d{2}-\d{2})", message)
    if m:
        return parse_date(m.group(1))
    m = re.search(r"\b(\d{1,2})/(\d{1,2})\b", message)
    if m:
        return datetime(commit_date.year, int(m.group(1)), int(m.group(2)))
    return commit_date


def load_leaderboard():
    api = (
        f"https://api.github.com/repos/{ARENA_REPO}/commits?"
        + urllib.parse.urlencode({"path": ARENA_PATH, "per_page": 100, "page": 1})
    )
    commits = fetch_json(api)
    monthly_rows = []
    content_hashes = set()
    model_first_seen = {}
    for commit in commits:
        sha = commit["sha"]
        message = commit["commit"]["message"].splitlines()[0]
        commit_date = datetime.fromisoformat(commit["commit"]["committer"]["date"].replace("Z", "+00:00")).replace(tzinfo=None)
        snapshot_date = parse_snapshot_date(message, commit_date)
        raw_url = f"https://raw.githubusercontent.com/{ARENA_REPO}/{sha}/{ARENA_PATH}"
        text = fetch_text(raw_url)
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in content_hashes:
            # Skip commits that did not change the leaderboard payload.
            continue
        content_hashes.add(digest)
        payload = json.loads(text)
        leaderboard = payload[ARENA_CATEGORY]
        ratings = [float(v["rating"]) for v in leaderboard.values()]
        mean = statistics.fmean(ratings)
        std = statistics.pstdev(ratings) if len(ratings) > 1 else 0.0
        month = datetime(snapshot_date.year, snapshot_date.month, 1)
        for model_name, stats in leaderboard.items():
            model_id = canonicalize(model_name)
            model_first_seen[model_id] = min(model_first_seen.get(model_id, month), month)
            rating = float(stats["rating"])
            z = 0.0 if std == 0 else (rating - mean) / std
            monthly_rows.append(
                {
                    "model_id": model_id,
                    "date": month,
                    "subjective_score_std": z,
                }
            )
    # Average multiple snapshots that fall in the same model-month.
    grouped = defaultdict(list)
    for row in monthly_rows:
        grouped[(row["model_id"], row["date"])].append(row["subjective_score_std"])
    out = []
    for (model_id, date), vals in grouped.items():
        out.append({"model_id": model_id, "date": date, "subjective_score_std": statistics.fmean(vals)})
    return out, model_first_seen


def main():
    objective_rows, obj_first_seen = load_livebench()
    leaderboard_rows, lead_first_seen = load_leaderboard()
    model_ids = sorted(set(obj_first_seen) | set(lead_first_seen))
    model_meta = {}
    for model_id in model_ids:
        first_obs = min(
            [d for d in [obj_first_seen.get(model_id), lead_first_seen.get(model_id)] if d is not None],
            default=None,
        )
        release_date, release_source = infer_release_date(model_id, first_obs)
        version_uncertainty = "high" if ("latest" in model_id or release_source == "first_observed_proxy") else "low"
        model_meta[model_id] = {
            "release_date": release_date,
            "version_uncertainty": version_uncertainty,
            "is_open_weight": is_open_weight(model_id),
        }
    obj_tp = defaultdict(set)
    sub_tp = defaultdict(set)
    for row in objective_rows:
        obj_tp[row["model_id"]].add(row["evaluation_date"])
    for row in leaderboard_rows:
        sub_tp[row["model_id"]].add(row["date"])
    closed_main = []
    for model_id, meta in model_meta.items():
        if meta["is_open_weight"]:
            continue
        if meta["version_uncertainty"] != "low":
            continue
        if len(obj_tp[model_id]) >= 3 and len(sub_tp[model_id]) >= 3:
            closed_main.append(model_id)
    open_control = []
    for model_id, meta in model_meta.items():
        if not meta["is_open_weight"]:
            continue
        if len(obj_tp[model_id]) >= 3:
            open_control.append(model_id)
    obj_main_rows = []
    for row in objective_rows:
        if row["model_id"] in closed_main:
            meta = model_meta[row["model_id"]]
            r = dict(row)
            r["age_months"] = age_months(r["evaluation_date"], meta["release_date"])
            obj_main_rows.append(r)
    obj_open_rows = []
    for row in objective_rows:
        if row["model_id"] in open_control:
            meta = model_meta[row["model_id"]]
            r = dict(row)
            r["age_months"] = age_months(r["evaluation_date"], meta["release_date"])
            obj_open_rows.append(r)
    lead_main_rows = []
    for row in leaderboard_rows:
        if row["model_id"] in closed_main:
            meta = model_meta[row["model_id"]]
            r = dict(row)
            r["age_months"] = age_months(r["date"], meta["release_date"])
            lead_main_rows.append(r)
    x_obj, y_obj, dof_obj = demean_two_way(obj_main_rows, "score_z", "age_months", "model_id", "task_name")
    obj_beta, obj_p = regression_one_regressor(x_obj, y_obj, dof_obj)
    x_open, y_open, dof_open = demean_two_way(obj_open_rows, "score_z", "age_months", "model_id", "task_name")
    open_beta, open_p = regression_one_regressor(x_open, y_open, dof_open)
    x_lead, y_lead, dof_lead = demean_one_way(lead_main_rows, "subjective_score_std", "age_months", "model_id")
    lead_beta, lead_p = regression_one_regressor(x_lead, y_lead, dof_lead)
    output = {
        "closed_main_models": len(closed_main),
        "open_control_models": len(open_control),
        "objective_beta": obj_beta,
        "objective_p": obj_p,
        "objective_open_beta": open_beta,
        "objective_open_p": open_p,
        "leaderboard_beta": lead_beta,
        "leaderboard_p": lead_p,
        "closed_main_model_ids": sorted(closed_main),
    }
    Path("output").mkdir(exist_ok=True)
    Path("output/results.json").write_text(json.dumps(output, indent=2))
    print(json.dumps(output, indent=2))
    if not (len(closed_main) >= 10 and obj_beta < 0 and lead_beta < 0):
        raise SystemExit("Core reproducibility contract failed.")


if __name__ == "__main__":
    main()
```
## Expected Output
The script should print a JSON object and create `output/results.json`.
Success condition:
- `closed_main_models >= 10`
- `objective_beta < 0`
- `leaderboard_beta < 0`
## Expected Artifacts
- `output/results.json`
Core fields:
- `closed_main_models`
- `open_control_models`
- `objective_beta`
- `objective_p`
- `objective_open_beta`
- `objective_open_p`
- `leaderboard_beta`
- `leaderboard_p`
- `closed_main_model_ids`
## Interpretation Rules
- This is a bounded executable contract for the paper's core result.
- The main subjective variable is the monthly arena leaderboard z-score.
- Pairwise preference is reported in the paper as robustness evidence, but not required for this cold-start executable contract.
- The intended execution environment is `Codex`.
## Failure Rules
- If GitHub or LiveBench public files are unavailable, fail explicitly.
- If the reproduced closed-source sample falls below 10 models, fail explicitly.
- If either the closed-source objective slope or the leaderboard slope is non-negative, fail explicitly.