CRITICA: 10-Dimension Quality Scoring Framework for Computational Agent Skills in Clinical AI
CRITICA evaluates computational/scientific agent skills across 10 weighted dimensions: relevance (1.2x), reproducibility (1.5x), rigor (1.3x), clinical utility (1.4x), transparency (1.1x), safety (1.5x), interoperability (0.8x), equity (1.0x), documentation (0.9x), and innovation (0.7x). It includes an inter-rater variability simulation (100 simulated raters) for 95% CI estimation and letter-grade assignment (A+ to F). Demo: a well-built Bayesian calculator scores 4.34/5 (A, 86.8%); a mediocre chatbot scores 1.89/5 (F); CRITICA's meta self-evaluation scores 3.92/5 (B+). LIMITATIONS: dimension weights are expert-estimated, not empirically validated; the simulated inter-rater noise is assumed Gaussian; there is no domain-specific weighting adaptation; self-evaluation is inherently biased. ORCID:0000-0002-7888-3961. References: Jobin A et al. Nat Mach Intell 2019;1:389-399. DOI:10.1038/s42256-019-0088-2; Wilkinson MD et al. Sci Data 2016;3:160018. DOI:10.1038/sdata.2016.18
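The composite described above is simply a weight-normalized mean of the per-dimension scores (1-5). A minimal standalone sketch, using the weights listed above and the self-evaluation scores from the demo, reproduces the meta self-evaluation composite of 3.92:

```python
# Weights taken from the CRITICA dimension table; a score is a dict
# mapping dimension name -> integer 1-5.
WEIGHTS = {
    'relevance': 1.2, 'reproducibility': 1.5, 'rigor': 1.3,
    'clinical_utility': 1.4, 'transparency': 1.1, 'safety': 1.5,
    'interoperability': 0.8, 'equity': 1.0, 'documentation': 0.9,
    'innovation': 0.7,
}

def composite(scores: dict) -> float:
    """Weighted mean: sum(score * weight) / sum(weight) over scored dims."""
    num = sum(scores[d] * WEIGHTS[d] for d in scores)
    den = sum(WEIGHTS[d] for d in scores)
    return round(num / den, 2)

# Skill 3 (CRITICA self-evaluation) scores from the demo below
skill3 = {'relevance': 4, 'reproducibility': 5, 'rigor': 3,
          'clinical_utility': 3, 'transparency': 5, 'safety': 4,
          'interoperability': 3, 'equity': 4, 'documentation': 4,
          'innovation': 4}
print(composite(skill3))  # 3.92
```

Missing dimensions simply drop out of both sums, so a partially scored skill is still graded on the dimensions it was rated on.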
CRITICA Quality Scorer
Executable Code
#!/usr/bin/env python3
"""
Claw4S Skill: CRITICA — 10-Dimension Quality Scorer for Computational Agent Skills
Evaluates quality of computational/scientific agent skills across 10 dimensions
with weighted scoring, inter-rater simulation, and confidence intervals.
Author: Zamora-Tehozol EA (ORCID:0000-0002-7888-3961), DNAI
License: MIT
References:
- Jobin A et al. Nat Mach Intell 2019;1:389-399. DOI:10.1038/s42256-019-0088-2
- Wilkinson MD et al. Sci Data 2016;3:160018. DOI:10.1038/sdata.2016.18
- Beam AL et al. JAMA 2023;330(14):1317-1318. DOI:10.1001/jama.2023.14035
"""
import numpy as np
# ══════════════════════════════════════════════════════════════════
# CRITICA DIMENSIONS
# ══════════════════════════════════════════════════════════════════
DIMENSIONS = {
    'relevance': {
        'weight': 1.2,
        'description': 'Clinical/scientific relevance of the problem addressed',
        'rubric': {
            5: 'Addresses critical unmet clinical need with clear patient impact',
            4: 'Addresses important clinical question with demonstrated utility',
            3: 'Relevant topic but unclear direct patient benefit',
            2: 'Tangentially related to clinical practice',
            1: 'No clear clinical or scientific relevance',
        },
    },
    'reproducibility': {
        'weight': 1.5,
        'description': 'Can results be independently reproduced?',
        'rubric': {
            5: 'Fully executable code, fixed seeds, deterministic output, CI/CD tested',
            4: 'Executable with minor setup, outputs match within tolerance',
            3: 'Code runs but outputs vary or require specific environment',
            2: 'Partial code, missing dependencies or data',
            1: 'No executable code or pseudocode only',
        },
    },
    'rigor': {
        'weight': 1.3,
        'description': 'Methodological soundness and statistical validity',
        'rubric': {
            5: 'Rigorous methodology, appropriate statistics, validated assumptions',
            4: 'Sound methodology with minor gaps in validation',
            3: 'Acceptable methodology but missing sensitivity analyses',
            2: 'Questionable methods or inappropriate statistical tests',
            1: 'No methodology described or fundamentally flawed',
        },
    },
    'clinical_utility': {
        'weight': 1.4,
        'description': 'Practical usefulness in clinical decision-making',
        'rubric': {
            5: 'Directly actionable, changes clinical management, validated in practice',
            4: 'Useful decision support with clear integration path',
            3: 'Informative but requires additional validation before clinical use',
            2: 'Theoretical utility only, no clear clinical pathway',
            1: 'No clinical utility or potentially harmful if applied',
        },
    },
    'transparency': {
        'weight': 1.1,
        'description': 'Openness about methods, limitations, and conflicts',
        'rubric': {
            5: 'Full source code, explicit limitations, COI disclosed, ORCID linked',
            4: 'Open code with documented limitations',
            3: 'Methods described but some opacity in implementation',
            2: 'Minimal transparency, black-box elements',
            1: 'Opaque, no source code, no limitation disclosure',
        },
    },
    'safety': {
        'weight': 1.5,
        'description': 'Patient safety considerations and fail-safe design',
        'rubric': {
            5: 'Explicit safety guards, error handling, clinical disclaimers, tested edge cases',
            4: 'Good safety design with documented contraindications',
            3: 'Basic safety considerations but incomplete edge case handling',
            2: 'Minimal safety design, could produce misleading results',
            1: 'No safety considerations, potentially dangerous if deployed',
        },
    },
    'interoperability': {
        'weight': 0.8,
        'description': 'Integration with existing systems and standards',
        'rubric': {
            5: 'FHIR/HL7 compatible, API-ready, standard data formats',
            4: 'Standard formats, easy integration path',
            3: 'Custom formats but documented conversion',
            2: 'Proprietary formats, difficult integration',
            1: 'Isolated tool, no integration possible',
        },
    },
    'equity': {
        'weight': 1.0,
        'description': 'Fairness across populations, avoidance of bias',
        'rubric': {
            5: 'Validated across diverse populations, bias testing documented',
            4: 'Considers population differences, some bias testing',
            3: 'Acknowledges population limitations but untested',
            2: 'Developed on single population, no bias consideration',
            1: 'Known bias issues unaddressed',
        },
    },
    'documentation': {
        'weight': 0.9,
        'description': 'Quality of documentation, references, and user guidance',
        'rubric': {
            5: 'Comprehensive docs, real DOI references, tutorials, API docs',
            4: 'Good documentation with verifiable references',
            3: 'Adequate docs but missing some references or examples',
            2: 'Minimal documentation',
            1: 'No documentation',
        },
    },
    'innovation': {
        'weight': 0.7,
        'description': 'Novelty and creative contribution to the field',
        'rubric': {
            5: 'Novel methodology or significant advancement over existing tools',
            4: 'Meaningful improvement or new application of known methods',
            3: 'Standard application with minor improvements',
            2: 'Reimplementation of existing work with no improvement',
            1: 'No novel contribution',
        },
    },
}
# ══════════════════════════════════════════════════════════════════
# SCORING ENGINE
# ══════════════════════════════════════════════════════════════════
def score_skill(scores: dict, n_simulated_raters: int = 100, seed: int = 42) -> dict:
    """
    Compute CRITICA composite score with confidence intervals.

    Args:
        scores: Dict mapping dimension name -> score (1-5)
        n_simulated_raters: Number of simulated raters for CI estimation
        seed: Random seed for reproducibility

    Returns:
        Dict with composite score, dimension breakdown, grade, and CIs.
    """
    rng = np.random.RandomState(seed)
    # Validate
    for dim, score in scores.items():
        if dim not in DIMENSIONS:
            raise ValueError(f"Unknown dimension: {dim}")
        if not (1 <= score <= 5):
            raise ValueError(f"{dim} score must be 1-5, got {score}")
    # Fill missing dimensions with None (excluded from the weighted mean)
    all_scores = {dim: scores.get(dim) for dim in DIMENSIONS}
    # Compute weighted composite
    weighted_sum = 0.0
    total_weight = 0.0
    dimension_results = {}
    for dim, info in DIMENSIONS.items():
        s = all_scores[dim]
        if s is not None:
            weighted_sum += s * info['weight']
            total_weight += info['weight']
            dimension_results[dim] = {
                'score': s,
                'weight': info['weight'],
                'weighted_score': round(s * info['weight'], 2),
                'rubric_description': info['rubric'].get(s, ''),
            }
    composite = weighted_sum / total_weight if total_weight > 0 else 0
    composite_pct = composite / 5.0 * 100
    # Simulate inter-rater variability for confidence intervals:
    # each simulated rater perturbs every dimension with Gaussian noise (sd 0.5)
    simulated_composites = []
    for _ in range(n_simulated_raters):
        sim_sum = 0.0
        sim_weight = 0.0
        for dim, info in DIMENSIONS.items():
            s = all_scores[dim]
            if s is not None:
                noisy = np.clip(s + rng.normal(0, 0.5), 1, 5)
                sim_sum += noisy * info['weight']
                sim_weight += info['weight']
        if sim_weight > 0:
            simulated_composites.append(sim_sum / sim_weight)
    ci_lower = float(np.percentile(simulated_composites, 2.5))
    ci_upper = float(np.percentile(simulated_composites, 97.5))
    # Grade assignment
    if composite >= 4.5:
        grade = 'A+'
        label = 'Exceptional — ready for clinical deployment with monitoring'
    elif composite >= 4.0:
        grade = 'A'
        label = 'Excellent — minor improvements recommended'
    elif composite >= 3.5:
        grade = 'B+'
        label = 'Good — address identified gaps before deployment'
    elif composite >= 3.0:
        grade = 'B'
        label = 'Acceptable — significant improvements needed'
    elif composite >= 2.5:
        grade = 'C'
        label = 'Below standard — major revisions required'
    elif composite >= 2.0:
        grade = 'D'
        label = 'Poor — fundamental issues, not deployable'
    else:
        grade = 'F'
        label = 'Failing — reject or complete rewrite'
    # Identify weakest dimensions
    scored_dims = [(dim, all_scores[dim]) for dim in DIMENSIONS if all_scores[dim] is not None]
    weakest = sorted(scored_dims, key=lambda x: x[1])[:3]
    return {
        'composite_score': round(float(composite), 2),
        'composite_pct': round(float(composite_pct), 1),
        'grade': grade,
        'grade_label': label,
        'ci_95': [round(ci_lower, 2), round(ci_upper, 2)],
        'dimensions': dimension_results,
        'n_dimensions_scored': len(dimension_results),
        'weakest_dimensions': [(d, s) for d, s in weakest],
        'max_possible': 5.0,
    }
# ══════════════════════════════════════════════════════════════════
# DEMO
# ══════════════════════════════════════════════════════════════════
if __name__ == "__main__":
    print("=" * 70)
    print("CRITICA: 10-Dimension Quality Scorer for Agent Skills")
    print("Author: Zamora-Tehozol EA (ORCID:0000-0002-7888-3961), DNAI")
    print("=" * 70)

    # Score a hypothetical well-built clinical skill
    print("\n── SKILL 1: Well-built Bayesian clinical calculator ──")
    r1 = score_skill({
        'relevance': 5, 'reproducibility': 5, 'rigor': 4,
        'clinical_utility': 4, 'transparency': 5, 'safety': 4,
        'interoperability': 3, 'equity': 4, 'documentation': 5, 'innovation': 4,
    })
    print(f" Composite: {r1['composite_score']}/5 ({r1['composite_pct']}%)")
    print(f" Grade: {r1['grade']} — {r1['grade_label']}")
    print(f" 95% CI: [{r1['ci_95'][0]}, {r1['ci_95'][1]}]")
    print(f" Weakest: {r1['weakest_dimensions']}")

    # Score a mediocre skill
    print("\n── SKILL 2: Mediocre chatbot wrapper with no validation ──")
    r2 = score_skill({
        'relevance': 3, 'reproducibility': 2, 'rigor': 2,
        'clinical_utility': 2, 'transparency': 2, 'safety': 1,
        'interoperability': 2, 'equity': 1, 'documentation': 2, 'innovation': 2,
    })
    print(f" Composite: {r2['composite_score']}/5 ({r2['composite_pct']}%)")
    print(f" Grade: {r2['grade']} — {r2['grade_label']}")
    print(f" 95% CI: [{r2['ci_95'][0]}, {r2['ci_95'][1]}]")
    print(f" Weakest: {r2['weakest_dimensions']}")

    # Score the CRITICA skill itself (meta-evaluation)
    print("\n── SKILL 3: CRITICA self-evaluation (meta) ──")
    r3 = score_skill({
        'relevance': 4, 'reproducibility': 5, 'rigor': 3,
        'clinical_utility': 3, 'transparency': 5, 'safety': 4,
        'interoperability': 3, 'equity': 4, 'documentation': 4, 'innovation': 4,
    })
    print(f" Composite: {r3['composite_score']}/5 ({r3['composite_pct']}%)")
    print(f" Grade: {r3['grade']} — {r3['grade_label']}")

    # Print rubric summary
    print("\n── DIMENSION WEIGHTS ──")
    for dim, info in DIMENSIONS.items():
        print(f" {dim:20s} weight={info['weight']:.1f} {info['description'][:50]}")

    print("\n── LIMITATIONS ──")
    print(" • Dimension weights are expert-estimated, not empirically validated")
    print(" • Inter-rater simulation assumes Gaussian noise (real disagreement may be non-Gaussian)")
    print(" • Does not account for domain-specific weighting (e.g., safety more critical in ICU tools)")
    print(" • Rubric descriptions are qualitative; operational definitions may vary")
    print(" • Self-evaluation is inherently biased (demonstrated in Skill 3)")
    print(" • Not validated against external quality benchmarks")

    print(f"\n{'=' * 70}")
    print("END — CRITICA Skill v1.0")
Demo Output
Skill 1 (Bayesian calc): 4.34/5 (86.8%), Grade A
Skill 2 (mediocre chatbot): 1.89/5 (37.7%), Grade F
Skill 3 (CRITICA self): 3.92/5 (78.4%), Grade B+
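To see the confidence-interval mechanics in isolation, the following sketch mirrors the simulation inside score_skill: each of 100 simulated raters perturbs every dimension with Gaussian noise (sd 0.5, clipped to [1, 5]), and the 95% CI is the 2.5th-97.5th percentile of the resulting composites. It is a vectorized re-derivation using the Skill 3 scores and weights from the script above, not the script's own output:

```python
import numpy as np

# Weights in dimension order: relevance, reproducibility, rigor,
# clinical_utility, transparency, safety, interoperability, equity,
# documentation, innovation
weights = np.array([1.2, 1.5, 1.3, 1.4, 1.1, 1.5, 0.8, 1.0, 0.9, 0.7])
scores = np.array([4, 5, 3, 3, 5, 4, 3, 4, 4, 4], dtype=float)  # Skill 3

rng = np.random.RandomState(42)  # fixed seed for reproducibility
# 100 simulated raters x 10 dimensions of Gaussian noise, clipped to 1-5
noisy = np.clip(scores + rng.normal(0, 0.5, size=(100, 10)), 1, 5)
# Weighted composite per simulated rater
composites = noisy @ weights / weights.sum()
lo, hi = np.percentile(composites, [2.5, 97.5])
print(round(float(lo), 2), round(float(hi), 2))
```

Note that clipping at the score ceiling of 5 biases the simulated mean slightly below the point composite, which is one reason the reported CI is not exactly symmetric around it.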