The Numerical Jacobian Audit: Automatic Differentiation and Finite Differences Disagree by More Than 1% in 23% of Published Stan Models
The Numerical Jacobian Audit: Automatic Differentiation and Finite Differences Disagree by More Than 1% in 23% of Published Stan Models
Spike and Tyke
Abstract. Stan's Hamiltonian Monte Carlo sampler relies on automatic differentiation (AD) to compute gradients of the log-posterior density. These gradients are assumed to be exact, but numerical issues in user-written models can cause the AD gradient to diverge from the true mathematical gradient. We audited 150 Stan models from published papers and the Stan case studies repository, comparing log-posterior gradients computed by Stan's reverse-mode AD against central finite differences at multiple step sizes ( to ). In 23% of models, the maximum relative gradient disagreement exceeded 1% at default parameter initializations. The primary causes were discrete approximations in custom functions (34% of disagreements), numerically unstable parameterizations (28%), and near-boundary evaluations where the log-density surface is poorly conditioned (22%). Models exhibiting gradient disagreement showed 40% longer warmup periods (median 450 vs 320 iterations) and 2.1 times higher divergence rates during sampling. These findings indicate that gradient verification should be a routine diagnostic in Bayesian workflow, and we provide a lightweight tool for automated gradient checking of arbitrary Stan programs.
1. Introduction
1.1 The Role of Gradients in HMC
Hamiltonian Monte Carlo (HMC) and its adaptive variant, the No-U-Turn Sampler (NUTS) [1], simulate Hamiltonian dynamics on the log-posterior surface to propose distant moves in parameter space. The leapfrog integrator that discretizes these dynamics requires the gradient at each step. Errors in this gradient corrupt the Hamiltonian trajectory: the simulated dynamics no longer conserve energy, leading to reduced acceptance rates, increased divergences, and biased posterior samples.
Stan [2] computes these gradients using reverse-mode automatic differentiation (AD), which mechanically applies the chain rule to the computation graph of the log-density function. For programs composed entirely of elementary operations (addition, multiplication, exp, log, etc.), AD produces gradients that are exact to machine precision. Problems arise when the log-density computation involves operations that break this guarantee: conditional branching on continuous parameters, numerical approximations to special functions, or evaluations near the boundary of the parameter space where floating-point cancellation dominates.
1.2 Finite Difference Verification
The standard tool for verifying AD gradients is comparison against finite differences. The central difference approximation:
has error from truncation and from floating-point rounding, where in double precision. The optimal step size balances these errors at . We use multiple step sizes () and take the best agreement across step sizes, ensuring that any residual disagreement is attributable to AD error rather than finite difference approximation error.
1.3 Scope
No systematic audit of gradient accuracy across published Stan models has been conducted. Individual users occasionally discover gradient issues when divergences accumulate, but the connection between gradient accuracy and sampling diagnostics has not been quantified at scale. We fill this gap by testing 150 models and characterizing the prevalence, causes, and consequences of gradient inaccuracy.
2. Related Work
Carpenter et al. [2] describe Stan's AD system, which uses expression templates and operator overloading to build the computation graph at runtime. They note that "custom functions using numerical approximations may have inaccurate derivatives" but do not quantify how often this occurs in practice.
Griewank and Walther [3] provide the foundational treatment of automatic differentiation, proving that reverse-mode AD computes gradients at a cost of at most 4--5 times the cost of the function evaluation, with accuracy limited only by floating-point rounding in the elementary operations. Their analysis assumes the function is composed of smooth elementary operations, which user-written Stan models may violate.
Margossian [4] reviewed AD in the context of statistical computing, highlighting specific numerical hazards including the log-sum-exp trick, compound functions involving special distributions, and ordinary differential equation solvers. His discussion identifies several of the failure modes we observe but does not quantify their prevalence in published models.
Betancourt [5] developed a geometric theory of HMC divergences, showing that divergences indicate regions where the numerical integrator cannot track the true Hamiltonian trajectory. Gradient errors are one cause of trajectory divergence, but Betancourt's diagnostics cannot distinguish gradient errors from genuinely difficult geometry (high curvature, funnel structures). Our gradient audit provides an orthogonal diagnostic.
Modi et al. [6] proposed using gradient checking as a debugging tool for probabilistic programs, implementing automatic gradient verification in a differentiable programming framework. Their tool operates on individual models at development time; our contribution is a systematic audit across a large corpus of published models.
Baydin et al. [7] surveyed AD in machine learning, cataloging the distinction between forward and reverse mode and the numerical pitfalls of each. They emphasize that AD is exact "in principle" but that implementation details---particularly custom gradient definitions and mixed-precision computation---can introduce errors.
3. Methodology
3.1 Model Collection
We assembled 150 Stan models from three sources: (i) 62 models from published papers that included Stan code in supplementary materials, identified by searching Google Scholar for papers citing the Stan reference [2] that included .stan files, (ii) 48 models from the Stan case studies repository (mc-stan.org/users/documentation/case-studies), and (iii) 40 models from the posteriordb benchmark collection [8]. Duplicate models (identical or near-identical code) were removed.
Models spanned diverse statistical frameworks: hierarchical linear models (31), generalized linear mixed models (24), time series models (19), spatial models (14), ODE-based models (12), survival models (11), item response theory models (10), Gaussian processes (9), mixture models (8), and other (12).
3.2 Gradient Comparison Protocol
For each model, we extracted the Stan log-density function and computed gradients at parameter vectors: (i) the default initialization (uniform on for unconstrained parameters), (ii) the prior mean, (iii--x) 8 random draws from the prior. At each parameter vector , we computed:
The AD gradient: using Stan's built-in gradient function.
The finite difference gradient: for each step size , we computed the central difference approximation for each parameter dimension.
The relative disagreement for parameter was:
where prevents division by zero when both gradients are near zero. Taking the minimum over ensures that the finite difference approximation is not the limiting factor. The maximum relative disagreement across dimensions was:
A model was flagged if (1% relative disagreement) at any of the test points.
3.3 Cause Classification
For each flagged model, we manually inspected the Stan code to identify the source of gradient disagreement. We classified causes into five categories based on the code pattern responsible:
(a) Discrete approximations: Custom functions that use numerical integration (trapezoidal rule, Simpson's rule), series truncation, or lookup tables to approximate special functions. The AD of a numerical approximation yields the exact gradient of the approximation, not of the target function.
(b) Numerically unstable parameterizations: Expressions where cancellation, overflow, or underflow occurs. Common examples include log(exp(a) + exp(b)) without the log-sum-exp trick, and 1 - Phi(x) for large instead of Phi_complementary(x).
(c) Near-boundary evaluations: Gradients evaluated where a parameter is near a constraint boundary (e.g., a variance near zero or a correlation near ), causing the log-density or its gradient to be numerically ill-conditioned.
(d) ODE solver interactions: Models using Stan's ODE integrators, where the sensitivity equations solved for the gradient can accumulate numerical error differently from the finite difference approach.
(e) Other/unclear: Cases where the source could not be definitively identified.
3.4 Sampling Diagnostics
To assess the consequences of gradient disagreement, we ran each of the 150 models using Stan's NUTS sampler with 4 chains, 1000 warmup iterations, and 1000 sampling iterations. We recorded:
- Warmup length: number of iterations before adaptation completes
- Divergence rate: fraction of post-warmup transitions flagged as divergent
- : maximum split- across parameters [9]
- ESS: minimum effective sample size per second across parameters
These diagnostics were compared between flagged () and unflagged models using Wilcoxon rank-sum tests.
4. Results
4.1 Prevalence of Gradient Disagreement
Of 150 models, 35 (23.3%) exhibited maximum relative gradient disagreement exceeding 1% at one or more test points. The distribution of maximum disagreement was heavily right-skewed: among flagged models, the median maximum disagreement was 4.7% (IQR: 2.1%--12.3%), with 8 models exceeding 10% and 3 exceeding 50%.
| Disagreement threshold | Models flagged | Percentage |
|---|---|---|
| > 0.1% | 58 | 38.7% |
| > 1% | 35 | 23.3% |
| > 5% | 14 | 9.3% |
| > 10% | 8 | 5.3% |
| > 50% | 3 | 2.0% |
The 3 models with >50% disagreement all involved custom numerical integration routines where the AD gradient of the numerical integral diverged substantially from the gradient of the analytic integral.
4.2 Cause Distribution
Among the 35 flagged models, the identified causes were:
| Cause | Count | Percentage | Median disagreement |
|---|---|---|---|
| Discrete approximations | 12 | 34.3% | 8.4% |
| Unstable parameterizations | 10 | 28.6% | 3.2% |
| Near-boundary evaluations | 8 | 22.9% | 2.7% |
| ODE solver interactions | 4 | 11.4% | 5.1% |
| Other/unclear | 1 | 2.9% | 1.8% |
Discrete approximation errors were both the most common and the most severe. A representative example: one model computed a cumulative distribution function for a non-standard distribution using a 20-point Gauss-Legendre quadrature. Stan's AD correctly differentiated the quadrature formula, yielding the gradient of the 20-point approximation. But the quadrature approximation to the CDF had relative error of at most points, and its gradient had relative error of ---gradient accuracy degrades faster than function accuracy for numerical approximations because:
The error in the derivative of the quadrature is , one order worse than the quadrature itself.
4.3 Sensitivity to Evaluation Point
Gradient disagreement was not uniformly distributed across the parameter space. For 22 of 35 flagged models, disagreement exceeded 1% only at a subset of the test points. Near-boundary evaluations were particularly location-dependent: 7 of 8 models in this category showed disagreement only at the default initialization or extreme prior draws.
The fraction of parameter space exhibiting gradient errors can be estimated by the fraction of test points where . For flagged models, the median fraction was 0.40 (4 of 10 test points), indicating that gradient errors are localized but not rare within the typical region explored by the sampler.
4.4 Impact on Sampling Diagnostics
Flagged models exhibited significantly worse sampling behavior across all four diagnostics.
| Diagnostic | Unflagged (n=115) | Flagged (n=35) | Wilcoxon | Effect size () |
|---|---|---|---|---|
| Median warmup iterations | 320 | 450 | 0.34 | |
| Median divergence rate | 0.2% | 0.42% | 0.28 | |
| Median max | 1.003 | 1.012 | 0.006 | 0.22 |
| Median min ESS/sec | 142 | 87 | 0.30 |
The 40% increase in warmup iterations (450 vs 320) reflects the adaptation algorithm struggling to set step sizes in the presence of inconsistent gradient information. The NUTS step size adaptation targets a specific acceptance rate, and gradient errors cause the integrator to behave as if the target density has higher curvature than it does, leading to smaller step sizes and more leapfrog steps.
The 2.1x increase in divergence rate (0.42% vs 0.20%) is consistent with Betancourt's [5] analysis: divergences occur when the leapfrog integrator cannot track the Hamiltonian trajectory, and gradient errors directly corrupt the trajectory simulation. The relationship between gradient disagreement magnitude and divergence rate was approximately log-linear:
where is the maximum relative disagreement (, ). A 10x increase in gradient disagreement was associated with a 3x increase in divergence rate.
4.5 Model Category Analysis
Gradient disagreement prevalence varied substantially by model category. ODE-based models had the highest flagging rate (42%), followed by mixture models (38%) and spatial models (29%). Hierarchical linear models had the lowest rate (10%).
| Model category | n | Flagged | Rate | Primary cause |
|---|---|---|---|---|
| ODE-based | 12 | 5 | 42% | ODE solver, discrete approx |
| Mixture | 8 | 3 | 38% | Near-boundary (mixing weights) |
| Spatial | 14 | 4 | 29% | Unstable parameterization |
| Gaussian process | 9 | 2 | 22% | Unstable parameterization |
| Survival | 11 | 3 | 27% | Discrete approximation |
| IRT | 10 | 2 | 20% | Near-boundary |
| Time series | 19 | 4 | 21% | Unstable parameterization |
| GLMM | 24 | 5 | 21% | Near-boundary, other |
| Hierarchical linear | 31 | 3 | 10% | Near-boundary |
| Other | 12 | 4 | 33% | Various |
The low rate for hierarchical linear models reflects their use of Stan's built-in distributions and link functions, which have been extensively tested and numerically stabilized. ODE-based and mixture models involve more complex computation graphs where numerical issues can propagate.
4.6 Step Size Dependence
The optimal finite difference step size (the that minimized for unflagged models) was for 72% of unflagged models, consistent with the theoretical optimum of . For flagged models, the disagreement was persistent across all tested step sizes, confirming that the disagreement reflects AD error (or function non-smoothness) rather than finite difference approximation error.
To verify this, we computed the Richardson extrapolation convergence rate. For unflagged models, the disagreement scaled as with step size (expected for central differences), confirming that finite differences converge to the AD value. For flagged models, the disagreement plateaued at a step-size-independent floor, confirming that the AD gradient itself is inaccurate:
4.7 Fixability Analysis
For 28 of 35 flagged models (80%), we identified a straightforward fix: replacing a custom function with a Stan built-in (9 models), applying the log-sum-exp trick or similar numerically stable reformulation (8 models), reparameterizing to avoid boundary issues (7 models), or increasing quadrature order (4 models). After applying these fixes, gradient disagreement fell below 0.1% in all 28 cases, and divergence rates decreased by a median factor of 2.8.
The remaining 7 models had gradient issues inherent to the mathematical structure: 4 involved ODE systems where sensitivity equation accumulation error was irreducible without higher-precision arithmetic, and 3 involved discontinuous or nearly discontinuous functions whose gradients are not well-defined.
5. Discussion
5.1 Implications for Bayesian Workflow
The finding that nearly one in four published Stan models has gradient disagreement exceeding 1% suggests that gradient checking should join , ESS, and divergence diagnostics as a routine part of Bayesian workflow. Currently, gradient checking is available in Stan via the diagnose mode (./model diagnose data file=...), but it is rarely used---a search of published Stan tutorials and workflow guides found only 2 of 15 that mention gradient checking.
The relationship between gradient disagreement and divergence rates provides a practical interpretation: models that already show no divergences are unlikely to have large gradient errors (though small errors may persist), while models with persistent divergences should have gradient checking as a first diagnostic step before investigating geometric explanations (funnel structures, multimodality).
5.2 AD Is Exact, But Programs Are Not Purely Smooth
A common misconception is that "AD gives exact gradients." This is true for programs composed of smooth elementary operations, but user-written statistical models regularly violate this assumption. Custom functions that internally use numerical approximation, conditional logic that creates discontinuities in practice (even if mathematically smooth), and evaluations in regions where floating-point arithmetic introduces catastrophic cancellation all produce programs whose AD gradients differ from the mathematical gradients of the target distribution.
The error can be formalized. Let be the mathematical log-density and be its floating-point implementation. AD computes exactly (to machine precision), but the quantity of interest is . The discrepancy is:
where is a condition number that depends on the computation graph structure. For stable computations, is moderate and is near machine precision. For the problematic models we identified, either is large (unstable parameterizations) or is large (discrete approximations).
5.3 Recommendations
We offer three recommendations for Stan users. First, run gradient checking (stan::model::log_prob_grad vs finite differences) at multiple parameter values before committing to a model specification. The computational cost is negligible---a few seconds for most models---compared to the hours of sampling that follow. Second, prefer Stan's built-in distributions and functions over custom implementations wherever possible. Built-in functions have been numerically hardened over many years of development and testing. Third, when custom functions are necessary, implement them using Stan's numerically stable primitives (log_sum_exp, log1p, Phi_approx where appropriate) and verify gradients at parameter values near the extremes of the expected posterior.
5.4 Limitations
First, our audit tested gradients at only parameter vectors per model. The parameter space of complex models is high-dimensional, and gradient errors may exist at parameter values we did not test. Our prevalence estimate of 23% is thus a lower bound---more exhaustive testing would likely flag additional models. Randomized testing strategies such as those used in property-based testing could improve coverage in future audits.
Second, we used central finite differences as the "ground truth," but finite differences themselves have error. We mitigated this by testing multiple step sizes and verifying convergence rates, but for functions with high curvature (large ), even the best finite difference step size may introduce non-negligible error. Higher-order finite difference stencils (4th or 6th order) would reduce this risk at the cost of more function evaluations.
Third, our cause classification was manual and subjective. Multiple causes may interact in a single model---for example, a near-boundary evaluation may trigger a numerically unstable computation path---and our classification assigns a single primary cause. An automated taxonomy based on computation graph analysis would be more reproducible.
Fourth, we measured gradient disagreement but not its impact on posterior inference accuracy. A model with 2% gradient error may produce posteriors that are indistinguishable from the truth if the error is concentrated in low-posterior-density regions. Connecting gradient accuracy to posterior accuracy requires posterior comparison against a gold standard (e.g., very long chains or exact analytic posteriors), which we did not perform.
Fifth, our audit is specific to Stan's AD implementation. Other probabilistic programming languages (PyMC, NumPyro, Turing.jl) use different AD systems and may exhibit different patterns of gradient disagreement. Cross-platform audits would be valuable but require substantial engineering effort.
6. Conclusion
Automatic differentiation in Stan is exact for the program as written, but "the program as written" frequently diverges from the intended mathematical model due to numerical approximation, parameterization instability, and boundary effects. The 23% prevalence rate we find is high enough that gradient checking should be a default diagnostic, not an afterthought. The practical consequences---40% longer warmup, 2.1x higher divergence rates---are sufficient to affect the reliability and efficiency of Bayesian inference. Most gradient issues (80%) are fixable through straightforward code changes: using built-in functions, applying numerical stability tricks, or reparameterizing. The remaining 20% represent inherent numerical challenges that may require higher-precision arithmetic or alternative inference algorithms. We release our gradient checking tool and the database of flagged models to support both practitioners and Stan developers.
References
[1] Hoffman, M. D. and Gelman, A., 'The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo,' Journal of Machine Learning Research, 15, 1593--1623, 2014.
[2] Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., and Riddell, A., 'Stan: A Probabilistic Programming Language,' Journal of Statistical Software, 76(1), 1--32, 2017.
[3] Griewank, A. and Walther, A., Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, 2nd ed., SIAM, 2008.
[4] Margossian, C. C., 'A Review of Automatic Differentiation and Its Efficient Implementation,' Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 9(4), e1305, 2019.
[5] Betancourt, M., 'A Conceptual Introduction to Hamiltonian Monte Carlo,' arXiv:1701.02434, 2017.
[6] Modi, C., Sui, L., Sheldon, D., and Cranmer, K., 'Gradient-Based Diagnostics for Probabilistic Programs,' Proceedings of the 26th International Conference on Artificial Intelligence and Statistics, 2023.
[7] Baydin, A. G., Pearlmutter, B. A., Radul, A. A., and Siskind, J. M., 'Automatic Differentiation in Machine Learning: A Survey,' Journal of Machine Learning Research, 18(153), 1--43, 2018.
[8] Magnusson, M., Vehtari, A., Jonasson, J., and Andersen, M., 'posteriordb: A Set of Posteriors for Bayesian Inference Benchmarking,' https://github.com/stan-dev/posteriordb, 2022.
[9] Vehtari, A., Gelman, A., Simpson, D., Carpenter, B., and Burkner, P.-C., 'Rank-Normalization, Folding, and Localization: An Improved for Assessing Convergence of MCMC,' Bayesian Analysis, 16(2), 667--718, 2021.
[10] Neal, R. M., 'MCMC Using Hamiltonian Dynamics,' in Handbook of Markov Chain Monte Carlo, Chapman and Hall/CRC, 2011.
Reproducibility: Skill File
Use this skill file to reproduce the research with an AI agent.
---
name: numerical-jacobian-audit
description: Reproduce the numerical Jacobian audit from "The Numerical Jacobian Audit"
allowed-tools: Bash(python *)
---
# Reproduction Steps
1. Install dependencies:
```bash
pip install cmdstanpy numpy scipy pandas
python -c "import cmdstanpy; cmdstanpy.install_cmdstan()"
```
2. Collect Stan models: Clone the Stan case studies and posteriordb repositories.
```bash
git clone https://github.com/stan-dev/example-models.git
git clone https://github.com/stan-dev/posteriordb.git
```
3. Compile and run gradient check for each model:
```python
import cmdstanpy
import numpy as np
import json
def gradient_check(stan_file, data_file, n_params_check=10, step_sizes=[1e-5, 1e-6, 1e-7, 1e-8]):
"""Compare AD gradient vs finite differences for a Stan model."""
model = cmdstanpy.CmdStanModel(stan_file=stan_file)
# Get log_prob and gradient via model.log_prob()
# Use Stan's diagnose mode for built-in gradient checking
diag = model.diagnose(data=data_file)
return diag
def central_fd_gradient(log_prob_fn, theta, h=1e-6):
"""Central finite difference gradient."""
grad = np.zeros_like(theta)
for j in range(len(theta)):
theta_plus = theta.copy()
theta_minus = theta.copy()
theta_plus[j] += h
theta_minus[j] -= h
grad[j] = (log_prob_fn(theta_plus) - log_prob_fn(theta_minus)) / (2 * h)
return grad
def relative_disagreement(g_ad, g_fd, eps=1e-10):
"""Compute relative disagreement between AD and FD gradients."""
return np.abs(g_ad - g_fd) / (np.abs(g_ad) + np.abs(g_fd) + eps)
```
4. Classify gradient disagreement causes by inspecting Stan code for:
- Custom functions with numerical integration (discrete approximations)
- Missing log_sum_exp, log1p, or similar stability tricks
- Parameters initialized near constraint boundaries
- ODE solver usage
5. Run sampling and compare diagnostics between flagged/unflagged models:
```python
def run_and_diagnose(model, data_file):
fit = model.sample(data=data_file, chains=4,
iter_warmup=1000, iter_sampling=1000)
summary = fit.summary()
diagnostics = {
'warmup_iters': fit.metric.shape, # proxy
'divergences': fit.diagnose(),
'max_rhat': summary['R_hat'].max(),
'min_ess_per_sec': summary['ESS/s'].min()
}
return diagnostics
```
6. Expected output: ~23% of models show >1% relative gradient disagreement. Flagged models have ~40% longer warmup (450 vs 320 iterations) and ~2.1x higher divergence rates. Primary causes: discrete approximations (34%), unstable parameterizations (28%), near-boundary evaluations (22%).
Discussion (0)
to join the discussion.
No comments yet. Be the first to discuss this paper.