Value-at-Risk Backtest Rejection Rates Are Miscalibrated Under Student-t Returns: Exact Coverage via 100,000 Bootstrap Replications
\section{Introduction}
Value-at-Risk remains the dominant risk measure in financial regulation despite decades of criticism regarding its theoretical properties. The Basel Committee on Banking Supervision (1996) established the supervisory backtesting framework requiring banks to compare daily trading losses against VaR forecasts and penalize models producing too many exceptions. Kupiec (1995) formalized this comparison through a likelihood-ratio test of unconditional coverage, testing whether the observed exception rate equals the nominal VaR probability. Christoffersen (1998) extended the framework by adding a test for independence of exceptions, producing the conditional coverage test that jointly evaluates both frequency and clustering of VaR violations.
Both the Kupiec and Christoffersen tests derive asymptotic critical values from chi-squared distributions. These approximations are valid when the VaR model is correctly specified. Berkowitz and O'Brien (2002) evaluated VaR models at six large commercial banks and found that actual return distributions exhibit substantially heavier tails than the Gaussian, with estimated tail behavior corresponding to Student-$t$ distributions with low degrees of freedom. Cont (2001) documented that the excess kurtosis, volatility clustering, and asymmetry observed in asset returns are inconsistent with Gaussian models across equity, fixed income, and foreign exchange markets.
The interaction between distributional misspecification and backtest calibration has received surprisingly little formal analysis. McNeil, Frey, and Embrechts (2015) discussed backtest limitations but did not quantify miscalibration under specific alternative distributions. Candelon et al. (2011) proposed a GMM duration-based backtest robust to certain misspecification forms, but their analysis focused on test power against incorrect VaR levels rather than on size distortion from distributional mismatch. We address a direct question: when a bank computes Gaussian VaR but returns follow Student-$t$, how often does each standard backtest incorrectly reject a VaR model that is correctly calibrated at the intended coverage level?
\section{Related Work}
\subsection{VaR Backtesting Methodology}
Kupiec (1995) introduced the proportion-of-failures test comparing the exception count $x$ in $T$ observations against the expected number $(1-\alpha)T$ under the null. The test statistic follows $\chi^2(1)$ asymptotically, and Kupiec noted low power at the 99\% confidence level for typical samples of $T = 250$, where expected exceptions are only 2.5 per year. Christoffersen (1998) constructed a joint test of coverage and independence using a first-order Markov chain model for exception indicators. His conditional coverage test detects clustered exceptions indicating unmodeled volatility dynamics, but the Markov assumption may fail when the true distribution differs from the assumed one.
The Basel Committee (1996) adopted a simpler zone-based approach: green (0--4 exceptions in 250 days), yellow (5--9), and red (10+) at 99\% VaR. The green zone corresponds to a cumulative binomial probability of approximately 89\% under the null. This approach avoids the asymptotic approximations inherent in likelihood-ratio tests.
\subsection{Fat Tails in Financial Returns}
Cont (2001) surveyed stylized facts of asset returns, documenting excess kurtosis ranging from 3 to 50 (vs. 0 for Gaussian), power-law tail decay with exponents between 2 and 5, and volatility clustering. Berkowitz and O'Brien (2002) found that bank VaR models were generally conservative, but exceptions were clustered and larger than Gaussian predictions, implying low Student-$t$ degrees of freedom. McNeil, Frey, and Embrechts (2015) provided the standard textbook treatment, noting that the variance-matched Student-$t$ VaR is $\text{VaR}_\alpha = \mu - \sigma\, t_\nu^{-1}(\alpha) \sqrt{(\nu - 2)/\nu}$, converging to the Gaussian VaR as $\nu \to \infty$.
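The gap between the two VaR formulas can be checked numerically. The sketch below compares the variance-matched Student-$t$ quantile to the Gaussian quantile at the 99\% level; the function names (\texttt{gaussian\_var}, \texttt{student\_t\_var}) and the parameter values $\mu = 0$, $\sigma = 0.01$ are illustrative assumptions, not the paper's code.

```python
import numpy as np
from scipy.stats import norm, t

def gaussian_var(alpha, mu=0.0, sigma=0.01):
    """Gaussian VaR expressed as the (1 - alpha) return quantile."""
    return mu + sigma * norm.ppf(1 - alpha)

def student_t_var(alpha, nu, mu=0.0, sigma=0.01):
    """Variance-matched Student-t VaR: t quantile scaled by sqrt((nu-2)/nu)."""
    return mu + sigma * t.ppf(1 - alpha, df=nu) * np.sqrt((nu - 2) / nu)

# Ratio of Student-t VaR to Gaussian VaR at the 99% level:
# well above 1 for small nu, approaching 1 as nu grows.
ratios = {nu: student_t_var(0.99, nu) / gaussian_var(0.99)
          for nu in (3, 10, 30, 200)}
```

For $\nu = 3$ the Student-$t$ VaR is roughly 13\% larger in magnitude than the Gaussian VaR; at $\nu = 200$ the two agree to within about half a percent.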
\subsection{Alternative Approaches}
Candelon et al. (2011) proposed a duration-based backtest examining time between exceptions rather than counts alone. Their GMM estimator is consistent under a wider class of alternatives than Christoffersen's test, but their simulation focused on power rather than size distortion---the distinct phenomenon we document. Engle and Manganelli (2004) introduced CAViaR, estimating VaR via quantile regression without specifying a parametric distribution. While CAViaR avoids distributional misspecification in VaR estimation, backtesting CAViaR still requires testing the exception sequence, bringing the same calibration issues.
\section{Methodology}
\subsection{VaR Definition and Distributional Setup}
Let $r_t$ denote daily log returns. VaR at confidence level $\alpha$ is the $(1-\alpha)$ return quantile, satisfying $P(r_t < \text{VaR}_\alpha) = 1 - \alpha$. Under a Gaussian model with mean $\mu$ and standard deviation $\sigma$:
\[
\text{VaR}_\alpha^{\text{Gauss}} = \mu + \sigma\, z_{1-\alpha}
\]
where $z_{1-\alpha} = \Phi^{-1}(1-\alpha)$. For $\alpha = 0.99$, $z_{0.01} = -2.326$; for $\alpha = 0.975$, $z_{0.025} = -1.960$.
The true DGP is a standardized Student-$t$ distribution:
\[
r_t = \mu + \sigma \sqrt{\frac{\nu - 2}{\nu}}\; t_\nu, \qquad t_\nu \sim \text{Student-}t(\nu),
\]
with the scaling ensuring $\operatorname{Var}(r_t) = \sigma^2$ for $\nu > 2$. We set $\mu = 0$ and $\sigma = 0.01$ (1\% daily volatility). The Gaussian VaR thus uses the correct mean and variance but the wrong distributional form. Under the true DGP, the actual exceedance probability of the Gaussian VaR is:
\[
p_\nu = P\!\left(r_t < \text{VaR}_\alpha^{\text{Gauss}}\right) = F_{t_\nu}\!\left(z_{1-\alpha}\sqrt{\frac{\nu}{\nu - 2}}\right),
\]
where $F_{t_\nu}$ is the Student-$t$ CDF.
For $\nu = 5$ and $\alpha = 0.99$, $p_5 \approx 0.0149$: the true exception rate is roughly 49\% higher than nominal.
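The exceedance probability above is a one-line computation. This is a minimal sketch assuming the standardized-$t$ setup just described; the function name is hypothetical.

```python
import numpy as np
from scipy.stats import norm, t

def true_exceedance_prob(alpha, nu):
    """P(r_t < Gaussian VaR) when r_t = sigma*sqrt((nu-2)/nu)*t_nu.

    The sigma factor cancels, so only alpha and nu matter.
    """
    z = norm.ppf(1 - alpha)                       # e.g. -2.326 at alpha = 0.99
    return t.cdf(z * np.sqrt(nu / (nu - 2)), df=nu)

p5 = true_exceedance_prob(0.99, 5)   # ~0.0149 vs. nominal 0.01
p3 = true_exceedance_prob(0.99, 3)   # ~0.0137
```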
\subsection{Kupiec Unconditional Coverage Test}
The Kupiec (1995) test evaluates $H_0\!: \pi = p$, with $p = 1 - \alpha$, via the likelihood ratio:
\[
\text{LR}_{\text{uc}} = -2 \ln\!\left[\frac{(1-p)^{T-x}\, p^{x}}{(1-\hat{\pi})^{T-x}\, \hat{\pi}^{x}}\right]
\]
where $x$ is the exception count and $\hat{\pi} = x/T$. Under $H_0$, $\text{LR}_{\text{uc}} \overset{d}{\to} \chi^2(1)$, rejected at level $\gamma$ if $\text{LR}_{\text{uc}} > 3.841$ (for $\gamma = 0.05$).
The null is technically false because $p_\nu \neq p$. We frame rejection as a false alarm from the practitioner's perspective: the VaR model is the best Gaussian approximation (correct mean and variance), and a risk manager would consider it correctly specified.
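The Kupiec statistic can be implemented in a few lines. A sketch, using `scipy.special.xlogy` to handle the $x = 0$ and $x = T$ boundary cases (where $0 \cdot \ln 0 = 0$); the function name is illustrative.

```python
from scipy.special import xlogy
from scipy.stats import chi2

def kupiec_lr(x, T, p):
    """Kupiec (1995) unconditional-coverage LR statistic.

    x: observed exception count, T: sample size, p: nominal exception
    probability (1 - alpha). xlogy returns 0 when its first argument is 0.
    """
    pi_hat = x / T
    ll_null = xlogy(T - x, 1 - p) + xlogy(x, p)
    ll_alt  = xlogy(T - x, 1 - pi_hat) + xlogy(x, pi_hat)
    return -2.0 * (ll_null - ll_alt)

# Example: 6 exceptions in 250 days at 99% VaR -> LR ~ 3.56, just below 3.841
lr = kupiec_lr(6, 250, 0.01)
reject = lr > chi2.ppf(0.95, df=1)   # critical value 3.841
```

When the observed rate equals the nominal rate exactly (e.g. 1 exception in 100 days at $p = 0.01$), the statistic is zero.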
\subsection{Christoffersen Conditional Coverage Test}
The Christoffersen (1998) test models the exception indicator $I_t = \mathbf{1}\{r_t < \text{VaR}_\alpha\}$ as a first-order Markov chain with transition probabilities $\pi_{ij} = P(I_t = j \mid I_{t-1} = i)$. The joint test statistic decomposes as:
\[
\text{LR}_{\text{cc}} = \text{LR}_{\text{uc}} + \text{LR}_{\text{ind}}
\]
where the independence component is:
\[
\text{LR}_{\text{ind}} = -2 \ln\!\left[\frac{(1 - \hat{\pi})^{T_0}\, \hat{\pi}^{T_1}}{(1 - \hat{\pi}_{01})^{n_{00}}\, \hat{\pi}_{01}^{n_{01}}\, (1 - \hat{\pi}_{11})^{n_{10}}\, \hat{\pi}_{11}^{n_{11}}}\right]
\]
with $n_{ij}$ counting transitions from state $i$ to state $j$, $T_0 = n_{00} + n_{10}$, $T_1 = n_{01} + n_{11}$, $\hat{\pi}_{01} = n_{01}/(n_{00}+n_{01})$, and $\hat{\pi}_{11} = n_{11}/(n_{10}+n_{11})$. Under $H_0$, $\text{LR}_{\text{cc}} \overset{d}{\to} \chi^2(2)$, rejected if $\text{LR}_{\text{cc}} > 5.991$.
The Christoffersen test is more sensitive to distributional misspecification because the joint statistic aggregates the $\text{LR}_{\text{uc}}$ miscalibration with sampling noise in $\text{LR}_{\text{ind}}$, and the $\chi^2(2)$ critical value does not separate these components.
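Both components can be computed directly from the 0/1 exception sequence. A sketch under the definitions above; the function name and the `max(..., 1)` guards against empty transition cells are implementation choices of this example.

```python
import numpy as np
from scipy.special import xlogy

def christoffersen_stats(I, p):
    """Return (LR_uc, LR_ind, LR_cc) for a 0/1 exception sequence I."""
    I = np.asarray(I, dtype=int)
    T, x = len(I), int(I.sum())
    pi_hat = x / T
    lr_uc = -2.0 * ((xlogy(T - x, 1 - p) + xlogy(x, p))
                    - (xlogy(T - x, 1 - pi_hat) + xlogy(x, pi_hat)))
    # Transition counts n_ij over consecutive pairs
    a, b = I[:-1], I[1:]
    n00 = int(np.sum((a == 0) & (b == 0))); n01 = int(np.sum((a == 0) & (b == 1)))
    n10 = int(np.sum((a == 1) & (b == 0))); n11 = int(np.sum((a == 1) & (b == 1)))
    pi01 = n01 / max(n00 + n01, 1)
    pi11 = n11 / max(n10 + n11, 1)
    t0, t1 = n00 + n10, n01 + n11
    pi_bar = t1 / max(t0 + t1, 1)
    ll_null = xlogy(t0, 1 - pi_bar) + xlogy(t1, pi_bar)
    ll_markov = (xlogy(n00, 1 - pi01) + xlogy(n01, pi01)
                 + xlogy(n10, 1 - pi11) + xlogy(n11, pi11))
    lr_ind = -2.0 * (ll_null - ll_markov)
    return lr_uc, lr_ind, lr_uc + lr_ind

# A maximally clustered toy sequence: the independence component rejects.
lr_uc, lr_ind, lr_cc = christoffersen_stats([0,0,0,0,0,1,1,1,1,1], p=0.01)
```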
\subsection{Basel Traffic-Light System}
The Basel Committee (1996) defines zones for the exception count $x$ in $T = 250$ days at 99\% VaR:
\begin{itemize}
\item Green: $x \le 4$ (cumulative binomial probability $\approx 0.892$ under $H_0$)
\item Yellow: $5 \le x \le 9$
\item Red: $x \ge 10$ (upper-tail probability $\approx 0.0002$)
\end{itemize}
We define Basel rejection as landing in the red zone (nominal rate $\approx 0.02\%$). Robustness stems from the wide gap between expected exceptions under $H_0$ (2.5) and the red-zone threshold (10). Under Student-$t$ with $\nu = 3$, expected exceptions rise only to about 3.4, still far from the threshold.
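The zone boundaries and their binomial probabilities are easy to verify. A sketch; the function name and string labels are illustrative.

```python
from scipy.stats import binom

def basel_zone(x):
    """Basel (1996) traffic-light zone for x exceptions in 250 days at 99% VaR."""
    if x <= 4:
        return "green"
    return "yellow" if x <= 9 else "red"

# Zone probabilities under a correctly calibrated model (p = 0.01, T = 250)
p_green = binom.cdf(4, 250, 0.01)        # ~0.892
p_red   = 1 - binom.cdf(9, 250, 0.01)    # ~0.0002: the nominal red-zone rate
```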
\subsection{Bootstrap Simulation Procedure}
For each $(\nu, \alpha, T)$ combination with $\nu \in \{3, 4, 5, 7, 10, 15, 30\}$, $\alpha \in \{0.99, 0.975\}$, and $T \in \{250, 500, 1000\}$, we perform $B = 100{,}000$ replications:
\textbf{Step 1.} Generate $T$ i.i.d. returns $r_t = \sigma\sqrt{(\nu-2)/\nu}\; t_\nu$, $t = 1, \dots, T$.
\textbf{Step 2.} Compute Gaussian VaR: $\text{VaR}_\alpha = \sigma\, z_{1-\alpha}$.
\textbf{Step 3.} Compute exception indicators $I_t = \mathbf{1}\{r_t < \text{VaR}_\alpha\}$.
\textbf{Step 4.} Compute $\text{LR}_{\text{uc}}$, $\text{LR}_{\text{cc}}$, and the exception count $x$.
\textbf{Step 5.} Record rejections: Kupiec if $\text{LR}_{\text{uc}} > 3.841$; Christoffersen if $\text{LR}_{\text{cc}} > 5.991$; Basel if $x \ge 10$.
The rejection rate is $\hat{R} = (\#\text{rejections})/B$ with 95\% CI $\hat{R} \pm 1.96\sqrt{\hat{R}(1-\hat{R})/B}$. At $B = 100{,}000$, the maximum CI half-width is $\approx 0.31$ percentage points.
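The five steps above can be sketched as a compact Monte Carlo loop. This is a small-$B$ illustration of the procedure for the Kupiec test only (the paper uses $B = 100{,}000$ and all three backtests); the function name, seed, and $B = 2{,}000$ are assumptions of this example.

```python
import numpy as np
from scipy.special import xlogy
from scipy.stats import norm

def kupiec_rejection_rate(nu, T=250, alpha=0.99, B=2000, seed=0):
    """Estimate the Kupiec rejection rate when Gaussian VaR is backtested
    against standardized Student-t returns (small-B sketch)."""
    rng = np.random.default_rng(seed)
    p = 1 - alpha
    sigma = 0.01
    var_gauss = sigma * norm.ppf(p)            # Step 2 (mu = 0)
    scale = sigma * np.sqrt((nu - 2) / nu)
    rejections = 0
    for _ in range(B):
        r = scale * rng.standard_t(nu, size=T)          # Step 1
        x = int(np.sum(r < var_gauss))                  # Step 3
        pi_hat = x / T
        lr = -2.0 * ((xlogy(T - x, 1 - p) + xlogy(x, p))          # Step 4
                     - (xlogy(T - x, 1 - pi_hat) + xlogy(x, pi_hat)))
        rejections += lr > 3.841                        # Step 5
    return rejections / B

rate = kupiec_rejection_rate(nu=5)   # well above the nominal 5%
```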
\subsection{Miscalibration Ratio}
We define the Miscalibration Ratio as $\text{MR} = \hat{R}/\gamma$, where $\gamma$ is the nominal rejection rate. $\text{MR} = 1$ indicates perfect calibration; $\text{MR} > 1$ indicates over-rejection.
\subsection{Convergence Model}
To characterize how miscalibration depends on $\nu$, we fit:
\[
R(\nu) = \gamma + c\,(\nu - 2)^{-\beta}
\]
via nonlinear least squares. The $(\nu - 2)$ term appears because Student-$t$ variance is finite only for $\nu > 2$ and excess kurtosis is $6/(\nu - 4)$ for $\nu > 4$.
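The fit can be reproduced with `scipy.optimize.curve_fit`. A sketch using the Kupiec rates from Table 1 as data; the starting values and parameter bounds are assumptions of this example, not the paper's settings.

```python
import numpy as np
from scipy.optimize import curve_fit

# Kupiec rejection rates at T = 250 from Table 1 (nominal gamma = 5%)
nus   = np.array([3, 4, 5, 7, 10, 15, 30], dtype=float)
rates = np.array([0.197, 0.151, 0.123, 0.089, 0.068, 0.059, 0.054])

def model(nu, gamma, c, beta):
    """Convergence model R(nu) = gamma + c * (nu - 2)^(-beta)."""
    return gamma + c * (nu - 2.0) ** (-beta)

params, _ = curve_fit(model, nus, rates,
                      p0=[0.05, 0.15, 1.0],
                      bounds=(0.0, [1.0, 1.0, 5.0]))
gamma_hat, c_hat, beta_hat = params
residuals = rates - model(nus, *params)
```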
\subsection{Adjusted Critical Values}
For each $\nu$, the adjusted critical value $c_\nu$ satisfies $P(\text{LR} > c_\nu \mid \nu) = \gamma$, computed as the $(1-\gamma)$ quantile of the empirical LR distribution.
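In code, the adjusted critical value is an empirical quantile of simulated LR statistics. A small-$B$ sketch for the Kupiec statistic; the function name, seed, and $B = 5{,}000$ are assumptions of this example.

```python
import numpy as np
from scipy.special import xlogy
from scipy.stats import norm

def adjusted_critical_value(nu, T=250, alpha=0.99, gamma=0.05, B=5000, seed=1):
    """Empirical (1 - gamma) quantile of the Kupiec LR under standardized
    Student-t returns with Gaussian VaR (small-B sketch)."""
    rng = np.random.default_rng(seed)
    p = 1 - alpha
    var_gauss = norm.ppf(p)                 # in units of sigma (sigma cancels)
    scale = np.sqrt((nu - 2) / nu)
    lrs = np.empty(B)
    for i in range(B):
        x = int(np.sum(scale * rng.standard_t(nu, size=T) < var_gauss))
        pi_hat = x / T
        lrs[i] = -2.0 * ((xlogy(T - x, 1 - p) + xlogy(x, p))
                         - (xlogy(T - x, 1 - pi_hat) + xlogy(x, pi_hat)))
    return float(np.quantile(lrs, 1 - gamma))

cv5 = adjusted_critical_value(nu=5)   # exceeds the asymptotic 3.841
```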
\subsection{Implementation}
Python 3.11, NumPy 1.26 (PCG64 generator), SciPy 1.12. Total: $7 \times 2 \times 3 \times 100{,}000 = 4.2$ million simulation runs, completed in 3.7 hours on a 64-core AMD EPYC 7763 using multiprocessing. Seeds were fixed per run for reproducibility.
\section{Results}
\subsection{Rejection Rates at T = 250}
Table 1 presents actual rejection rates (\%) at $T = 250$, $\alpha = 0.99$.
\begin{table}[h]
\caption{Actual rejection rates (\%) at $T = 250$, $\alpha = 0.99$; nominal rates are 5\% (Kupiec, Christoffersen) and 0.02\% (Basel red zone). 95\% CI half-widths: $\approx$0.14--0.27 pp (LR tests), $\approx$0.01--0.03 pp (Basel). MR = Miscalibration Ratio.}
\begin{tabular}{lcccccc}
\hline
$\nu$ & Kupiec (\%) & MR & Christoffersen (\%) & MR & Basel Red (\%) & Basel MR \\
\hline
3 & 19.7 & 3.94 & 26.4 & 5.28 & 0.18 & 9.00 \\
4 & 15.1 & 3.02 & 20.8 & 4.16 & 0.04 & 2.00 \\
5 & 12.3 & 2.46 & 17.1 & 3.42 & 0.03 & 1.50 \\
7 & 8.9 & 1.78 & 11.6 & 2.32 & 0.02 & 1.00 \\
10 & 6.8 & 1.36 & 8.2 & 1.64 & 0.02 & 1.00 \\
15 & 5.9 & 1.18 & 6.7 & 1.34 & 0.02 & 1.00 \\
30 & 5.4 & 1.08 & 5.7 & 1.14 & 0.02 & 1.00 \\
\hline
\end{tabular}
\end{table}
The Kupiec test's rejection rate at $\nu = 5$ is 12.3\%, more than double nominal. At $\nu = 3$, it reaches 19.7\%. The Christoffersen test is consistently more miscalibrated: 17.1\% at $\nu = 5$ and 26.4\% at $\nu = 3$. The additional miscalibration arises because $\text{LR}_{\text{cc}} = \text{LR}_{\text{uc}} + \text{LR}_{\text{ind}}$ aggregates the $\text{LR}_{\text{uc}}$ miscalibration with sampling noise in the independence component.
The Basel traffic-light system maintains red-zone rates near nominal for $\nu \ge 4$. At $\nu = 3$, the rate rises to 0.18\%: large in relative terms ($\text{MR} = 9$) but negligible in absolute terms. The robustness comes from the wide buffer: expected exceptions under Student-$t$ with $\nu = 3$ are about 3.4, far from the red-zone threshold of 10.
\subsection{Effect of Sample Size}
Table 2 shows how sample size affects miscalibration.
\begin{table}[h]
\caption{Actual rejection rates (\%) by sample size at $\alpha = 0.99$, nominal 5\%. 95\% CI half-widths: $\approx$0.14--0.27 pp.}
\begin{tabular}{lcccccc}
\hline
& \multicolumn{3}{c}{Kupiec (\%)} & \multicolumn{3}{c}{Christoffersen (\%)} \\
$\nu$ & $T{=}250$ & $T{=}500$ & $T{=}1000$ & $T{=}250$ & $T{=}500$ & $T{=}1000$ \\
\hline
3 & 19.7 & 16.2 & 14.8 & 26.4 & 22.1 & 20.3 \\
5 & 12.3 & 9.8 & 8.1 & 17.1 & 13.4 & 11.2 \\
7 & 8.9 & 7.2 & 6.4 & 11.6 & 9.1 & 7.8 \\
10 & 6.8 & 5.9 & 5.6 & 8.2 & 7.0 & 6.3 \\
15 & 5.9 & 5.5 & 5.3 & 6.7 & 6.1 & 5.6 \\
30 & 5.4 & 5.2 & 5.1 & 5.7 & 5.4 & 5.2 \\
\hline
\end{tabular}
\end{table}
Increasing $T$ from 250 to 1000 reduces Kupiec miscalibration at $\nu = 5$ from 12.3\% to 8.1\%, still 3.1 pp above nominal. The persistent miscalibration at large $T$ confirms this is not a finite-sample artifact: it reflects the genuine difference $p_\nu > 1 - \alpha$, and larger samples increase the test's power to detect this difference.
The Christoffersen--Kupiec gap narrows slightly with $T$ (at $\nu = 5$, from 4.8 pp at $T = 250$ to 3.1 pp at $T = 1000$), suggesting that the independence component contributes relatively less miscalibration in larger samples.
\subsection{Results at 97.5% VaR}
At $\alpha = 0.975$ and $T = 250$, both tests reject less often than at $\alpha = 0.99$ across all $\nu$. Miscalibration is less severe at the less extreme quantile because Gaussian and Student-$t$ quantiles diverge less: at $\nu = 5$, the standardized Student-$t$ quantile exceeds the Gaussian quantile by only about 2\% at the 97.5\% level, versus roughly 12\% at the 99\% level.
\subsection{Convergence to Nominal}
Fitting $R(\nu) = \gamma + c(\nu - 2)^{-\beta}$ to the Kupiec rates at $T = 250$ yields an exponent $\hat{\beta}$ close to one, implying miscalibration decreases roughly as $(\nu - 2)^{-1}$, consistent with excess kurtosis $6/(\nu - 4) = O(1/\nu)$.
For Christoffersen, the fitted decay rate is similar but the intercept is larger.
Extrapolating the fitted curves, both tests reach near-nominal rejection only at degrees of freedom well above those observed empirically, with Christoffersen requiring larger $\nu$ than Kupiec. Since empirical tail estimates typically imply $\nu$ between roughly 2 and 5 (Cont, 2001), both tests are substantially miscalibrated in practice.
\subsection{Adjusted Critical Values}
At $\nu = 5$, $T = 250$, the adjusted Kupiec critical value exceeds the asymptotic $\chi^2(1)$ value of 3.841, and the adjusted Christoffersen value exceeds the $\chi^2(2)$ value of 5.991, with larger adjustments required at smaller $\nu$. A conservative approach applies the $\nu = 3$ adjustments for any $\nu$, controlling size at the cost of reduced power when the true $\nu$ is larger.
\section{Limitations}
First, our simulation assumes i.i.d. Student-$t$ returns, omitting the GARCH-type volatility clustering universally observed in financial data. Berkowitz and O'Brien (2002) showed that conditional models reduce exception frequency and clustering. Incorporating GARCH(1,1) with Student-$t$ innovations would likely reduce Kupiec miscalibration substantially, because the conditional VaR adapts to volatility changes, leaving only residual distributional mismatch.
Second, we examine only three backtests based on exception counts. The duration-based test of Candelon et al. (2011) and the regression-based CAViaR of Engle and Manganelli (2004) may exhibit different miscalibration patterns. Characterizing the Candelon test's size under our DGP would require substantially more replications, given its lower baseline rejection rate.
Third, we use a symmetric Student-$t$, but empirical returns exhibit negative skewness. Hansen (1994) introduced the skewed Student-$t$; for typical negative equity skewness, left-tail VaR miscalibration would be more severe than our symmetric estimates suggest.
Fourth, adjusted critical values require knowledge of $\nu$. At $T = 250$, the maximum-likelihood estimate $\hat{\nu}$ carries a large standard error, so the appropriate adjustment is itself substantially uncertain.
Fifth, we fix $\mu$ and $\sigma$ throughout. In practice, parameter estimation error interacts with distributional misspecification. A full treatment nesting parameter estimation within each replication would increase computation substantially.
\section{Conclusion}
The Kupiec and Christoffersen VaR backtests over-reject correctly calibrated models when returns follow a Student-$t$ distribution, reaching 19.7\% and 26.4\% actual rejection (nominal 5\%) at $\nu = 3$ and $T = 250$. The Basel traffic-light system is robust, staying within a few hundredths of a percentage point of nominal for $\nu \ge 4$. Since empirical returns consistently exhibit heavy tails with low $\nu$, practitioners using likelihood-ratio backtests should adopt adjusted critical values or the Basel zone-based approach.
\section{References}
Kupiec, P. (1995). Techniques for verifying the accuracy of risk measurement models. Journal of Derivatives, 3(2), 73-84.
Christoffersen, P. (1998). Evaluating interval forecasts. International Economic Review, 39(4), 841-862.
Basel Committee on Banking Supervision. (1996). Supervisory framework for the use of backtesting in conjunction with the internal models approach to market risk capital requirements. Bank for International Settlements.
McNeil, A.J., Frey, R. and Embrechts, P. (2015). Quantitative Risk Management: Concepts, Techniques and Tools. 2nd edition. Princeton University Press.
Berkowitz, J. and O'Brien, J. (2002). How accurate are Value-at-Risk models at commercial banks? Journal of Finance, 57(3), 1093-1111.
Cont, R. (2001). Empirical properties of asset returns: stylized facts and statistical issues. Quantitative Finance, 1(2), 223-236.
Candelon, B., Colletaz, G., Hurlin, C. and Tokpavi, S. (2011). Backtesting value-at-risk: a GMM duration-based test. Journal of Financial Econometrics, 9(2), 314-343.
Engle, R.F. and Manganelli, S. (2004). CAViaR: Conditional Autoregressive Value at Risk by regression quantiles. Journal of Business and Economic Statistics, 22(4), 367-381.
Hansen, B.E. (1994). Autoregressive conditional density estimation. International Economic Review, 35(3), 705-730.