Poor performance of Wald CIs for proportions

Author

Jacob Goldstein-Greenwood

Published

2025-04-22

Andersson, 2023, The American Statistician, “The Wald confidence interval for a binomial p as an illuminating ‘bad’ example.” DOI: 10.1080/00031305.2023.2183257.

“The standard two-sided Wald interval for the binomial p is by now well known to be of low quality with respect to actual coverage and it ought to be stricken from all courses in mathematical statistics/statistics as the recommended interval to use. However, it still lingers on in quite a few textbooks.”

The Wald confidence interval (CI) for a proportion is erratic, often straying markedly from the nominal coverage rate. Received wisdom is that the interval’s performance decays as p drifts from 0.5 toward 0 or 1 and/or as n decreases. Brown et al. (2001, Statistical Science) note that it’s more accurate to characterize the Wald interval as being subject to “lucky” and “unlucky” (p, n) pairings. For example, they show that, surprisingly, the 95% Wald interval’s coverage is higher (and closer to 95%) for (p = 0.2, n = 30) than for (p = 0.2, n = 98).
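
The "lucky"/"unlucky" pairing can be checked without simulation, because the exact coverage at a given (p, n) is just the summed binomial probability of every count whose interval contains p. The sketch below is a minimal illustration of that idea; the helper name `wald_exact_coverage()` is hypothetical and does not come from Brown et al. or from the simulation code in this post.

```r
# Exact coverage of the Wald interval at a given (p, n):
# sum the binomial probabilities of every count x whose interval contains p.
# (Hypothetical helper, written for illustration only.)
wald_exact_coverage <- function(p, n, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)
  x <- 0:n
  p_hat <- x / n
  margin <- z * sqrt(p_hat * (1 - p_hat) / n)
  covers <- (p_hat - margin <= p) & (p <= p_hat + margin)
  sum(dbinom(x, n, p)[covers])
}

cov_30 <- wald_exact_coverage(p = 0.2, n = 30)
cov_98 <- wald_exact_coverage(p = 0.2, n = 98)
round(c(n_30 = cov_30, n_98 = cov_98), 3)
```

If the comparison described above holds, the first value should be the larger of the two despite the smaller sample size.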

I simulated data from binomial distributions at each combination of p (0.05 to 0.95 in steps of 0.025) and n (25, 50, 150, 250, 350, 450, and 550), then computed 95% Wald and 95% Wilson CIs for the resulting sample proportions. I repeated the process 2500 times for each (p, n) pairing. The coverage rates, i.e., the fraction of intervals of each type that captured p at each (p, n) pairing, are below:
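
For reference, the two intervals being compared can be written as follows. This is a sketch in my own notation, where $\hat{p}$ is the sample proportion, $n$ the sample size, and $z = z_{0.975} \approx 1.96$; the Wilson expression follows the form used in the simulation code.

$$
\text{Wald:} \quad \hat{p} \pm z \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}
$$

$$
\text{Wilson:} \quad \frac{\hat{p} + \dfrac{z^2}{2n} \pm z \sqrt{\dfrac{\hat{p}(1 - \hat{p})}{n} + \dfrac{z^2}{4n^2}}}{1 + \dfrac{z^2}{n}}
$$

The Wilson interval is centered on a value pulled from $\hat{p}$ toward 0.5, which is part of why it behaves better when $\hat{p}$ is near 0 or 1.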

library(ggplot2)

# Set simulation conditions
set.seed(999)
props <- seq(.05, .95, by = .025)
n <- c(25, seq(50, 550, by = 100))
sim_conditions <- expand.grid(props, n)
colnames(sim_conditions) <- c('prop', 'n')
res <- vector('list', nrow(sim_conditions))
reps <- 2500
z_975 <- qnorm(.975)

# Run
#   Formula reference for Wilson CI:
#   https://www.itl.nist.gov/div898/handbook/prc/section2/prc241.htm
for (i in seq_len(nrow(sim_conditions))) {
  n_i <- sim_conditions$n[i]
  p_i <- sim_conditions$prop[i]
  res[[i]] <- replicate(reps, {
    # Sample proportion from n_i Bernoulli(p_i) draws
    draws <- rbinom(n_i, 1, p_i)
    p_hat <- mean(draws)

    # Wald CI: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)
    wald_margin <- z_975 * sqrt((p_hat * (1 - p_hat)) / n_i)
    wald_CI <- p_hat + c(-wald_margin, wald_margin)
    wald_captures <- p_i >= wald_CI[1] && p_i <= wald_CI[2]

    # Wilson CI: shifted center and adjusted margin, both divided by 1 + z^2 / n
    wilson_margin <- z_975 * sqrt((p_hat * (1 - p_hat)) / n_i +
                                    z_975^2 / (4 * n_i^2))
    wilson_CI <- (p_hat + z_975^2 / (2 * n_i) +
                    c(-wilson_margin, wilson_margin)) / (1 + z_975^2 / n_i)
    wilson_captures <- p_i >= wilson_CI[1] && p_i <= wilson_CI[2]

    # TRUE/FALSE per method: did the interval capture the true proportion?
    c('Wald' = wald_captures, 'Wilson' = wilson_captures)
  })
}

# Organize results
sim_conditions$coverage_Wald <- sapply(res, \(x) mean(x[1, ]))
sim_conditions$coverage_Wilson <- sapply(res, \(x) mean(x[2, ]))

d <- tidyr::pivot_longer(sim_conditions,
                         cols = c('coverage_Wald', 'coverage_Wilson'),
                         values_to = 'coverage',
                         names_to = 'method',
                         names_prefix = 'coverage_')

d$n <- as.factor(d$n)

# Plot
ggplot(d, aes(prop, coverage, color = n)) +
  geom_point() +
  geom_line() +
  facet_wrap(~method) +
  scale_x_continuous(breaks = seq(.05, .95, by = .15)) +
  ylim(.5, 1) +
  labs(x = 'Population proportion',
       y = '95% CI coverage rate',
       title = '95% Wald and Wilson CI coverage rates for proportions',
       color = 'N',
       caption = 'Each point reflects 2500 simulations') +
  theme(legend.position = 'bottom') +
  guides(color = guide_legend(nrow = 1)) +
  geom_hline(yintercept = .95)

The simulation results show, as noted by Brown et al. (2001) and Andersson (2023), that the Wald interval’s coverage is jagged: It does not decrease monotonically as p approaches 0 or 1, nor does it decrease monotonically as n decreases. That one can find occasional “lucky” (p, n) combinations does not, though, make the picture rosy: The Wald CI’s coverage is broadly degraded, and a far larger share of the Wald coverage rates calculated here fall below 95% than of the Wilson coverage rates:

under_coverage_prop <- aggregate(d$coverage,
                                 list(d$method),
                                 \(x) round(mean(x < .95), digits = 3))

knitr::kable(under_coverage_prop,
             col.names = c('CI type', 'Proportion of coverage rates < .95'))

CI type	Proportion of coverage rates < .95
Wald	0.792
Wilson	0.456
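
When reading the table, it may help to keep in mind that each coverage rate is itself an estimate from 2500 replicates and so carries Monte Carlo error. The sketch below gives a rough sense of its size, assuming independent replicates and a true coverage of exactly 0.95. That value is an assumption made for illustration, not a property of either interval.

```r
reps <- 2500
nominal <- 0.95  # assumed true coverage, for illustration only

# Standard error of a simulated coverage rate under that assumption
mc_se <- sqrt(nominal * (1 - nominal) / reps)
mc_se

# Approximate range containing ~95% of simulated rates under that assumption
nominal + c(-1, 1) * qnorm(0.975) * mc_se
```

Under that assumption, roughly half of the simulated rates would land below .95 by chance alone, which is worth bearing in mind when comparing the two rows.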

The Wilson interval is not alone as a (preferable) alternative to the Wald interval for proportions, but it is a fair go-to. It received complimentary examinations in Andersson (2023) and Brown et al. (2001), and it is Harrell’s recommended (frequentist) interval as of April 2025 in Biostatistics for Biomedical Research.
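
For everyday use, base R's `prop.test()` with `correct = FALSE` returns a score interval for a single proportion that should agree with the Wilson formula used in the simulation above. The check below is a minimal sketch: `wilson_ci()` is a hypothetical helper written for this comparison, and the counts (12 successes in 50 trials) are made up.

```r
# Hand-rolled Wilson interval (hypothetical helper, for illustration only)
wilson_ci <- function(x, n, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)
  p_hat <- x / n
  margin <- z * sqrt(p_hat * (1 - p_hat) / n + z^2 / (4 * n^2))
  (p_hat + z^2 / (2 * n) + c(-margin, margin)) / (1 + z^2 / n)
}

# Made-up example data
x <- 12
n <- 50

by_hand <- wilson_ci(x, n)
# correct = FALSE turns off the continuity correction
built_in <- as.numeric(prop.test(x, n, correct = FALSE)$conf.int)

# TRUE if the two intervals agree up to numerical tolerance
all.equal(by_hand, built_in)
```

Leaving `correct` at its default of `TRUE` applies a continuity correction and gives a somewhat wider interval than the plain Wilson formula.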