What is the (fully rigorous) definition of a confidence interval?

Question

In a nutshell: what is the (fully rigorous) definition of a confidence interval?

On page $92$ of Wasserman's All of Statistics, it is written that

A $1 − α$ confidence interval for a parameter $θ$ is an interval $C_n = (a, b)$ where $a = a(X_1,...,X_n)$ and $b = b(X_1,...,X_n)$ are functions of the data such that $$P_θ(θ ∈ C_n) ≥ 1 − α, \ \ \ \ \text{ for all } θ ∈ Θ.$$ In words, $(a, b)$ traps $θ$ with probability $1 − α$. We call $1 − α$ the coverage of the confidence interval. Warning! $C_n$ is random and $θ$ is fixed.

I cannot understand the expression $P_\theta(\theta\in C_n)$. In general, if we have a random variable $X:(\Omega,\mathcal{F},P)\to (\mathbb{R},\mathcal{B})$, we define $$P(X\in S) := X_*P(S) = P(X^{-1}(S))$$ for $S$ in the Borel $\sigma$-algebra $\mathcal{B}$. Note the expression "$P(X\in S)$" requires that

$X$ be a random variable.
$S$ be a fixed set.

Neither of these conditions seems to be met by the expression "$P(\theta\in C_n)$", as

$\theta$ is an element of the parameter space $\Theta$, which is itself a subset of $\mathbb{R}^n$ for some $n$. That is, it seems to me that $\theta$ is a (fixed) vector, not a function (and thus not a random variable either).
As $a$ and $b$ are functions of $X_1,\ldots,X_n$, the interval $C_n := (a,b)$ seems to be "variable", when it should be a fixed set for the expression to make sense.

In case it is relevant, on page $89$ Wasserman explains that

... $P_\theta(X\in A) = \int_A f(x;\theta) dx$ ...

which does make sense, as $\theta$ is fixed here, so that $f(x;\theta)$ is some random variable, while $A$ is a fixed set. However, the author is using the expression $P_\theta$ differently in the main (first) quote of this post.

I remain confused after reading this and this post.

It might help you to study some examples of confidence intervals to see how this definition makes sense. — StubbornAtom, May 29 '24 at 10:15
In your "random variable $X:(\Omega,\mathcal{F},P)\to (\mathbb{R},\mathcal{B})$" it would be better to have $\to (E,\mathcal{B})$; here $E$ is the set of possible confidence intervals. You then have $C=X$ (a random variable) while $S$ is the subset of $E$ of confidence intervals that cover $\theta$ (fixed for any particular $\theta$). — Henry, May 29 '24 at 11:07
@Henry what would $\mathcal{B}$ be then? It is supposed to be a $\sigma$-algebra of $E$, so the Borel $\sigma$-algebra of $\mathbb{R}$ won't do. — Sam, May 29 '24 at 12:24
If you see a confidence interval as described by a pair of real numbers then $E$ can be seen as equivalent to a subset of $\mathbb R^2$ and you can use a subset of $\mathscr B(\mathbb R)\times \mathscr B(\mathbb R)$ — Henry, May 29 '24 at 15:56
Does this answer your question? Definition of confidence interval — Snoop, May 29 '24 at 17:36
I guess another way to see the difference between a fixed set $S$ and a random set $C(\omega)$ is to compare $$P[\{\omega \in \Omega : \theta \in S\}] \in \{0,1\}$$ $$P[\{\omega \in \Omega : \theta \in C(\omega)\}]$$ where the first is true since the set $\{\omega \in \Omega : \theta \in S\}$ is equal to $\Omega$ if $\theta \in S$, and equal to the empty set otherwise. In either case, it is the same $P:\mathcal{F}\rightarrow [0,1]$ that is being used, just like it is the same $P$ that is used to evaluate $P[X \in A]$ and $P[\theta \in C_n]$. — Michael, May 30 '24 at 15:31
Somewhat orthogonally to your question: do note that Wasserman's book is meant to be a quick sketch through the field (and does an admirable job at this). If you're interested in fleshing out the details, pick up a text on mathematical statistics instead, "All of Statistics" is not the right source. — stochasticboy321, May 31 '24 at 01:46
@Snoop I think the old question you provided is related but not a duplicate of the current question because that question presents three specific options and mainly asks which of the three options A, B, or C is correct, and the existing answers discuss the three options. — Amir, Jun 02 '24 at 16:44

Amir · Accepted Answer · 2024-06-03T08:49:51.907

Let $$T_1=g_1(X_1,\dots,X_n)$$ $$T_2=g_2(X_1,\dots,X_n)$$ be two statistics where $$X_1,\dots,X_n\sim F_\theta$$ for some unknown parameter $\theta \in \Theta$. Then, $[T_1,T_2]$ is called a confidence interval (also called interval estimator) with confidence level $1-\alpha$ for parameter $\theta$ if

$$T_1\le T_2$$ $$\mathbb P\left ( [T_1,T_2] \ni \theta \right )=1-\alpha, \, \theta \in \Theta.$$

Note that the random interval $[T_1,T_2]$ includes the constant parameter $\theta$ with probability $1-\alpha$.
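The statement that the random interval covers the fixed $\theta$ with probability $1-\alpha$ can be checked by simulation. The following is a minimal sketch, not part of the original answer: it assumes a normal sample with known standard deviation `sigma` and uses the textbook interval $\bar X \pm z_{1-\alpha/2}\,\sigma/\sqrt{n}$. The function names, the sample size, and the values of `theta` are illustrative choices.

```python
import random
from statistics import NormalDist, fmean

def z_interval(sample, sigma, alpha):
    """Return (T1, T2) = sample mean -/+ z_{1-alpha/2} * sigma / sqrt(n)."""
    n = len(sample)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sigma / n ** 0.5
    centre = fmean(sample)
    return centre - half_width, centre + half_width

def estimated_coverage(theta, sigma=1.0, n=10, alpha=0.05, reps=20_000, seed=0):
    """Fraction of simulated samples whose interval contains the fixed theta."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        # theta stays fixed; only the sample (and hence the interval) is random
        sample = [rng.gauss(theta, sigma) for _ in range(n)]
        t1, t2 = z_interval(sample, sigma, alpha)
        hits += t1 <= theta <= t2
    return hits / reps

for theta in (-3.0, 0.0, 7.5):
    print(theta, estimated_coverage(theta))
```

Each printed estimate should land near $0.95$ whatever value of `theta` is used, which is the "for all $\theta \in \Theta$" part of the definition.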

PS1: The above can be extended for a vector of parameters as a confidence region defined based on $X_1,\dots,X_n$, which can be a hypercube, an ellipsoid, etc.

PS2: In some cases, it is impossible to find $T_1 \le T_2$ for which all the (coverage) probabilities $\mathbb P\left ( [T_1,T_2] \ni \theta \right ),\, \theta \in \Theta$ are equal, specifically for discrete distributions for which no exact (vs. asymptotic) pivotal quantity is available. Hence, a possible extension of the concept of confidence interval is to consider the following weaker condition

$$\inf_{\theta \in \Theta}\left \{\mathbb P\left ( [T_1,T_2] \ni \theta \right ) \right \}=1-\alpha$$

instead of $\mathbb P\left ( [T_1,T_2] \ni \theta \right )=1-\alpha, \theta \in \Theta$.

When $\inf_{\theta \in \Theta}\left \{\mathbb P\left ( [T_1,T_2] \ni \theta \right ) \right \}\ge 1-\alpha$, in some references, $1-\alpha$ is considered as the confidence level of the interval $[T_1,T_2]$ (what we called confidence level above is sometimes called the confidence coefficient of $[T_1,T_2]$ in these references). This definition of confidence level can be confusing and should be used carefully because if the confidence level of an interval is $0.95$ (in this sense), then its confidence level can also be $0.90$, $0.93$, or any other number less than $0.95$. Also, according to this definition, the set $\Theta$ (whose confidence coefficient is $1$) is a confidence set whose confidence level can be any number from $0$ to $1$.

PS3: When the confidence level $1-\alpha$ is guaranteed as $n\to \infty$, then $[T_1, T_2]$ is called an asymptotic confidence interval. It is a useful concept because for the parameters of discrete distributions, it is often possible to construct asymptotic pivotal quantities, e.g. by CLT.
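To illustrate the asymptotic case, here is a hedged sketch that computes the exact coverage of a CLT-based interval for a binomial proportion at several sample sizes. The Wald interval $\hat p \pm z_{1-\alpha/2}\sqrt{\hat p(1-\hat p)/n}$ is my choice of example and is not named in the answer; `binom_pmf` and `wald_coverage` are illustrative helper names.

```python
from math import exp, lgamma, log, sqrt
from statistics import NormalDist

def binom_pmf(k, n, p):
    """Binomial(n, p) probability of k successes, computed in log space."""
    log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log(1 - p))
    return exp(log_pmf)

def wald_coverage(p, n, alpha=0.05):
    """Exact P_p(T1 <= p <= T2) for the Wald interval, summed over all outcomes."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    total = 0.0
    for k in range(n + 1):
        p_hat = k / n
        half = z * sqrt(p_hat * (1 - p_hat) / n)
        # add the probability of every outcome whose interval contains p
        if p_hat - half <= p <= p_hat + half:
            total += binom_pmf(k, n, p)
    return total

for n in (10, 100, 1000, 5000):
    print(n, round(wald_coverage(0.3, n), 4))
```

For small $n$ the coverage falls noticeably short of $0.95$, and it approaches $0.95$ as $n$ grows, though not monotonically, because the sample space is discrete.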

PS4: To compute the probability $\mathbb P\left ( [T_1,T_2] \ni \theta \right )$, one can use each of the original measure $P_\theta$, pushforward measure $X_*(P_\theta)$ of $P_\theta$ by vector $X=(X_1,\dots,X_n)$, or pushforward measure $T_*(X_*(P_\theta))$ of $X_*P_\theta$ by vector $T=(T_1,T_2)$ as follows: $$\mathbb P\left ( [T_1,T_2] \ni \theta \right )=P_\theta \left ( X^{-1}\left(T_1^{-1} (-\infty,\theta] \cap T_2^{-1} [\theta, \infty) \right )\right )=X_*(P_\theta) \left (T_1^{-1} (-\infty,\theta] \cap T_2^{-1} [\theta, \infty)\right )=T_*(X_*(P_\theta)) \left ((-\infty,\theta] \times [\theta, \infty)\right ).$$

When $X_1,\dots,X_n$ are independent, which is the case for a random sample, $X_*(P_\theta)$ is the product measure $\bigotimes_{i = 1}^{n} {X_{i}}_*(P_\theta)$.
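The equality of the three expressions in PS4 can be verified on a finite toy model. This is only a sketch under assumptions of my own: $\Omega=\{0,1\}^3$ with the product Bernoulli($\theta$) measure, the data $X$ is the number of heads, and $(T_1,T_2)=(X/3-2/5,\;X/3+2/5)$. The helper `pushforward` and the value $\theta=1/2$ are illustrative.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

theta = Fraction(1, 2)  # fixed parameter (illustrative value)

# Original space: Omega = {0,1}^3 with the product Bernoulli(theta) measure.
omega_space = list(product((0, 1), repeat=3))

def p_omega(omega):
    heads = sum(omega)
    return theta ** heads * (1 - theta) ** (3 - heads)

def X(omega):
    """The data: number of heads."""
    return sum(omega)

def T(x):
    """The pair of statistics (T1, T2) as a function of the data."""
    centre = Fraction(x, 3)
    return centre - Fraction(2, 5), centre + Fraction(2, 5)

def pushforward(measure, mapping):
    """Push a finite measure {point: mass} forward through `mapping`."""
    image = defaultdict(Fraction)
    for point, mass in measure.items():
        image[mapping(point)] += mass
    return dict(image)

P = {omega: p_omega(omega) for omega in omega_space}  # P_theta on Omega
XP = pushforward(P, X)                                 # X_* P_theta on {0, 1, 2, 3}
TXP = pushforward(XP, T)                               # T_* X_* P_theta on pairs

def covers(pair):
    t1, t2 = pair
    return t1 <= theta <= t2

via_omega = sum(m for w, m in P.items() if covers(T(X(w))))
via_data = sum(m for x, m in XP.items() if covers(T(x)))
via_stat = sum(m for pair, m in TXP.items() if covers(pair))
print(via_omega, via_data, via_stat)
```

All three sums agree (each equals $3/4$ in this toy model), because they measure the same event on three different spaces.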

Thank you, it's all starting to make sense now, but what is the meaning of $P(\theta\in [T_1,T_2])$? Is it merely $P(T_1^{-1}(-\infty,\theta]\cap T_2^{-1}[\theta,\infty))$? That is, we are using neither pushforward measure $(T_1)_*P$ nor $(T_2)_*P$ when defining $P(\theta\in [T_1,T_2])$, right? — Sam, May 30 '24 at 14:09
@Sam Yes, being more precise, we can simply use the product measure $P_{n,\theta}$ obtained from the pushforward measures of $X_1, \dots, X_n $ to compute the probability $\mathbb P\left ( [T_1,T_2] \ni \theta \right )$ as $P_{n,\theta} \left (T_1^{-1} (-\infty,\theta] \cap T_2^{-1} [\theta, \infty)\right ).$ — Amir, May 30 '24 at 14:42
This is using a pushforward, specifically of the distribution of $X_1^n$ through the map $\mathbf{T}: \mathbb{R}^n \to \mathbb{R}^2$ which sends $x_1^n$ to $(T_1(x_1^n), T_2(x_1^n))$ (where I'm denoting a vector, and not an interval, although one could define it the latter way if so desired). This is the standard way to develop joint distributions of more than one statistic, of course. In any case, note that the concept of a confidence interval is not at all tied to a product measure structure. — stochasticboy321, May 31 '24 at 01:45
Sure, to compute $\mathbb P\left ( [T_1,T_2] \ni \theta \right )$ each of the original measure $P_\theta$, pushforward $X_*(P_\theta)$ of $P_\theta$ by vector $X=(X_1,\dots,X_n)$, or pushforward $T_*(X_*(P_\theta))$ of $X_*P_\theta$ by vector $(T_1,T_2)$, can be used as follows: $$\mathbb P\left ( [T_1,T_2] \ni \theta \right )=P_\theta \left ( X^{-1}\left(T_1^{-1} (-\infty,\theta] \cap T_2^{-1} [\theta, \infty) \right )\right )=X_*(P_\theta) \left (T_1^{-1} (-\infty,\theta] \cap T_2^{-1} [\theta, \infty)\right )=T_*(X_*(P_\theta)) \left ((-\infty,\theta] \times [\theta, \infty)\right ).$$ — Amir, May 31 '24 at 08:37
When $X_1,\dots,X_n$ are independent, which is the case for a random sample, $X_*(P_\theta)$ is the product measure $\bigotimes_{i = 1}^{n} {X_{i}}_*(P_\theta)$, denoted by $P_{n,\theta}$ in my earlier comment. — Amir, May 31 '24 at 08:38

Mittens · Answer 2 · 2024-05-30T15:52:48.797

Here is a slightly more general notion of confidence set. At issue is that statements such as $P[\theta \in C(X)]$ are not really probabilistic statements about $\theta$, since in the classical (frequentist) paradigm, the parameters are deterministic (although unknown).

We use a little bit of measure theory, in particular, the notion of Cartesian product of $\sigma$-algebras.

In the background, one has a measurable space $(\Omega,\mathscr{F})$ and a family of probability measures $\{\mathbb{P}_\theta:\theta\in\Theta\}$ (for simplicity, parametrized by indices $\theta$ in some nice set $\Theta$).
Suppose that $\Theta$ itself is equipped with a $\sigma$-algebra $\mathscr{T}$. For example, in many applications $(\Theta,\mathscr{T})$ is a Borel subspace of $(\mathbb{R}^d,\mathscr{B}(\mathbb{R}^d))$.
Suppose $\mathbf{X}=(X_1,\ldots,X_n)$ is a sample of size $n$ of random variables $X_1,\ldots, X_n$ (i.i.d. for example), and that $T(X_1,\ldots, X_n)$ is some statistic with values in some measurable space $(E,\mathscr{E})$.
For any $A\in\mathscr{F}\otimes\mathscr{E}$ (the product $\sigma$-algebra of $\mathscr{F}$ and $\mathscr{E}$), $\omega_0\in\Omega$ and $t_0\in E$ define \begin{align} A_{\omega_0}&=\{t\in E: (\omega_0,t)\in A\}\\ A^{t_0}&=\{\omega\in\Omega: (\omega,t_0)\in A\} \end{align} It is easy to check that $A_{\omega_0}\in\mathscr{E}$ and $A^{t_0}\in\mathscr{F}$.
Suppose $\tau:(\Theta,\mathscr{T})\rightarrow(E,\mathscr{E})$ is a measurable function of the parameter $\theta$, and consider the function $$C:\Omega\times\Theta\rightarrow E\times E,\qquad C(\omega,\theta)=(T(\mathbf{X}(\omega)),\tau(\theta)).$$
For any set $D\in\mathscr{E}\otimes\mathscr{E}$, $C^{-1}(D)\in \mathscr{F}\otimes\mathscr{T}$. For any $\theta\in\Theta$ \begin{align} (C^{-1}(D))^\theta&=\{\omega\in\Omega: (\omega,\theta)\in C^{-1}(D)\}=\{\omega\in\Omega:(T(\mathbf{X}(\omega)),\tau(\theta))\in D\}\\ &=\{\omega\in\Omega:\tau(\theta)\in D^{T(\mathbf{X}(\omega))}\} \end{align}

Definition: The set $D^{T(\mathbf{X})}$ (or rather $C^{-1}(D)$) is a confidence set for $\tau(\theta)$ of level $1-\alpha$ if $$\inf_{\theta\in\Theta} \mathbb{P}_{\theta}\big(\tau(\theta)\in D^{T(\mathbf{X})}\big)=\inf_{\theta\in \Theta}\mathbb{P}_\theta\big((C^{-1}(D))^\theta\big)\geq1-\alpha $$

Examples:

Suppose $\Theta=\mathbb{R}$, and $T$ is a real-valued statistic on $\mathbf{X}$. Consider $C(\omega,\theta)=\big(\theta,T(\mathbf{X}(\omega))\big)$ and $D=\{(\theta,t): \theta\leq t\}$. Notice that $$(C^{-1}(D))^\theta=\{\omega: \theta\leq T(\mathbf{X}(\omega))\} $$ Hence $(C^{-1}(D))^\theta$ defines a (one-sided) confidence interval for $\theta$ of level $1-\alpha$ if for any $\theta\in\Theta$ $$\mathbb{P}_\theta\big[\{\omega:\theta\leq T(\mathbf{X}(\omega))\}\big]\geq1-\alpha$$
Suppose $\Theta=\mathbb{R}$, and $T_1$ and $T_2$ are real-valued statistics on $\mathbf{X}$ such that $T_1(\mathbf{X})\leq T_2(\mathbf{X})$. Consider $C(\omega,\theta)=\big(\theta,T_1(\mathbf{X}(\omega)), T_2(\mathbf{X}(\omega))\big)$ and $D=\{(\theta,t_1,t_2): t_1\leq \theta\leq t_2\}$. Notice that $$(C^{-1}(D))^\theta=\{\omega: T_1(\mathbf{X}(\omega))\leq \theta\leq T_2(\mathbf{X}(\omega))\} $$ Hence $(C^{-1}(D))^\theta$ defines a (two-sided) confidence interval for $\theta$ of level $1-\alpha$ if for any $\theta\in\Theta$ $$\mathbb{P}_\theta\big[\{\omega:T_1(\mathbf{X}(\omega))\leq \theta\leq T_2(\mathbf{X}(\omega))\}\big]\geq1-\alpha$$
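The two-sided example can be made concrete with a model in which the coverage is the same for every $\theta$. The sketch below rests on assumptions that are mine, not the answer's: $X_1,\dots,X_n$ are i.i.d. uniform on $(0,\theta)$ with $\Theta=(0,\infty)$, $M=\max_i X_i$, $T_1=M$ and $T_2=M\,\alpha^{-1/n}$. Since $P_\theta(M\le \theta\,\alpha^{1/n})=\alpha$, the event $\{T_1\le\theta\le T_2\}$ has probability $1-\alpha$ under every $P_\theta$. The function `in_D` plays the role of the set $D$.

```python
import random

def uniform_interval(sample, alpha):
    """(T1, T2) = (M, M / alpha**(1/n)) with M = max(sample)."""
    n = len(sample)
    m = max(sample)
    return m, m / alpha ** (1 / n)

def in_D(theta, t1, t2):
    """Membership in D = {(theta, t1, t2): t1 <= theta <= t2}."""
    return t1 <= theta <= t2

def estimated_coverage(theta, n=5, alpha=0.05, reps=20_000, seed=1):
    """Monte Carlo estimate of P_theta(T1 <= theta <= T2)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.uniform(0.0, theta) for _ in range(n)]
        hits += in_D(theta, *uniform_interval(sample, alpha))
    return hits / reps

for theta in (0.5, 2.0, 40.0):
    print(theta, estimated_coverage(theta))
```

The estimates should sit near $0.95$ for each `theta`, so the infimum over $\theta$ in the definition is attained everywhere in this particular model.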

There are other ways to construct confidence sets, based on what is called a pivot, but I will let the OP do a little research into that.

score 3 · Answer 3 · answered May 30 '24 at 00:50

Just to get some notation really explicit, $\{P_\theta : \theta \in \Theta\}$ is an indexed family of probability distributions on $\mathbf{R}^n$, and $a$ and $b$ are measurable functions $\mathbf{R}^n \rightarrow \mathbf{R}$.

The space of open intervals can be identified with $\{(p, q) \in \mathbf{R}^2 : p < q\} \subseteq \mathbf{R}^2$. Indeed, we're forced into this identification, because the induced $\sigma$-algebra on this space is the one that ensures that the function $x \mapsto C_n(x) = (a(x), b(x))$ is measurable.

So now, we can write explicitly in the language of pushforward measures to emphasise what is random and what is fixed: \begin{align*} P_\theta(\theta \in C_n) &= P_\theta\{x : a(x) < \theta < b(x)\} \\ &= P_\theta\{x : (a(x), b(x)) \in (-\infty, \theta) \times(\theta, \infty)\} \\ &= (C_n)_* P_\theta((-\infty, \theta) \times(\theta, \infty)). \end{align*}

Michael · Answer 4 · 2024-05-30T02:46:33.403

This adds minor measure theory details to Amir's answer (which I have upvoted). For the probability space $(\Omega, \mathcal{F}, P)$, it is not $\theta$ that is random, but the set $C_n$ itself:

$\theta$ is a deterministic constant.
$X_1, ..., X_n$ are random variables (so each $X_i:\Omega\rightarrow\mathbb{R}$ is measurable).
Assume $a:\mathbb{R}^n\rightarrow\mathbb{R}$ and $b:\mathbb{R}^n\rightarrow\mathbb{R}$ are measurable functions that satisfy $a(x_1, ..., x_n)<b(x_1, ..., x_n)$ for all $(x_1, ..., x_n)\in\mathbb{R}^n$.
Define $A=a(X_1, ..., X_n)$ and $B=b(X_1, ..., X_n)$. Since $a, b$ are measurable functions, it follows that $A$ and $B$ are random variables.
Define the random set $C_n=(A,B)$.

Then $P[\theta \in C_n]$ can be equivalently written in the following ways: $$P[A<\theta < B]$$ $$P[\{A<\theta\}\cap \{B>\theta\}]$$ $$P[A^{-1}((-\infty, \theta))\cap B^{-1}((\theta, \infty))]$$ This is the probability of the intersection of two clearly defined events. The intersection of two events is again an event (because the set of events $\mathcal{F}$ is a sigma algebra). Every event has a probability because the probability measure is a function $P:\mathcal{F}\rightarrow [0,1]$. The set of events $\mathcal{F}$ can contain many more events than just the events of the type $A^{-1}(S)$ for some Borel subset $S\subseteq\mathbb{R}$ and some particular random variable $A$.
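On a finite probability space, the event $\{A<\theta\}\cap\{B>\theta\}$ can be listed outcome by outcome. The following sketch uses a toy model of my own choosing: two fair dice, $a(x_1,x_2)=\min(x_1,x_2)-1$, $b(x_1,x_2)=\max(x_1,x_2)+1$, and $\theta=3.5$. It also contrasts the random interval with a fixed interval $S$, for which the corresponding event is all of $\Omega$ or empty.

```python
from fractions import Fraction
from itertools import product

theta = Fraction(7, 2)  # fixed constant (illustrative value 3.5)

# Omega: ordered outcomes of two fair dice, all 36 equally likely.
omega_space = list(product(range(1, 7), repeat=2))

def P(event):
    """Uniform probability of a subset of omega_space."""
    return Fraction(len(event), len(omega_space))

def A(omega):
    """Lower endpoint a(x1, x2) = min(x1, x2) - 1."""
    return min(omega) - 1

def B(omega):
    """Upper endpoint b(x1, x2) = max(x1, x2) + 1, so A < B always."""
    return max(omega) + 1

event_A = {w for w in omega_space if A(w) < theta}   # A^{-1}((-inf, theta))
event_B = {w for w in omega_space if B(w) > theta}   # B^{-1}((theta, inf))
coverage_event = event_A & event_B                   # {theta in (A, B)}

# A fixed interval S does not depend on omega, so the event is Omega or empty.
S = (3, 4)
fixed_event = {w for w in omega_space if S[0] < theta < S[1]}

print(P(coverage_event))
print(P(fixed_event))
```

The first probability is strictly between $0$ and $1$, while the second is $0$ or $1$ depending only on whether the constant $\theta$ happens to lie in $S$.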

Generally speaking, the reason we define random variables, say, $Y:\Omega\rightarrow\mathbb{R}$, $W:\Omega\rightarrow\mathbb{R}$, $Z:\Omega\rightarrow\mathbb{R}$, is that it makes it easy to talk about events: $$ \{Y\leq 6.5\} \in \mathcal{F}$$ $$ \{Y\leq 6.5\}\cap \{Z\leq 4\} \in \mathcal{F}$$ $$\{Y<W\}\cup\{Z\leq 4\} \in \mathcal{F}$$ $$\{Y+W\leq 5\}\cap\{WZ=8\}^c\cap\{\cos(Z)>0.7\} \in \mathcal{F}$$ This uses the fact that $\mathcal{F}$ is a sigma algebra, and that a continuous function of a finite number of random variables is again a random variable. In particular, it holds without regard to dependence/independence between $Y, W, Z$.

Tom Loredo · Answer 5 · 2024-05-31T05:47:25.003

The other answers clarify the math in Larry's definition. I'd like to make some corrections in the descriptive text.

In words, $(a,b)$ traps $\theta$ with probability $1 - \alpha$.

As is clear from the (correct) mathematical definition on the previous line, this should read:

In words, $(a,b)$ traps $\theta$ with probability at least $1 - \alpha$.

A related error is here:

We call $1-\alpha$ the coverage of the confidence interval.

Actually, the usual definition of coverage is the $\theta$-dependent probability, $$ K(\theta) \equiv P_θ(θ ∈ C_n), $$ or, a bit more explicitly, $$ K(\theta) \equiv P_θ\left(a(\vec{X}) \le θ \le b(\vec{X})\right), $$ with $P_\theta$ denoting the probability distribution for the sample vector $\vec{X}$ when the value of the parameter $\theta$ is specified (i.e., $\vec{X}$ is random; $\theta$ is a fixed quantity).

With these definitions, the interval estimator $[a(\vec{X}), b(\vec{X})]$ has confidence level $$ \text{CL} = \inf_\theta K(\theta). $$ That is, a confidence interval is an interval-valued statistic that gives a conservative guarantee of coverage: the coverage is at least $\text{CL}$.

The reason for this somewhat convoluted construction is that for most sampling distributions, the coverage of an interval estimator will depend on $\theta$, the unknown parameter one is trying to estimate. With $\theta$ unknown, if the coverage $K(\theta)$ is not constant, one does not know what the actual coverage of the interval estimator would be for replications relevant to the situation at hand (repeated sampling with the value of $\theta$ fixed at the unknown value that generated the actually observed sample). But if the interval estimator lets you usefully bound the coverage from below, then you can guarantee the user at least a certain amount of coverage. That bound is the confidence level.

So it is important to distinguish the simple notion of coverage (which typically cannot be guaranteed) from the more complex notion of confidence level.

There is a lengthy discussion of this in the Mathematica Journal (which I confess I have not read in its entirety): Coverage versus Confidence « The Mathematica Journal.

One quite general setting where the difference between coverage and CL shows up is when the parameter space is continuous and the sample space is discrete. E.g., for binomial inference (coin flips with an unknown probability for heads), the discreteness of the sample space (the number of heads) makes the coverage jump discontinuously as a function of the parameter (the probability for heads). In the particle physics literature (where they care a lot about coverage), they have a colorful name for the jagged plots of coverage vs. parameter in such settings: they call them dinosaur plots (think of the profile of a Stegosaurus!). You can see some examples in the Mathematica Journal article, though the dinosaur shape is not very prominent. A better example is the Poisson distribution example at the end of these slides by John Conway: Intervals and Coverage (UC Davis Physics 252C). For a published (thus citable) version (but without the "dinosaur plot" terminology!), see Fig. 2 in this paper by astronomer Neil Gehrels (1986): Confidence Limits for Small Numbers of Events in Astrophysical Data - NASA/ADS.
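The jagged coverage curve for the binomial case can be computed exactly rather than simulated. The sketch below is an illustration under my own assumptions, not something taken from the linked references: it uses the Clopper-Pearson interval (found here by bisection on the binomial tail probabilities), $n=20$, and a grid of parameter values. It evaluates $K(p)$ on the grid and reports the smallest and largest values.

```python
from math import comb

def pmf(k, n, p):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def cdf(k, n, p):
    return sum(pmf(j, n, p) for j in range(k + 1))

def clopper_pearson(k, n, alpha=0.05, iters=60):
    """Clopper-Pearson limits by bisection; rounding errs on the wide side."""
    lower, upper = 0.0, 1.0
    if k > 0:
        lo, hi = 0.0, 1.0
        for _ in range(iters):
            mid = (lo + hi) / 2
            # P_p(X >= k) increases with p; keep lo where it is below alpha/2
            if 1 - cdf(k - 1, n, mid) < alpha / 2:
                lo = mid
            else:
                hi = mid
        lower = lo
    if k < n:
        lo, hi = 0.0, 1.0
        for _ in range(iters):
            mid = (lo + hi) / 2
            # P_p(X <= k) decreases with p; keep hi where it is at most alpha/2
            if cdf(k, n, mid) > alpha / 2:
                lo = mid
            else:
                hi = mid
        upper = hi
    return lower, upper

def coverage(p, n, intervals):
    """K(p) = P_p(a(X) <= p <= b(X)), summed exactly over the n + 1 outcomes."""
    return sum(pmf(k, n, p)
               for k, (lo, hi) in enumerate(intervals) if lo <= p <= hi)

n = 20
intervals = [clopper_pearson(k, n) for k in range(n + 1)]
grid = [i / 200 for i in range(1, 200)]
K = [coverage(p, n, intervals) for p in grid]
print(min(K), max(K))
```

On this grid the coverage varies with $p$ but never drops below $0.95$: the coverage $K(p)$ is not constant, and the confidence level is the lower bound.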

Aside: Your closing comments, from the perspective of $\theta$ being random and $C_n$ being a fixed interval, correspond to the notion of a credible interval in Bayesian inference, where "random" means "uncertain in the particular case at hand" (not necessarily "variable across replications"). Since you asked about confidence intervals, not credible intervals, I won't go into the details, except to note that computing a credible interval requires specifying a prior distribution for $\theta$, and instead of guaranteeing minimum coverage, the probability associated with a credible interval tells you the average coverage (exactly, not as a bound) for replications in the joint $(\theta,\vec{X})$ space, where $\theta$ is sampled according to the prior, and then $\vec{X}$ is sampled from $P_\theta(\vec{X})$ given the sampled value of $\theta$. In a sense, the (frequentist) confidence interval treats the world as adversarial, and tries to give you guarantees against the worst case, while Bayesian credible intervals seek good performance on average, rather than against the worst case (which requires you to say something about what "on average" means, via the prior). There is an insightful discussion of the relationship between confidence and credibility in this lovely paper by Jim Berger and Susie Bayarri (2004): The Interplay of Bayesian and Frequentist Analysis.
