
Exercise:

Let $X_1, \dots, X_n$ be a random sample from the normal distribution $N(\theta,c\theta^2)$, where $c > 0$ is a known constant and $\theta \in \mathbb R$ is an unknown parameter.
i) Find a sufficient statistic for $\theta$.
ii) Using only the statistic $\bar{X}$, construct a $100(1 - a)\%$ confidence interval for $\theta$.

Attempt:

i) \begin{align*}p(x \mid c,\theta) &= \prod_{i=1}^n(2\pi c\theta^2)^{-1/2}\exp\big\{-(x_i-\theta)^2/(2c\theta^2)\big\}\\ &=(2\pi c\theta^2)^{-n/2}\exp\bigg\{-\frac{1}{2c\theta^2}\sum_{i=1}^n(x_i-\theta)^2\bigg\}\\ &=(2\pi c\theta^2)^{-n/2}\exp\bigg\{-\frac{1}{2c\theta^2}\bigg(\sum_{i=1}^nx_i^2 -2\theta\sum_{i=1}^nx_i+n\theta^2\bigg)\bigg\}. \end{align*} Thus, we can continue and identify a sufficient statistic via Fisher's factorization theorem.

ii) How would one proceed to find a confidence interval for $\theta$ as asked?


2 Answers


After some days, I managed to work out a complete answer, and I am posting it for the sake of the bounty set by Clarinetist.

$\textbf{i)}$

\begin{align*} p(x \mid c,\theta) &= \prod_{i=1}^n(2\pi c\theta^2)^{-1/2}\exp\big\{ -(x_i-\theta)^2/(2c\theta^2)\big\}\\ &=(2\pi c\theta^2)^{-n/2}\exp\bigg\{-\frac{1}{2c\theta^2}\sum_{i=1}^n(x_i-\theta)^2\bigg\}\\ &=(2\pi c\theta^2)^{-n/2}\exp\bigg\{-\frac{1}{2c\theta^2}\bigg(\sum_{i=1}^nx_i^2 -2\theta\sum_{i=1}^nx_i+n\theta^2\bigg)\bigg\}\\ &=(2\pi c \theta^2)^{-n/2}\exp\bigg\{-\frac{1}{2c\theta^2}\sum_{i=1}^nx_i^2+\frac{1}{c\theta}\sum_{i=1}^nx_i - \frac{n}{2c}\bigg\}\\ &=(2\pi c \theta^2)^{-n/2}\exp\bigg\{ -\frac{1}{2c\theta^2}\sum_{i=1}^nx_i^2+\frac{1}{c\theta}\sum_{i=1}^nx_i \bigg\}\cdot \exp\bigg\{-\frac{n}{2c}\bigg\} \end{align*}

Recall that the Fisher-Neyman factorization criterion states that if the probability function $p(\mathbf{x}\mid\theta)$ can be written as $p(\mathbf x\mid \theta)=G(\mathbf t,\theta)H(\mathbf x)$, where $\mathbf t(\mathbf x) = (t_1(\mathbf x), \dots, t_k(\mathbf x))^\mathbf T$, then the function $\mathbf t(\mathbf x)$ is sufficient for the parameter $\theta$ over the statistical model $\{ X, \mathcal X, p(x\mid\theta), \theta \in \Theta\}$.

For our specific case, consider the functions:

$$G(\mathbf t, \theta) =(2\pi c \theta^2)^{-n/2}\exp\bigg\{-\frac{1}{2c \theta^2}t_2(\mathbf x) + \frac{1}{c\theta}t_1(\mathbf x)\bigg\}$$

$$H(\mathbf x) = \exp\bigg\{-\frac{n}{2c}\bigg\}$$

Indeed, our probability function can then be written as the product of these two, with $t_1(\mathbf x)$ and $t_2(\mathbf x)$ given by:

$$t_1(\mathbf x) = \sum_{i=1}^n x_i, \quad t_2(\mathbf x) = \sum_{i=1}^n x_i^2$$

Note that $c$ is a known constant with $c>0$, which is why we can apply the Fisher-Neyman factorization criterion even though $c$ appears in these expressions.

Thus, a sufficient statistic for the given model is:

$$\mathbf t(x) = (t_1(\mathbf x), t_2(\mathbf x))^\mathbf T=\bigg(\sum_{i=1}^n x_i, \sum_{i=1}^n x_i^2\bigg)^\mathbf T$$
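As a quick numerical sanity check of this factorization, here is a minimal Python sketch (with arbitrary illustrative values of $n$, $c$ and $\theta$, not taken from the thread) comparing the log-likelihood computed term by term with the factored form $\log G(\mathbf t,\theta) + \log H(\mathbf x)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, c, theta_true = 20, 0.5, 2.0                      # illustrative values only
x = rng.normal(theta_true, np.sqrt(c) * theta_true, size=n)

t1, t2 = x.sum(), (x ** 2).sum()                     # the sufficient statistic (t1, t2)

for theta in (0.5, 1.0, 2.0, 4.0):
    # log-likelihood computed directly, term by term
    direct = np.sum(-0.5 * np.log(2 * np.pi * c * theta ** 2)
                    - (x - theta) ** 2 / (2 * c * theta ** 2))
    # factored form: log G(t, theta) + log H(x)
    log_G = (-0.5 * n * np.log(2 * np.pi * c * theta ** 2)
             - t2 / (2 * c * theta ** 2) + t1 / (c * theta))
    log_H = -n / (2 * c)
    assert np.isclose(direct, log_G + log_H)
```

The check passes for every $\theta$ on the grid, confirming that the likelihood depends on the data only through $(t_1, t_2)$.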

$\textbf{ii)}$

For a random sample $\mathbf X = (X_1, \dots, X_n)^\mathbf T$ from $\{\mathbf X, \mathbb R, \mathbf N(μ,σ^2),(μ,σ) \in \mathbb R \times \mathbb R^+\}$, we have:

$$T = \frac{\bar{X}-μ}{S/\sqrt{n}} \sim \mathbf{St}(n-1)$$

Thus, it is possible to find an interval $(c_1,c_2) \subset \mathbb R$ such that:

$$\mathbb P\bigg[c_1 < \frac{\bar{X}-μ}{S/\sqrt{n}} < c_2 \bigg] = 1-a$$

Because Student's $t$ distribution is symmetric around $0$, the interval $(c_1,c_2)$ has minimum length when $-c_1=c_2=t_{n-1,a/2}$, where $t_{n-1,a/2}$ is such that $\mathbb P [ T > t_{n-1,a/2}] = \frac{1}{2}\mathbb P[|T| > t_{n-1,a/2}]=a/2$ with $T\sim \mathbf{St}(n-1)$. Thus, with probability $\gamma = 1-a$, we have the relation:

$$-t_{n-1,a/2} < \frac{\bar{X}-μ}{S/\sqrt{n}} < t_{n-1,a/2}$$

from which the $100\gamma\%$ confidence interval for the mean $μ$ is:

$$\bar{X}-t_{n-1,a/2}S/\sqrt{n} < μ < \bar{X} + t_{n-1,a/2}S/\sqrt{n}$$

In our specific exercise, $μ=\theta$, and thus the $100\gamma\% = 100(1-a)\%$ confidence interval for $\theta$ is:

$$\bar{X}-t_{n-1,a/2}S/\sqrt{n} < \theta < \bar{X} + t_{n-1,a/2}S/\sqrt{n}$$

where we have only used the statistic $\bar{X}$, since our expression consists of $\bar{X}$ and also of $S$, which is:

$$S = \sqrt{\frac{\sum_{i=1}^n (X_i-\bar{X})^2}{n-1}}$$
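For concreteness, here is a minimal Python sketch of this $t$-interval, using a simulated sample with illustrative values of $\theta$, $c$ and $a$; note that, as the comments below point out, it uses $S$ in addition to $\bar{X}$:

```python
import numpy as np
from scipy import stats

def t_confidence_interval(x, alpha=0.05):
    """Two-sided 100(1 - alpha)% t-interval: xbar +/- t_{n-1, alpha/2} * S / sqrt(n)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    xbar, s = x.mean(), x.std(ddof=1)                # sample mean and S
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)    # upper alpha/2 quantile of St(n-1)
    half = t_crit * s / np.sqrt(n)
    return xbar - half, xbar + half

# illustrative use with a simulated N(theta, c * theta^2) sample, theta = 2, c = 0.5
rng = np.random.default_rng(1)
sample = rng.normal(2.0, np.sqrt(0.5) * 2.0, size=30)
print(t_confidence_interval(sample, alpha=0.05))
```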

Rebellos
  • First, it seems to be $\sum\limits_{k=1}^n(X_k-\overline X)^2$ instead of $\sum\limits_{k=1}^n(X_k-\overline X)$. Second, $\sum\limits_{k=1}^n(X_k-\overline X)^2$ is not a function of $\overline X$ alone. – Ѕᴀᴀᴅ Jun 18 '18 at 01:38
  • The $x_i$'s are your sample; they are not a different statistic. In the confidence interval, only the statistic $\bar{X}$ is used. – Rebellos Jun 18 '18 at 01:42
  • Well, $\overline X=\dfrac1n\sum\limits_{k=1}^nX_k$ and in theoretical derivation, samples are treated as random variables or in other words, statistics. – Ѕᴀᴀᴅ Jun 18 '18 at 01:43
  • If your perception is correct, then it could be claimed that $S$ is a function of samples only, not one of any statistics. – Ѕᴀᴀᴅ Jun 18 '18 at 01:45
  • Note that $\bar{X}$ is a different expression from a simple sample point such as those used in $S$. There is no other way to approach it (other than using, in one way or another, the CDF of the normal distribution, which has no elementary closed form). – Rebellos Jun 18 '18 at 07:11
  • The description of the question implies that the interval can be determined as long as $\overline x$ is known, but your $S$ cannot be determined knowing $\overline x$ only. – Ѕᴀᴀᴅ Jun 18 '18 at 08:35
  • @AlexFrancisco In my opinion, it can be determined, since you have a known random sample $X_1,\dots, X_n$ from the Normal Distribution. – Rebellos Jun 18 '18 at 08:42
  • It cannot be determined if only the value of the summary (sufficient) statistic $\overline x$ is known. For example, what's the interval if $\overline x = 2$? – Ѕᴀᴀᴅ Jun 18 '18 at 08:45
  • By the phrase "Let $X_1,\dots,X_n$ be a random sample from the normal distribution" I assume that we are working with a known but random sample (thus no bias). That's why I consider that we have information about these random $x_i$'s. Other than that, your point is correct as well, but it still runs into an undetermined, non-standard function, which causes issues, especially in applied settings. Overall I guess it's just a bad exercise. – Rebellos Jun 18 '18 at 11:31
  • Not a bad exercise. And your CI depends on both $\bar X$ and $\sum X_i^2$, as @Saad argues. – StubbornAtom Apr 13 '19 at 15:16
  • I don't think there should be a factor of $n$ in $$\exp\bigg\{-\frac{n}{2c\theta^2}\sum_{i=1}^n(x_i-\theta)^2\bigg\}$$ and the subsequent ones. – The Pointer Apr 22 '21 at 01:10

Note that $n\overline{X} = \sum_{i=1}^n X_i \sim N(n\theta, cn\theta^2)$, so that (taking $\theta > 0$; the case $\theta < 0$ is symmetric)$$ \frac{n\overline{X} - n\theta}{\sqrt{cn}\, \theta} \sim N(0, 1). $$ Denote by $\phi(x) = \dfrac{1}{\sqrt{2\pi}}\exp\left( -\dfrac{x^2}{2} \right)$ the standard normal density and by $\displaystyle \Phi(x) = \int_{-\infty}^x \phi(t) \,\mathrm{d}t$ its distribution function. For any $-\sqrt{\dfrac{n}{c}} < a < b$, because$$ a \leqslant \frac{n\overline{X} - n\theta}{\sqrt{cn}\, \theta} \leqslant b \Longleftrightarrow \frac{n\overline{X}}{b\sqrt{cn} + n} \leqslant \theta \leqslant \frac{n\overline{X}}{a\sqrt{cn} + n}, $$ to obtain a $(1 - α)$ confidence interval it suffices to require that$$ \Phi(b) - \Phi(a) = 1 - α. $$ In particular, to make an unbiased confidence interval, an additional requirement is$$ \left( a + \sqrt{\frac{n}{c}} \right)\phi(a) = \left( b + \sqrt{\frac{n}{c}} \right)\phi(b). $$
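As a concrete illustration: with the symmetric, equal-tailed choice $b = -a = z_{1-α/2}$ and $\theta > 0$, the interval becomes $\left[\dfrac{n\overline X}{n + z_{1-α/2}\sqrt{cn}},\ \dfrac{n\overline X}{n - z_{1-α/2}\sqrt{cn}}\right]$, provided $z_{1-α/2} < \sqrt{n/c}$, and it depends on the data only through $\overline X$. Below is a minimal Python sketch of this choice (not the unbiased one discussed above); the values of $n$, $c$ and $α$ are illustrative, and $\overline x = 2$ echoes the value asked about in the comments on the other answer.

```python
import numpy as np
from scipy import stats

def theta_interval(xbar, n, c, alpha=0.05):
    """Equal-tailed 100(1 - alpha)% interval for theta > 0, using xbar only,
    obtained by inverting the pivot n*(xbar - theta) / (sqrt(c*n) * theta) ~ N(0, 1)."""
    z = stats.norm.ppf(1 - alpha / 2)        # symmetric choice a = -z, b = z
    if z >= np.sqrt(n / c):
        raise ValueError("need z < sqrt(n / c) for a bounded interval")
    root = np.sqrt(c * n)
    return n * xbar / (n + z * root), n * xbar / (n - z * root)

# e.g. xbar = 2 (the value asked about in the comments), with illustrative n and c
print(theta_interval(xbar=2.0, n=30, c=0.5, alpha=0.05))
```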

Ѕᴀᴀᴅ
  • While correct, using an integral that has no closed form in terms of common functions makes little practical sense. In this way you could simply use the CDF of the normal distribution directly, fetching an interval for $\mathbb P(c_2 \leq Y \leq c_1)$. – Rebellos Jun 18 '18 at 07:09
  • Note that: $$\int_{-\infty}^x \exp\bigg\{ -\frac{t^2}{2} \bigg\}\,\mathrm{d}t = \sqrt{\frac{\pi}{2}} \bigg[ \operatorname{erf}\bigg(\frac{x}{\sqrt{2}}\bigg)+1\bigg]$$ – Rebellos Jun 18 '18 at 07:21
  • You have used the distribution of $\sum X_i$ instead of $\bar X$. – StubbornAtom Apr 13 '19 at 15:18