
I would appreciate either one of the following, or both:

  • A source (if a book, with page numbers) where I can find this result proven
  • A proof of the result

The question at hand:

Let $X_{11}, \dots, X_{1n_1}$ be independent and identically distributed random variables with mean $\mu_1$ and variance $\sigma_1^2$, and let $X_{21}, \dots, X_{2n_2}$ be independent and identically distributed random variables with mean $\mu_2$ and variance $\sigma_2^2$. Assume $n_1 \neq n_2$ and $\sigma_1^2 \neq \sigma_2^2$.

Denote $\bar{X}_{1, n_1} = \dfrac{1}{n_1}\sum_{i=1}^{n_1}X_{1i}$ and $\bar{X}_{2, n_2} = \dfrac{1}{n_2}\sum_{i=1}^{n_2}X_{2i}$. Also, let $S_1^2 = \dfrac{1}{n_1 - 1}\sum_{i=1}^{n_1}(X_{1i} - \bar{X}_{1, n_1})^2$ and $S_2^2 = \dfrac{1}{n_2 - 1}\sum_{i=1}^{n_2}(X_{2i} - \bar{X}_{2, n_2})^2$.

As $n_1 \to \infty$ and $n_2 \to \infty$, does the statistic $$T = \dfrac{\bar{X}_{1, n_1} - \bar{X}_{2, n_2}}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}}$$ converge in distribution to a random variable; and if so, with what distribution?

Context. The test statistic $T$ is the one that arises in Welch's t-test. Conventional statistical wisdom (e.g., here, here) is that regardless of the population distributions of the $X_{1i}$ and $X_{2j}$ (i.e., even when they are not normal), the Central Limit Theorem (CLT) can be invoked to justify treating $T$ as approximately $\mathcal{N}(0, 1)$. I haven't seen a proof of this, and I am doubtful that the classical CLT applies.

My Efforts. I demonstrated that for a single population, with the obvious notational changes, $\dfrac{\bar{X} - \mu}{S/\sqrt{n}}$ converges in distribution to an $\mathcal{N}(0, 1)$ random variable. However, that result cannot be applied directly here.

For one thing, $\bar{X}_{1, n_1} - \bar{X}_{2, n_2}$ cannot be written as a single arithmetic mean of iid terms, as in the result above. For another, while $S_1^2 \to \sigma_1^2$ and $S_2^2 \to \sigma_2^2$ in probability, since $\sigma_1^2 \neq \sigma_2^2$ the two variances cannot be "factored out" as in my demonstration, which blocks a direct appeal to the CLT.
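For what it's worth, here is a minimal Monte Carlo sketch of the claim (Python with NumPy/SciPy; the populations, sample sizes, and replication count are arbitrary choices of mine, not part of the question), comparing the empirical null distribution of $T$ to $\mathcal{N}(0,1)$ for non-normal populations with unequal variances and sample sizes:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n1, n2, reps = 400, 150, 20_000  # unequal sizes, arbitrary choices

# Two non-normal populations with equal means but unequal variances:
# a centered exponential and a uniform on (-3, 3).
t_stats = np.empty(reps)
for r in range(reps):
    x1 = rng.exponential(1.0, n1) - 1.0   # mean 0, variance 1
    x2 = rng.uniform(-3.0, 3.0, n2)       # mean 0, variance 3
    se = np.sqrt(x1.var(ddof=1) / n1 + x2.var(ddof=1) / n2)
    t_stats[r] = (x1.mean() - x2.mean()) / se

# Kolmogorov-Smirnov distance to the standard normal CDF.
print(stats.kstest(t_stats, "norm"))
```

If the claimed asymptotic holds, the KS distance to $\mathcal{N}(0,1)$ should shrink as $n_1$ and $n_2$ grow.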

Clarinetist
  • https://stats.stackexchange.com/questions/451158/asymptotic-distribution-of-independent-two-sample-t-test This post gives a proof under the null $\mu_{1}=\mu_{2}$ and the additional assumption $n_{2}/n_{1}\to c$. – Q9y5 Jan 31 '22 at 01:30
  • @Q9y5 Thank you for the link. I've worked it out myself, but would like to clean up the notation a bit and add additional explanation in an answer. Thanks again. – Clarinetist Jan 31 '22 at 01:48
  • @Q9y5 Actually, there's one question I have. $\sqrt{cv_1 + v_2}$ is not the standard deviation of $W$, but rather the standard deviation of $W$ we obtain assuming $\dfrac{n_2}{n_1} \to c$. Does the CLT still apply in this case? – Clarinetist Jan 31 '22 at 02:00
  • @Q9y5 Actually, I'm doubting that the $N(0, 1)$ claim holds true now. Yes, you've subtracted a random variable's mean from it and divided the difference by the standard deviation, but the problem is that the resulting random variable cannot be represented as an arithmetic mean, so you can't appeal to the CLT. Unless I'm missing something else entirely here. – Clarinetist Jan 31 '22 at 02:11
  • Here I think we should use a triangular-array CLT and change the summation a bit. For example, in the case $n_{1}>n_{2}$, rewrite $W_{n} \equiv \frac{\sqrt{n_2}}{n_1} \sum_{i=1}^{n_1}X_{1i} - \frac{1}{\sqrt{n_2}}\sum_{i=1}^{n_2}X_{2i}=\frac{1}{n_{1}}\sum_{i=1}^{n_1}(X_{1i}^{\prime}-X_{2i}^{\prime})$ with $X_{1i}^{\prime}\equiv \sqrt{n_{2}}\,X_{1i}$, $X_{2i}^{\prime}\equiv \frac{n_{1}}{\sqrt{n_{2}}}X_{2i}1_{\{i\leq n_{2}\}}$; this does not affect independence. – Q9y5 Jan 31 '22 at 02:22
  • @Q9y5 I'm quite frankly not familiar with how to appropriately apply Lyapunov or Lindeberg, so if you would like to post an answer, I would appreciate it. I'll try to work through it myself regardless. – Clarinetist Jan 31 '22 at 02:23
  • @Clarinetist you should check Billingsley's book on weak convergence (this is a book on functional analysis despite its probabilistic appearance). You will essentially note that $Z_{n,m} = (\bar{X}_{1,n}, S_{1,n}^2, \bar{X}_{2,m}, S_{2,m}^2)$ converges as $n, m \to \infty$ to a random vector $Z = (X, \sigma_X^2, Y, \sigma_Y^2)$, and $T_{n,m}$ is a continuous function of $Z_{n,m}$, so the distributional limit of $T_{n,m}$ is the same continuous function of $Z$ (i.e. if $T_{n,m} = f(Z_{n,m})$ then $T = f(Z)$). – William M. Jan 31 '22 at 19:45

2 Answers


Additional assumptions: $\frac{n_{2}}{n_{1}}\to c$ and $n_{1}\wedge n_{2}\to \infty$.

Denote $n\equiv n_{1}\vee n_{2}$ and construct the following triangular array:

$$\begin{matrix}Y_{1,1}\\Y_{2,1}&Y_{2,2}\\Y_{3,1}&Y_{3,2}&Y_{3,3}\\ \cdots&\cdots&\cdots&\cdots\\ Y_{n,1}&Y_{n,2}&Y_{n,3}&\cdots&Y_{n,n}\\\cdots&\cdots&\cdots&\cdots&\cdots&\cdots\end{matrix}$$

with $Y_{n,i}\equiv \frac{\sqrt{n_{2}}}{n_{1}}X_{1,i}1_{\{i\leq n_{1}\}}-\frac{1}{\sqrt{n_{2}}}X_{2,i}1_{\{i\leq n_{2}\}}$, so that $\sum_{i=1}^{n}Y_{n,i}=\sqrt{n_{2}}\,(\bar{X}_{1,n_1}-\bar{X}_{2,n_2})$. Since $S_{1}^{2}\to_{p}\sigma_{1}^{2}$, $S_{2}^{2}\to_{p}\sigma_{2}^{2}$, and $n_{2}/n_{1}\to c$, Slutsky's theorem reduces the claim about $T$ to proving $$\frac{\sum_{i=1}^{n}Y_{n,i}}{\sqrt{\frac{n_{2}}{n_{1}}\sigma_{1}^{2}+\sigma_{2}^{2}}}\to_{d}\mathcal{N}(0,1) \quad\text{under }\mathbb{H}_{0}:\mu_{1}=\mu_{2}.$$

By construction the $Y_{n,i}$ are row-wise independent, and we have $$\mathbb{E}[Y_{n,i}]=\frac{\sqrt{n_{2}}}{n_{1}}\mu_{1}1_{\{i\leq n_{1}\}}-\frac{1}{\sqrt{n_{2}}}\mu_{2}1_{\{i\leq n_{2}\}},$$ and $$\mathrm{Var}(Y_{n,i})=\frac{n_{2}}{n_{1}^{2}}\sigma_{1}^{2}1_{\{i\leq n_{1}\}}+\frac{1}{n_{2}}\sigma_{2}^{2}1_{\{i\leq n_{2}\}}.$$ This gives $$\sum_{i=1}^{n}\mathbb{E}[Y_{n,i}]=\sqrt{n_{2}}\mu_{1}-\sqrt{n_{2}}\mu_{2}=0\quad \text{(under the null)},$$ and $$\mathrm{Var}\biggl(\sum_{i=1}^{n}Y_{n,i}\biggr)=\sum_{i=1}^{n}\mathrm{Var}(Y_{n,i})=\frac{n_{2}}{n_{1}}\sigma_{1}^{2}+\sigma_{2}^{2}.$$
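As a sanity check on this algebra, here is a throwaway NumPy sketch (my own; the sizes and populations are arbitrary) verifying that the row sum equals $\sqrt{n_{2}}\,(\bar{X}_{1,n_1}-\bar{X}_{2,n_2})$ exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2 = 50, 30              # so n = max(n1, n2) = 50
n = max(n1, n2)
x1 = rng.normal(0.0, 2.0, n1)
x2 = rng.normal(0.0, 1.0, n2)

# Y_{n,i} = sqrt(n2)/n1 * X_{1,i} 1{i<=n1} - 1/sqrt(n2) * X_{2,i} 1{i<=n2}
y = np.zeros(n)
y[:n1] += np.sqrt(n2) / n1 * x1
y[:n2] -= x2 / np.sqrt(n2)

lhs = y.sum()
rhs = np.sqrt(n2) * (x1.mean() - x2.mean())
print(np.isclose(lhs, rhs))  # True: the identity is exact
```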

The desired convergence is then guaranteed by a triangular-array CLT.

Lindeberg-Feller Theorem: Let $\{Y_{n,i}\}$ be a row-wise independent triangular array of random variables with $\mathbb{E}[Y_{n,i}]=0$, $\sigma_{n,i}^{2}\equiv\mathrm{Var}(Y_{n,i})$, and $\sigma_{n}^{2}\equiv\sum_{i=1}^{n}\sigma_{n,i}^{2}$. Let $Z_{n}\equiv\sum_{i=1}^{n}Y_{n,i}$; then $Z_{n}/\sigma_{n}\to_{d}\mathcal{N}(0,1)$ provided the Lindeberg condition holds: $$\frac{1}{\sigma_{n}^{2}}\sum_{i=1}^{n}\mathbb{E}\bigl[Y_{n,i}^{2}1_{\{|Y_{n,i}|>\varepsilon\sigma_{n}\}}\bigr]\to 0\quad \text{for every }\varepsilon>0.$$ (Here the theorem is applied to the centered array $Y_{n,i}-\mathbb{E}[Y_{n,i}]$; under the null the row sum of the means vanishes, so centering does not change $Z_{n}$.)

Note that $$ \begin{align*} Y_{n,i}^{2}&=\Bigl(\frac{\sqrt{n_{2}}}{n_{1}}X_{1,i}1_{\{i\leq n_{1}\}}-\frac{1}{\sqrt{n_{2}}}X_{2,i}1_{\{i\leq n_{2}\}}\Bigr)^{2}\\ &\leq 2\Bigl(\frac{n_{2}}{n_{1}^{2}}X_{1,i}^{2}1_{\{i\leq n_{1}\}}+\frac{1}{n_{2}}X_{2,i}^{2}1_{\{i\leq n_{2}\}}\Bigr). \end{align*}$$ Separating the two terms in the sum and applying the dominated convergence theorem (each truncated second moment $\mathbb{E}[X^{2}1_{\{X^{2}>a\}}]\to 0$ as $a\to\infty$, since $\mathbb{E}[X^{2}]<\infty$), the Lindeberg condition follows.
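To see the Lindeberg condition decay numerically, here is a rough Monte Carlo sketch (mine, not the answerer's; $\varepsilon$, $c$, and the populations are arbitrary assumptions). For $n_1 > n_2$ the row splits into $n_2$ terms containing both samples and $n_1 - n_2$ terms containing only the first:

```python
import numpy as np

rng = np.random.default_rng(2)
eps, reps, c = 0.1, 200_000, 0.5      # epsilon, MC draws, assumed n2/n1 -> c

for n1 in (50, 200, 800, 3200):
    n2 = int(c * n1)                  # so n1 > n2 and n = n1
    var_n = n2 / n1 * 1.0 + 3.0       # (n2/n1)*sigma1^2 + sigma2^2
    s_n = np.sqrt(var_n)

    x1 = rng.exponential(1.0, reps) - 1.0         # mean 0, variance 1
    x2 = rng.uniform(-3.0, 3.0, reps)             # mean 0, variance 3

    a = np.sqrt(n2) / n1 * x1 - x2 / np.sqrt(n2)  # Y_{n,i}, i <= n2
    b = np.sqrt(n2) / n1 * x1                     # Y_{n,i}, n2 < i <= n1

    # MC estimate of (1/sigma_n^2) * sum_i E[Y_{n,i}^2 1{|Y_{n,i}| > eps*sigma_n}]
    lindeberg = (n2 * np.mean(a**2 * (np.abs(a) > eps * s_n))
                 + (n1 - n2) * np.mean(b**2 * (np.abs(b) > eps * s_n))) / var_n
    print(n1, n2, lindeberg)
```

The printed values should shrink toward $0$ as $n_1$ grows, as the condition requires.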

Q9y5

To derive the asymptotic distribution, we use the same notation as in the post above. Write \begin{align*} T&=\frac{\bar{X}_{1,n_1}-\bar{X}_{2,n_2}}{\sqrt{\dfrac{S_{1,n_1}^2}{n_1}+ \dfrac{S_{2,n_2}^2}{n_2}}}\\ &= \sqrt{\frac{S_{1,n_1}^2/n_1}{S_{1,n_1}^2/n_1+S_{2,n_2}^2/n_2}} \frac{\bar{X}_{1,n_1}-\mu_1}{\sqrt{S_{1,n_1}^2/n_1}} \\ &\qquad - \sqrt{\frac{S_{2,n_2}^2/n_2}{S_{1,n_1}^2/n_1+S_{2,n_2}^2/n_2}} \frac{\bar{X}_{2,n_2}-\mu_2}{\sqrt{S_{2,n_2}^2/n_2}}+\mu\\ &=\alpha_{1,n_1}\bar{Y}_{1,n_1}+\alpha_{2,n_2}\bar{Y}_{2,n_2}+\mu\\ &=\boldsymbol{\alpha}^\top_n\cdot \overline{\boldsymbol{Y}}_n+\mu,\quad (n=(n_1,n_2)), \end{align*} where \begin{align*} \boldsymbol{\alpha}^\top_n &=(\alpha_{1,n_1},\alpha_{2,n_2}) =\Bigg(\sqrt{\frac{S_{1,n_1}^2/n_1}{S_{1,n_1}^2/n_1+S_{2,n_2}^2/n_2}}, -\sqrt{\frac{S_{2,n_2}^2/n_2}{S_{1,n_1}^2/n_1+S_{2,n_2}^2/n_2}} \Bigg),\\ \mu &=\frac{\mu_1-\mu_2}{\sqrt{S_{1,n_1}^2/n_1+S_{2,n_2}^2/n_2}}, \\ \overline{\boldsymbol{Y}}_n^\top&= \Big(\bar{Y}_{1,n_1} , \bar{Y}_{2,n_2}\Big) = \Bigg(\frac{\bar{X}_{1,n_1}-\mu_1}{\sqrt{S_{1,n_1}^2/n_1}} , \frac{\bar{X}_{2,n_2}-\mu_2}{\sqrt{S_{2,n_2}^2/n_2}} \Bigg). \end{align*}
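Before using this decomposition, one can confirm it is an exact algebraic identity with a quick NumPy check (my own sketch; the sizes, means, and populations are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n1, n2, mu1, mu2 = 40, 25, 0.3, -0.2     # arbitrary sizes and means
x1 = rng.normal(mu1, 2.0, n1)
x2 = rng.normal(mu2, 1.0, n2)

v1, v2 = x1.var(ddof=1) / n1, x2.var(ddof=1) / n2
T = (x1.mean() - x2.mean()) / np.sqrt(v1 + v2)

alpha = np.array([np.sqrt(v1 / (v1 + v2)), -np.sqrt(v2 / (v1 + v2))])
ybar  = np.array([(x1.mean() - mu1) / np.sqrt(v1),
                  (x2.mean() - mu2) / np.sqrt(v2)])
mu = (mu1 - mu2) / np.sqrt(v1 + v2)

print(np.isclose(T, alpha @ ybar + mu))  # True: the decomposition is exact
```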

The following facts are standard, as $n\to\infty$ (i.e., $n_1\wedge n_2\to\infty$): \begin{gather*} \frac{S^2_{1,n_1}}{\sigma_1^2}\stackrel{\text{a.s.}}{\longrightarrow}1,\qquad \frac{S^2_{2,n_2}}{\sigma_2^2}\stackrel{\text{a.s.}}{\longrightarrow}1,\\ \frac{S_{1,n_1}^2/n_1+S_{2,n_2}^2/n_2}{\sigma_{1}^2/n_1+\sigma_{2}^2/n_2} \stackrel{\text{a.s.}}{\longrightarrow}1, \tag{1}\\ \overline{\boldsymbol{Y}}_n \stackrel{\text{dist}}{\longrightarrow}N(\boldsymbol{0},\boldsymbol{I}_2).\tag{2} \end{gather*} Denote \begin{equation*} \boldsymbol{a}_n^\top=(a_{1,n_1},a_{2,n_2}) =\Bigg(\sqrt{\frac{\sigma_1^2/n_1}{\sigma_1^2/n_1+\sigma_2^2/n_2}}, -\sqrt{\frac{\sigma_2^2/n_2}{\sigma_1^2/n_1+\sigma_2^2/n_2}}\Bigg); \end{equation*} then $\|\boldsymbol{a}_n\|=1$, and by (1) \begin{gather*} \|\boldsymbol{\alpha}_n-\boldsymbol{a}_n\|\stackrel{\mathsf{P}}{\longrightarrow}0,\\ (\boldsymbol{\alpha}_n-\boldsymbol{a}_n)^\top\cdot \overline{\boldsymbol{Y}}_n \stackrel{\mathsf{P}}{\longrightarrow}0.\tag{3} \end{gather*} From (2), the sequence of distributions of $\{\overline{\boldsymbol{Y}}_n, n\ge 1\}$ is tight; hence, since $\|\boldsymbol{a}_n\|=1$, the sequence of distributions of $\{\boldsymbol{a}^\top_n \cdot\overline{\boldsymbol{Y}}_n,n\ge 1\}$ is tight too. We now show that the distributions of $\{\boldsymbol{a}^\top_n \cdot\overline{\boldsymbol{Y}}_n,n\ge 1\}$ have a unique limit point.
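One can also watch $\|\boldsymbol{\alpha}_n-\boldsymbol{a}_n\|$ shrink in a quick simulation (my sketch; variances and sizes arbitrary, one realization per size, so expect some noise since the convergence is in probability):

```python
import numpy as np

rng = np.random.default_rng(5)
sig1_sq, sig2_sq = 4.0, 1.0  # population variances (arbitrary)

for n1, n2 in ((50, 20), (500, 200), (5000, 2000)):
    x1 = rng.normal(0.0, np.sqrt(sig1_sq), n1)
    x2 = rng.normal(0.0, np.sqrt(sig2_sq), n2)
    v1, v2 = x1.var(ddof=1) / n1, x2.var(ddof=1) / n2   # sample versions
    u1, u2 = sig1_sq / n1, sig2_sq / n2                 # population versions
    alpha = np.array([np.sqrt(v1 / (v1 + v2)), -np.sqrt(v2 / (v1 + v2))])
    a = np.array([np.sqrt(u1 / (u1 + u2)), -np.sqrt(u2 / (u1 + u2))])
    print(n1, n2, np.linalg.norm(alpha - a))
```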

If $\{\boldsymbol{a}^\top_{n'}\cdot\overline{\boldsymbol{Y}}_{n'}\}$ is a subsequence converging in distribution, then without loss of generality we may suppose that $\boldsymbol{a}_{n'}\to \boldsymbol{a}$ with $\|\boldsymbol{a}\|=1$ (otherwise pass to a further convergent subsequence; the $\boldsymbol{a}_{n'}$ lie on the compact unit circle). Hence from (2) we have \begin{equation*} \boldsymbol{a}_{n'}^\top\cdot \overline{\boldsymbol{Y}}_{n'} \stackrel{\text{dist}}{\longrightarrow} \boldsymbol{a}^\top \boldsymbol{Z}\sim N(0,1),\qquad \boldsymbol{Z}\sim N(\boldsymbol{0},\boldsymbol{I}_2). \tag{4} \end{equation*} (4) means that $N(0,1)$ is the unique limit point of the distributions of $\{\boldsymbol{a}_n^\top \cdot\overline{\boldsymbol{Y}}_n,n\ge 1\}$, so $\boldsymbol{a}_n^\top\cdot\overline{\boldsymbol{Y}}_n\stackrel{\text{dist}}{\longrightarrow}N(0,1)$. Combining this with (3), under the null (where $\mu=0$),
\begin{equation*} T=\boldsymbol{\alpha}^\top_n\cdot \overline{\boldsymbol{Y}}_n =\boldsymbol{a}^\top_n\cdot \overline{\boldsymbol{Y}}_n +(\boldsymbol{\alpha}_n-\boldsymbol{a}_n)^\top\cdot \overline{\boldsymbol{Y}}_n \stackrel{\text{dist}}{\longrightarrow} N(0,1). \tag{5} \end{equation*}

In summary: if $\mu_1\ne\mu_2$, then, as $n\to\infty$, \begin{equation*} |T|\stackrel{\text{a.s.}}{\longrightarrow}+\infty, \end{equation*} while if $\mu_1=\mu_2$, then (5) holds as $n\to\infty$.
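A small simulation sketch (mine; all parameters arbitrary) illustrating both regimes, with $T$ roughly standard normal under $\mu_1=\mu_2$ and $|T|$ diverging otherwise:

```python
import numpy as np

rng = np.random.default_rng(4)

def welch_T(delta, n1, n2, reps=5_000):
    """Simulate Welch's T when the population mean difference is `delta`."""
    t = np.empty(reps)
    for r in range(reps):
        x1 = rng.exponential(1.0, n1) - 1.0 + delta  # mean delta, variance 1
        x2 = rng.uniform(-3.0, 3.0, n2)              # mean 0, variance 3
        t[r] = (x1.mean() - x2.mean()) / np.sqrt(
            x1.var(ddof=1) / n1 + x2.var(ddof=1) / n2)
    return t

for n1, n2 in ((100, 40), (1000, 400)):
    null, alt = welch_T(0.0, n1, n2), welch_T(0.5, n1, n2)
    print(f"n1={n1}, n2={n2}: sd(T | null)={null.std():.2f}, "
          f"mean |T| under the alternative={np.abs(alt).mean():.1f}")
```

The null standard deviation should stay near $1$ while the mean of $|T|$ under the alternative grows with the sample sizes.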

JGWang