7

Consider positive random variables $X^1, X^2$ and $Y \sim \text{Beta}(p, 1)$, where $p = \beta_1 X^1 + \beta_2 X^2$. We have a random sample $\{X^1_i, X^2_i, Y_i\}$. Estimating $\beta_1, \beta_2$ is then not hard, e.g. using a GLM.

Now suppose that we do not observe $X^2$. Is it still possible to consistently estimate $\beta_1$ (i.e. using only a statistic based on $\{X^1_i, Y_i\}$)? If not, is there some assumption (normality, etc.) on $X^1, X^2$ under which we can consistently estimate $\beta_1$?

Albert Paradek

2 Answers

3

Update: I realized a problem with my previous post; here's an updated answer built around a general philosophy of "marginalizing out" unwanted covariates.

Introduction

Suppose we have a mean model \begin{align*} \mathbb{E}[Y|X^1, X^2] &= \beta_0 + \beta_1 X^1 + \beta_2 X^2 \tag{$\heartsuit$}\\ \mathbb{E}[Y|X^1, X^2] &= e^{\beta_0 + \beta_1 X^1 + \beta_2 X^2} \tag{$\diamondsuit$} \end{align*} Assuming $X^1 \perp \!\!\! \perp X^2$, we have from $(\heartsuit)$ the marginal model \begin{align*} \mathbb{E}[Y|X^1] = \mathbb{E}[\mathbb{E}[Y|X^1, X^2] \mid X^1] = \beta_0 + \beta_1 X^1 + \beta_2\mathbb{E}[X^2] = \beta_0^* + \beta_1^* X^1 \end{align*} with $(\beta_0^*, \beta_1^*) = (\beta_0 + \beta_2\mathbb{E}[X^2], \beta_1)$ (independence is what turns $\mathbb{E}[X^2|X^1]$ into the constant $\mathbb{E}[X^2]$), or from $(\diamondsuit)$ the marginal model \begin{align*} \mathbb{E}[Y|X^1] = \mathbb{E}[\mathbb{E}[Y|X^1, X^2] \mid X^1] = e^{\beta_0^* + \beta_1^* X^1} \end{align*} where $(\beta_0^*, \beta_1^*) = (\beta_0 + \log \mathbb{E}[e^{\beta_2 X^2}], \beta_1)$. In both cases, where $Y$ is assumed to follow a linear or log-linear model in the covariates $X^1, X^2$, the marginal coefficient $\beta_1^*$ is numerically equal to the coefficient $\beta_1$ from the full conditional model.
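As a quick numerical sanity check of the log-linear case $(\diamondsuit)$ (my addition, not part of the original argument; the covariate distributions and coefficient values are made up, and a Poisson outcome is used only because it has that conditional mean), a short simulation with independent covariates shows the marginal and conditional slopes agreeing:

```python
import numpy as np
import statsmodels.api as sm

# Illustrative check of the log-linear case: X1 and X2 independent,
# E[Y | X1, X2] = exp(0.2 + 0.5*X1 + 0.3*X2).
rng = np.random.default_rng(0)
n = 20000
x1 = rng.gamma(2.0, 1.0, size=n)
x2 = rng.gamma(2.0, 1.0, size=n)
y = rng.poisson(np.exp(0.2 + 0.5 * x1 + 0.3 * x2))

# Full model: regress on both covariates with a log link.
full = sm.GLM(y, sm.add_constant(np.column_stack([x1, x2])),
              family=sm.families.Poisson()).fit()
# Marginal model: drop X2 entirely; only the intercept absorbs it.
marginal = sm.GLM(y, sm.add_constant(x1), family=sm.families.Poisson()).fit()

print(full.params[1], marginal.params[1])  # both should be close to 0.5
```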

Consistently estimating a parameter Generalized estimating equations (GEEs) set up equations of the form \begin{align*} \sum_{k=1}^{K} \mathbf{D}_k^\intercal \mathbf{V}_k^{-1}(\mathbf{Y}_k - \boldsymbol{\mu}_k) = \mathbf{0} \end{align*} where $\mathbf{Y}_k$ is a vector of observations whose components may be correlated with one another, but $\mathbf{Y}_k \perp \!\!\! \perp \mathbf{Y}_{k'}$ (the vectors are independent of each other). Without diving too deep into the theory of GEEs, the relevance here is that, as long as the mean model is correctly specified, fitting a GEE will return consistent estimates of the desired coefficients. And a GLM is a special case of a GEE! What this implies, for example, is that if we have a model \begin{align*} Y_i = \beta_0 + \beta_1 X^1_i + \beta_2 X^2_i + \epsilon_i, \qquad \epsilon_i \sim F \end{align*} for any distribution $F$ with mean 0 and finite second moment (say, $\epsilon_i \sim \text{Exponential}(\frac{1}{2021}) - 2021$), then fitting as if these errors were normal would still yield consistent estimates of $(\beta_0, \beta_1, \beta_2)$! Of course, the standard errors for these estimators would be incorrect and would need to be adjusted with sandwich estimators, but the consistency is not affected.
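As an illustrative sketch of the "wrong distribution, right mean model" point (added here, not from the original answer), the following fits by ordinary least squares, i.e. as if the errors were normal, even though the errors are shifted exponentials; a smaller error scale than the $1/2021$ example is used so that a modest sample size already shows the effect:

```python
import numpy as np
import statsmodels.api as sm

# Sketch: errors are shifted exponential (mean 0, skewed), but we fit as if they
# were normal (OLS) and still get consistent coefficient estimates; only the
# standard errors need the sandwich correction. Coefficients 1.0, 0.7, 0.4 are
# illustrative values.
rng = np.random.default_rng(1)
n = 5000
x1 = rng.gamma(2.0, 1.0, size=n)
x2 = rng.gamma(2.0, 1.0, size=n)
eps = rng.exponential(2.0, size=n) - 2.0         # Exponential(1/2), shifted to mean 0
y = 1.0 + 0.7 * x1 + 0.4 * x2 + eps

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit(cov_type="HC1")           # sandwich (robust) standard errors
print(fit.params)                                # approximately (1.0, 0.7, 0.4)
```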

Application to current problem The mean of $\text{Beta}(p, 1)$ is $\frac{p}{p+1}$, which is not a tractable form for the marginalization trick because of the "non-separability" of $X^2$ from $X^1$. My idea was to search for a transformation that makes things more tractable, and one such transformation is $\widetilde{Y}_i = -\log(Y_i)$: since $-\log(Y) \sim \text{Exponential}(p)$ when $Y \sim \text{Beta}(p, 1)$, this provides the mean model \begin{align*} \mathbb{E}[\widetilde{Y}_i|X^1, X^2] = \frac{1}{p} = \frac{1}{\beta_0 + \beta_1 X^1 + \beta_2 X^2}. \end{align*} This still isn't quite there, but if you're willing to accept the functional form \begin{align*} p = e^{\beta_0 + \beta_1 X^1 + \beta_2 X^2} \end{align*} (which arguably makes more sense, since it guarantees $p > 0$ as required by the beta distribution), we have \begin{align*} \mathbb{E}[\widetilde{Y}_i|X^1, X^2] = e^{-\beta_0 - \beta_1 X^1 - \beta_2 X^2} \end{align*} and therefore the marginalization trick reveals \begin{align*} \mathbb{E}[\widetilde{Y}_i|X^1] = e^{\beta_0^* + \beta_1^* X^1 } \end{align*} where $(\beta_0^*, \beta_1^*) = (-\beta_0 + \log \mathbb{E}[e^{-\beta_2 X^2}], -\beta_1)$. Essentially, we are treating $X^2$ like the errors in a normal regression model, while noting that they may not have mean 0 and would therefore bias the intercept. That's not a problem, since we are interested in $\beta_1$!

Final procedure

  1. Transform $\widetilde{Y} = -\log(Y)$
  2. Regress $\widetilde{Y}$ on $X^1$ with a log link, using any working distribution you like (exponential/gamma, normal). The resulting coefficient is consistent for $-\beta_1$; a small simulation sketch follows below.
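A minimal simulation sketch of this two-step procedure (my addition; the covariate distributions and coefficient values are made up, and statsmodels' Gamma family with a log link is used as one convenient "working" distribution):

```python
import numpy as np
import statsmodels.api as sm

# Sketch of the procedure under the log-link parametrization
# p = exp(beta0 + beta1*X1 + beta2*X2); X2 is simulated but never shown to the fit.
rng = np.random.default_rng(2)
n = 20000
x1 = rng.gamma(2.0, 1.0, size=n)
x2 = rng.gamma(2.0, 1.0, size=n)            # unobserved covariate
p = np.exp(-0.5 + 0.7 * x1 + 0.4 * x2)
y = rng.beta(p, 1.0)

y_tilde = -np.log(y)                        # step 1: transform
fit = sm.GLM(y_tilde, sm.add_constant(x1),  # step 2: regress on X1 only, log link
             family=sm.families.Gamma(link=sm.families.links.Log())).fit()
print(fit.params[1])                        # consistent for -beta1, i.e. close to -0.7
# For inference, remember to use robust (sandwich) standard errors.
```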
Tom Chen
  • This is a very good answer, thanks. – Albert Paradek Dec 15 '21 at 10:23
  • What do you mean by "normal distribution would work as well"? I am very interested in what you mean by this. Is there some similar transformation where we would get something like $\widetilde{Y}\sim N(p,1)$? – Albert Paradek Dec 15 '21 at 10:25
  • Wait, isn't $$\mathbb{E}[\widetilde{Y}|X^1, X^2] = \frac{1}{ \beta_0 + \beta_1 X^1 + \beta_2 X^2}$$? – Albert Paradek Dec 15 '21 at 10:43
  • I also had a small doubt. Once you derive that $E[\tilde{Y}|X_1]$ is linear in $X_1$, how would you proceed? To run a formal regression you still need some hypothesis on the residuals, right? So in theory you would need to know the distribution of $\tilde{Y}|X_1$, and the latter would probably also depend on the parameters $\beta_0, \beta_1$ that you want to find, wouldn't it? – Thomas Dec 15 '21 at 10:43
  • (I also have the same doubt as @Albert Paradek, now that I read his comment, but I am curious to know how you would proceed if the expectation were linear.) – Thomas Dec 15 '21 at 11:03
  • Hey AlbertParadek and @Thomas, sorry for my sloppy answer before. I've updated with much more info, hope this helps – Tom Chen Dec 15 '21 at 17:20
  • Interesting. But do I understand correctly that even if in a GLM we model the errors with a wrong distribution, we are still consistent? (i.e., in the limit of large N the result will always be correct?) – Thomas Dec 15 '21 at 18:38
  • I would also clarify in the summary of the procedure in your answer that, as far as I understood, you changed the relation between X and p to a more reasonable one. – Thomas Dec 15 '21 at 19:53
  • By the way, do you have any suggestion for the numerical approach in my answer? – Thomas Dec 15 '21 at 20:37
  • I really like this, thanks @TomChen for your answer and time. Last thing that bothers me:

    You assumed that we parametrize $p=e^{\beta_0+\beta_1X_1+\beta_2X_2}$. That's reasonable from the info you had. But actually what I am dealing with is that $p\in [0,1]$ (this follows from its interpretation). The first thing I tried was super simple: what if we use a linear link and don't care about the interpretation? But maybe I am looking for some link like the logit. Do you have some idea how to proceed in that case? I can put it as a new question and give you like 200 points if you solve that :D

    – Albert Paradek Dec 20 '21 at 16:04
  • Hi @Thomas, let me answer each of your questions accordingly: modeling with wrong distributions in a GLM: as long as the mean model is correct, then yes, we get consistent estimates (but with wrong standard errors). For example, suppose $Y \sim \text{Poisson}(e^{\beta_0 + \beta_1 X})$ and $Y \sim \text{Exp}(e^{-\beta_0 - \beta_1 X})$, which both have $E[Y|X] = e^{\beta_0 + \beta_1 X}$. If the Poisson model were the true model, but I model with the exponential, I'd still get consistent estimates for $\beta_0, \beta_1$. – Tom Chen Dec 20 '21 at 17:27
  • @AlbertParadek By luck, a logit link extends naturally. With $\text{logit}(p) = \log(\frac{p}{1-p})$, the inverse link is $\text{expit}(x) = \frac{e^x}{1+e^x}$ and we have $p = \text{expit}(\beta_0 + \beta_1 X^1 + \beta_2 X^2)$. This means that $\mathbb{E}[\widetilde{Y}|X^1, X^2] = \frac{1}{p} = 1 + e^{-\beta_0 - \beta_1 X^1 - \beta_2 X^2}$ and once again $X^1, X^2$ are separable. We just need to define outcomes $Y^\dagger = \widetilde{Y} - 1$ and the previous procedure would return consistent estimates for $-\beta_1$. – Tom Chen Dec 20 '21 at 17:35
  • Hi @TomChen, thanks again. But unfortunately, it is still only a solution for "special cases". I am working with a few models, not sure which one to use, and it would be really helpful to also have the classical linear case AND a case with some link mapping into $[-1,1]$ (i.e. a modified expit such as $p=\frac{-1+e^{q}}{1+e^{q}}, q=\beta_0+\beta_1X_1+\beta_2X_2$, or something similar; we deal with covariance, btw). But this solution does not work for these cases. I opened a new question with this specifically, if you are interested to help a bit more :) – Albert Paradek Jan 08 '22 at 14:23
  • Oops, I meant $[1,2]$, not $[-1,1]$; of course not negative. The question is here: https://math.stackexchange.com/q/4351729/615430 – Albert Paradek Jan 08 '22 at 14:39
2

Maybe one can go numerical. In the presence of latent variables we have to maximize the marginal likelihood, which here reads:

$$\log L = \sum_i \log \left( \int p(x^1_i)\, p(x^2_i)\, (\beta_1 x^1_i + \beta_2 x^2_i)\, y_i^{\beta_1 x^1_i + \beta_2 x^2_i - 1} \, dx^2_i \right) \tag{1}$$

(using the $\text{Beta}(p, 1)$ density $p\, y^{p-1}$; the $p(x^1_i)$ factor does not involve the $\beta$'s and can be dropped from the maximization).
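For instance (my sketch, not part of the original answer), one can approximate the integral over $x^2_i$ in (1) by Monte Carlo under an assumed distribution for $X^2$ and maximize over $(\beta_1, \beta_2)$ with a generic optimizer; the distributions and coefficient values below are purely illustrative:

```python
import numpy as np
from scipy.optimize import minimize

# Sketch: maximize the marginal log-likelihood (1) by Monte Carlo integration over
# the unobserved x2, assuming its distribution p(x2) is known (Gamma(2,1) here).
rng = np.random.default_rng(3)
n, m = 400, 2000
x1 = rng.gamma(2.0, 1.0, size=n)
x2 = rng.gamma(2.0, 1.0, size=n)             # used only to simulate y, then discarded
y = rng.beta(0.7 * x1 + 0.4 * x2, 1.0)

x2_draws = rng.gamma(2.0, 1.0, size=(m, 1))  # Monte Carlo draws from the assumed p(x2)

def neg_marginal_loglik(beta):
    b1, b2 = beta
    p = b1 * x1 + b2 * x2_draws              # shape (m, n); p(x1) drops out (no betas)
    if np.any(p <= 0):
        return np.inf
    dens = np.mean(p * y ** (p - 1.0), axis=0)  # Beta(p, 1) density, averaged over x2
    return -np.sum(np.log(dens))

fit = minimize(neg_marginal_loglik, x0=[0.5, 0.5], method="Nelder-Mead")
print(fit.x)                                 # should be roughly (0.7, 0.4)
```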

One could try the EM algorithm and see if it simplifies things (but the nasty additive term mixing the parameters is not simplifying things...).

Another approach is to upgrade $x^2_i$, $\beta_1$ and $\beta_2$ all to random variables with prior distributions, and then either sample from the posterior or take a MAP estimate for all of them.

This can be done using PyMC3:

https://docs.pymc.io/en/v3/

https://discourse.pymc.io/t/maximum-likelihood-estimation-of-a-bayesian-model/967

I would try to modify the code from here (the first linear regression example) in two ways:

https://docs.pymc.io/en/v3/pymc-examples/examples/getting_started.html

  1. The distribution of $Y$ should follow your beta distribution;
  2. Upgrade $X^2$ to be a random variable with a prior distribution, rather than fixed observed data;

Having done this, you should be able to obtain MAP estimates for all variables as well as samples from the posterior. Each $x^2_i$ will then be treated as a parameter of the model, so you would be sampling a space whose dimension grows with the number of observations you have; therefore I am not sure how stable this approach would be. A minimal sketch of such a model is below.
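Here is one way this could look in PyMC3 (my sketch; the priors, the assumed distribution for the latent $X^2$, and the data-generating values are all placeholders chosen for illustration):

```python
import numpy as np
import pymc3 as pm

# Toy data: X1 observed, X2 latent (generated here only to simulate y).
rng = np.random.default_rng(4)
n = 300
x1_obs = rng.gamma(2.0, 1.0, size=n)
x2_hidden = rng.gamma(2.0, 1.0, size=n)
y_obs = rng.beta(0.7 * x1_obs + 0.4 * x2_hidden, 1.0)

with pm.Model() as model:
    beta1 = pm.HalfNormal("beta1", sigma=2.0)           # positive, so p stays > 0
    beta2 = pm.HalfNormal("beta2", sigma=2.0)
    x2 = pm.Gamma("x2", alpha=2.0, beta=1.0, shape=n)   # one latent X2 per observation
    p = beta1 * x1_obs + beta2 * x2                     # shape parameter of the Beta
    pm.Beta("y", alpha=p, beta=1.0, observed=y_obs)

    map_estimate = pm.find_MAP()                        # MAP for betas and all latent x2
    trace = pm.sample(1000, tune=1000, target_accept=0.9)

print(map_estimate["beta1"], map_estimate["beta2"])
```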

Thomas
  • This approach is cool in theory, but ultimately because the "parameter dimension" equals your "sample size dimension", you are left with an identifiability problem. Adding a prior helps stabilize it, but the resulting estimates won't be consistent because we constantly have our sample size dominated by the number of parameters :(. – Tom Chen Dec 20 '21 at 17:32
  • Perfect thanks a lot for the feedback! – Thomas Dec 20 '21 at 18:03
  • But now that I think about it, each time we have one latent variable per sample (e.g. Gaussian mixtures) we have one parameter per sample, yet this does not prevent us from applying these methods. What is different here? – Thomas Dec 20 '21 at 18:12
  • Sorry Thomas that I didn't comment on your solution; I must say it's a bit too advanced for me to understand, with a few expressions I have never encountered yet. I tried to look into them a bit but it's quite hard. I am sure it's an interesting solution though, thanks. – Albert Paradek Jan 08 '22 at 14:45
  • The other solution is much better :) – Thomas Jan 08 '22 at 15:58