
The term 'divergence' means a function $D$ which takes two probability distributions $g,f$ as input and outputs a non-negative real number $D(g,f)$. I have learnt that inference based on minimizing the following divergence is robust against data points which are not compatible with the assumed model (called outliers), for some specific range of $\alpha \in \mathbb{R}$. $$D_{\alpha}(g,f) = \frac{1}{\alpha-1}\log \int g^{\alpha}f^{1-\alpha}~dx,$$ where $g$ stands for the data-driven density and $f$ stands for the model density (see, for example, here). This $D_\alpha$ goes by several names in the literature, viz. Rényi divergence, power divergence, $\alpha$-divergence, etc. Can someone give a justification of how this method is robust?
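To fix the notation concretely, here is a minimal numerical sketch (my own addition, not from any of the cited sources), assuming a finite support so that the integral becomes a sum; `renyi_divergence` is just an illustrative name:

```python
import numpy as np

def renyi_divergence(g, f, alpha):
    """D_alpha(g, f) = 1/(alpha - 1) * log sum_x g(x)^alpha * f(x)^(1 - alpha),
    for discrete densities g, f on a common finite support (alpha != 1)."""
    g, f = np.asarray(g, dtype=float), np.asarray(f, dtype=float)
    return np.log(np.sum(g**alpha * f**(1.0 - alpha))) / (alpha - 1.0)

g = np.array([0.5, 0.3, 0.2])   # "data driven" density
f = np.array([0.4, 0.4, 0.2])   # model density
print(renyi_divergence(g, f, alpha=2.0))   # non-negative, and 0 iff g == f
```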

Update: (as suggested by @Bruce Trumbo)

Some Preliminaries:

Suppose that $x_1,\dots,x_n$ are i.i.d. samples drawn according to a particular member of a parametric family of probability distributions $\mathcal{F} = \{f_\theta\}$ (the model). (Let us assume, for simplicity, that the $f_{\theta}$'s have a common support set $\mathbb{X}$ which is finite, and that $\mathbb{X}$ is also the underlying sample space.)

Our objective is to choose a special member $f_{\theta^*}$ of $\mathcal{F}$ which "best" explains the observed samples.

The Maximum Likelihood Estimate (MLE) is a widely used inference method which asks us to choose the $f_\theta$ for which $\prod_{i=1}^n f_\theta(x_i)$ is maximum.

Let $\hat{f}$ be the empirical measure of the $x_i$'s. Then observe that \begin{eqnarray} \frac{\prod_{i=1}^n f_\theta(x_i)}{\prod_{i=1}^n \hat{f}(x_i)} & = & \prod_{x\in\mathbb{X}} \left(\frac{f_\theta(x)}{\hat{f}(x)}\right)^{n\hat{f}(x)}\\ & = &\exp\{-nD(\hat{f}\|f_\theta)\}, \end{eqnarray} where $D(\hat{f}\|f_\theta)=\sum_x \hat{f}(x)\log(\hat{f}(x)/f_\theta(x))$ is the well-known Kullback-Leibler divergence.
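To make the identity concrete, here is a small numerical check (my own sketch, not part of the original post), on a finite support with an arbitrary model density:

```python
import numpy as np

rng = np.random.default_rng(0)
support = np.arange(4)
f_theta = np.array([0.1, 0.2, 0.3, 0.4])        # model density on {0, 1, 2, 3}
x = rng.choice(support, size=50, p=[0.25] * 4)  # i.i.d. samples (from some other law)

n = len(x)
f_hat = np.bincount(x, minlength=4) / n         # empirical measure of the x_i's

lhs = np.prod(f_theta[x]) / np.prod(f_hat[x])   # the likelihood ratio on the left-hand side
mask = f_hat > 0
kl = np.sum(f_hat[mask] * np.log(f_hat[mask] / f_theta[mask]))
print(lhs, np.exp(-n * kl))                     # the two numbers agree up to rounding
```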

Hence, the MLE is a minimizer of $D(\hat{f}\|f_\theta)$ over $\theta$. Setting $(\partial/\partial\theta) D(\hat{f}\|f_\theta) = 0$ for the minimization implies $$\sum_{x\in \mathbb{X}} \hat{f}(x) \frac{\partial}{\partial\theta}\log f_\theta(x) = 0.$$ That is, $$\sum_{i=1}^n \frac{\partial}{\partial\theta}\log f_\theta(x_i) = 0,$$ which is called the score equation. Thus, one needs to solve the score equation to find the MLE.
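As a quick sanity check (my own illustrative sketch, not from the post): for the Gaussian location model $f_\theta = N(\theta,1)$ the score is $\frac{\partial}{\partial\theta}\log f_\theta(x) = x-\theta$, so the root of the score equation is just the sample mean.

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.0, size=100)

# For f_theta = N(theta, 1): d/dtheta log f_theta(x) = x - theta,
# so the score equation is sum_i (x_i - theta) = 0.
def score(theta):
    return np.sum(x - theta)

theta_mle = brentq(score, -10.0, 10.0)   # root of the score equation
print(theta_mle, x.mean())               # both equal the sample mean
```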

However, the MLE is not robust when a few of the $x_i$ are outliers.

Inference based on minimizing the power divergence is known to be robust against outliers. Note that the power divergence $D_\alpha(\cdot\|\cdot)\to D(\cdot\|\cdot)$ as $\alpha\to 1$.
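A quick numerical illustration of this limit (my own sketch, again using the finite-support version of $D_\alpha$):

```python
import numpy as np

g = np.array([0.5, 0.3, 0.2])
f = np.array([0.4, 0.4, 0.2])

kl = np.sum(g * np.log(g / f))                               # Kullback-Leibler divergence
for alpha in [2.0, 1.5, 1.1, 1.01, 1.001]:
    d_alpha = np.log(np.sum(g**alpha * f**(1 - alpha))) / (alpha - 1)
    print(alpha, d_alpha)                                    # approaches kl as alpha -> 1
print("KL:", kl)
```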

I roughly remember the "improved" (or generalized) score equation corresponding to the power divergence to be something like $$\sum_{i=1}^n f_{\theta}(x_i)^c\frac{\partial}{\partial\theta}\log f_\theta(x_i) = 0,$$ for some $c>0$, possibly a function of $\alpha$.
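A hedged sketch of what such an equation does in practice (my own construction, not the derivation asked for): for the Gaussian location model $f_\theta = N(\theta,1)$, the weighted score equation can be solved numerically; $c=0.5$ and the contamination below are arbitrary illustrative choices.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(0.0, 1.0, 95), np.full(5, 10.0)])  # 5 gross outliers at +10

def weighted_score(theta, c):
    # sum_i f_theta(x_i)^c * d/dtheta log f_theta(x_i), with f_theta = N(theta, 1)
    return np.sum(norm.pdf(x, loc=theta, scale=1.0) ** c * (x - theta))

theta_mle = x.mean()                                        # root for c = 0 (plain score equation)
theta_rob = brentq(lambda t: weighted_score(t, c=0.5), -3.0, 3.0)
print(theta_mle, theta_rob)  # MLE is dragged toward the outliers; the weighted root stays near 0
```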

What I would like to know is whether this generalized score equation can be derived from the power divergence in the same way that the score equation above is derived from the Kullback-Leibler divergence in the case of the MLE.

PS: I asked a question along the same lines on stats.stackexchange, which did not get much attention. The link is here.

Ashok
  • Might help if you defined terms and gave some context. – BruceET Sep 01 '15 at 20:44
  • @BruceTrumbo: Thank you for your suggestion. I have now edited the question and elaborated. – Ashok Sep 05 '15 at 07:50
  • Looks like you got your answer on Stats.SE? –  Sep 09 '15 at 10:32
  • @Bey: I am not fully convinced by that response. However, I accepted that answer as it at least gave some justification and started some discussion. I am keeping it here to see if someone else gives a better explanation. If it is a violation of etiquette, I can remove this post. – Ashok Sep 10 '15 at 06:36

1 Answer


I'm late here; still, I will try to give an intuitive answer to your question:

Can someone give a justification of how this method is robust?

The score equation you gave is $$\sum_{i=1}^n f_{\theta}(x_i)^c\frac{\partial}{\partial\theta}\log f_\theta(x_i) = 0.$$ Suppose the true value of the parameter is $\theta_0$. Also note that $c$ is clearly $0$ for the MLE.

Assume there are some outliers that pull the root of the equation towards some $\theta_1$, since the MLE is not 'robust enough.' For instance, if an outlier lies far out on the $+\infty$ side, the estimate is dragged in that direction, so $\theta_1$ tends to be pulled towards $+\infty$ by this outlier.

Now, for the new score equation, think of $f_{\theta}(x_i)^c$ as a weight attached to each term of the original score equation. Suppose you evaluate it at the true value $\theta_0$. You can see that these weights are small for the outliers, so the 'bad' points contribute much less to the new score equation. Consequently, the final estimate is dragged less by the outliers. That is, roughly, the intuition for why it works.
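To put rough numbers on this intuition (my own sketch, again with an $N(\theta,1)$ model and an arbitrary $c=0.5$), compare the weight a typical point gets with the weight an outlier gets when the score is evaluated at $\theta_0$:

```python
from scipy.stats import norm

theta0, c = 0.0, 0.5
x_bulk, x_outlier = 0.5, 8.0                       # a typical observation vs a gross outlier

w_bulk = norm.pdf(x_bulk, loc=theta0, scale=1.0) ** c
w_outlier = norm.pdf(x_outlier, loc=theta0, scale=1.0) ** c
print(w_bulk, w_outlier)   # roughly 0.6 vs ~1e-7: the outlier barely enters the score equation
```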

To see more, you can look at the properties of M-estimators, commonly used in robust statistics, especially the influence function.

I don't remember whether the derivations are there or not, but you can look at the book Statistical Inference: The Minimum Distance Approach by Basu, Shioya, and Park.

  • Sorry for the late response. Yes, we now understand the answer a lot more clearly. You may find it in our article https://arxiv.org/pdf/1905.01434.pdf (pages 2 and 3). P.S.: You may give us your comments on the article in case you find it interesting. – Ashok Sep 02 '21 at 07:28