
Consider a random variable that follows an exponential distribution. After binning (flooring), it becomes discrete and follows a geometric distribution. My question is: how can we recover the original exponential distribution by adding random noise to the binned values?

A common practice in my field is to add uniformly distributed random noise to the binned data. However, the math says this is wrong. Below are my calculations:

Suppose $$ P(X = k\Delta m) = p(1-p)^k = (1-e^{-\beta \Delta m})\, e^{-\beta \Delta m k} $$ and that the added noise $Y$ is uniform on $[0, \Delta m]$: $$ P(Y \leq z - k\Delta m) = \frac{z - k\Delta m}{\Delta m}, \qquad 0 \leq z - k\Delta m \leq \Delta m $$

So, for $ 0 \leq z - k\Delta m \leq \Delta m $ (i.e., only for values within a bin), the CDF of $Z = X + Y$ is: $$ P(z) = \sum_{k} (1-e^{-\beta \Delta m})\, e^{-\beta \Delta m k}\, \frac{z - k\Delta m}{\Delta m} $$

Clearly, this is not an exponential. Let's see what happens as the bin width decreases. Set $x = k\Delta m$. For $\Delta m \to 0$, we can approximate the sum with an integral:

$$ P(z) \approx \int_0^z (1-e^{-\beta \Delta m})\, e^{-\beta x}\, \frac{z - x}{\Delta m} \, dx $$

Moreover, for $\Delta m \to 0$, $1-e^{-\beta \Delta m} \approx \beta \Delta m$. Therefore:

$$ P(z) \approx \beta \int_0^z e^{-\beta x}(z - x) \, dx $$

Integrating by parts, I get: $$ P(z) \approx z - \frac{1 - e^{-\beta z}}{\beta} $$

So, I do not get an exponential distribution even for very small bins. Are the calculations correct?

If yes, how should the noise be distributed to recover the exponential distribution?
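
To make this concrete, here is a quick simulation sketch in Mathematica (with assumed values $\beta = 1$ and $\Delta m = 1/2$, not from my actual data):

β = 1; Δm = 1/2; n = 10^5;
x = RandomVariate[ExponentialDistribution[β], n];  (* original sample *)
z = Δm Floor[x/Δm] + RandomVariate[UniformDistribution[{0, Δm}], n];  (* bin, then add uniform noise *)
Histogram[{x, z}, Automatic, "PDF"]  (* the density of z is piecewise constant, not exponential *)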

Roland

3 Answers


Assume the original distribution had the density $f_Y(y)=\lambda e^{-\lambda y}$, $y\ge 0$, with the CDF $F_Y(y)=1-e^{-\lambda y}$. Let $X=\lfloor Y\rfloor$; then $p=\mathbb P(X=0)=\mathbb P(0\le Y<1)=1-e^{-\lambda}$ and $X\in\{0,1,2,\dots\}$ is Geometric($p$).

Now note that for $k\le y<k+1$ $$ f_{Y\mid X=k}(y)=f_{Y\mid k\le Y<k+1}(y) =\frac{\lambda e^{-\lambda y}}{\mathbb P(k\le Y<k+1)} =\frac{\lambda e^{-\lambda y}}{e^{-\lambda k}-e^{-\lambda (k+1)}} =\frac{\lambda e^{-\lambda y}}{(1-p)^k p} $$ which means that given $X=k$, to get $Y$ you should add a random variable $W$ distributed on $[0,1]$ with the density $$ \frac{\lambda e^{-\lambda (k+w)}}{(1-p)^k p} = \frac{\lambda e^{-\lambda w}}{p}, \quad w\in [0,1] $$ where $\lambda=-\log(1-p)$. Since $e^{-\lambda k}=(1-p)^k$ cancels, the density of $W$ does not depend on $k$: the noise is simply an Exponential($\lambda$) variable truncated to $[0,1]$, and the same noise distribution works for every bin.
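
A minimal sketch of this in Mathematica (unit bins; $\lambda$ assumed known for the illustration, though in practice you would estimate it, e.g. $\hat\lambda=-\log(1-\hat p)$ from the binned data). The conditional density above is exactly an Exponential($\lambda$) density truncated to $[0,1]$:

λ = 1/2; n = 10^5;
x = Floor[RandomVariate[ExponentialDistribution[λ], n]];  (* binned data *)
w = RandomVariate[TruncatedDistribution[{0, 1}, ExponentialDistribution[λ]], n];
y = x + w;  (* reconstructed sample, again Exponential(λ) *)
KolmogorovSmirnovTest[y, ExponentialDistribution[λ]]  (* p-value; should not reject *)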

van der Wolf

You start with a random variable $X$ following an exponential distribution with rate parameter $\beta$, so its PDF is:

$$f_X(x)=\beta e^{-\beta x},\qquad x\ge0\tag1$$

And its CDF is:

$$\mathbb{P}\left(X\le x\right)=1-e^{-\beta x},\qquad x\ge0\tag2$$

When $X$ is binned by flooring, i.e., assigned to the lower boundary of bins of width $\Delta m$, you define a discrete random variable. Let's call this binned variable $W=\Delta m\left\lfloor\frac{X}{\Delta m}\right\rfloor$. The value $W=k\Delta m$ occurs if $X$ falls in the interval $\left[k\Delta m,(k+1)\Delta m\right)$, where $k\in\mathbb{N}$. The probability is:

$$\mathbb{P}\left(W=k\Delta m\right)=\mathbb{P}\left(k\Delta m\le X<(k+1)\Delta m\right)=\int\limits_{k\Delta m}^{(k+1)\Delta m}\beta e^{-\beta x}\,\text{d}x\tag3$$

Evaluate the integral and find:

$$\mathbb{P}\left(W=k\Delta m\right)=e^{-\beta k\Delta m}\left(1-e^{-\beta\Delta m}\right)\tag4$$

This matches your expression, though note that since $X$ is continuous, $\mathbb{P}\left(X=k\Delta m\right)=0$; you likely meant $\mathbb{P}\left(W=k\Delta m\right)$. This is a geometric distribution with success probability $p=1-e^{-\beta\Delta m}$, confirming your statement.
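
As a quick numerical check of $(4)$ in Mathematica (a sketch with assumed values $\beta=1$, $\Delta m=1/2$, $k=3$):

(* Check of (4) with assumed β = 1, Δm = 1/2, k = 3 *)
β = 1; Δm = 1/2; k = 3;
Probability[k Δm <= x < (k + 1) Δm, x \[Distributed] ExponentialDistribution[β]] ==
    Exp[-β k Δm] (1 - Exp[-β Δm]) // Simplify
(* True *)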

To reconstruct an approximation of $X$, you add noise $Y\sim\text{Uniform}\left[0,\Delta m\right)$ to $W$, defining $Z=W+Y$. For a given $W=k\Delta m$, $Z$ ranges from $k\Delta m$ to $(k+1)\Delta m$, and $Y$'s density is $\frac{1}{\Delta m}$ over $\left[0,\Delta m\right)$. Since $W$ is discrete and $Y$ is continuous, $Z$ is a continuous random variable, and we need its distribution.

The CDF of $Z$ is:

$$\mathbb{P}\left(Z\le z\right)=\mathbb{P}\left(W+Y\le z\right)=\sum_{k\ge0}\mathbb{P}\left(W=k\Delta m\right)\mathbb{P}\left(Y\le z-k\Delta m\mid W=k\Delta m\right)\tag5$$

where the sum is over all $k$ contributing non-zero probability. For $Y\sim\text{Uniform}\left[0,\Delta m\right)$:

  • If $z-k\Delta m<0$ (i.e. $k\Delta m>z$), then $\mathbb{P}\left(Y\le z-k\Delta m\right)=0$;
  • If $z-k\Delta m\ge\Delta m$ (i.e. $k\Delta m\le z-\Delta m$), then $\mathbb{P}\left(Y\le z-k\Delta m\right)=1$;
  • If $0\le z-k\Delta m<\Delta m$ (i.e. $z-\Delta m<k\Delta m\le z$), then $\mathbb{P}\left(Y\le z-k\Delta m\right)=\frac{z-k\Delta m}{\Delta m}$.

Suppose $z\in\left[n\Delta m,(n+1)\Delta m\right)$, where $n=\left\lfloor\frac{z}{\Delta m}\right\rfloor$:

  • For $k=0,1,\dots,n-1$: $k\Delta m\le(n-1)\Delta m<n\Delta m\le z$ and $z-k\Delta m\ge z-(n-1)\Delta m\ge\Delta m$, so $\mathbb{P}\left(Y\le z-k\Delta m\right)=1$;
  • For $k=n$: $z-n\Delta m\in\left[0,\Delta m\right)$, so $\mathbb{P}\left(Y\le z-n\Delta m\right)=\frac{z-n\Delta m}{\Delta m}$;
  • For $k\ge n+1$: $k\Delta m\ge(n+1)\Delta m>z$, so $z-k\Delta m<0$ and $\mathbb{P}\left(Y\le z-k\Delta m\right)=0$.

Thus:

$$\mathbb{P}\left(Z\le z\right)=\sum_{k=0}^{n-1}\mathbb{P}\left(W=k\Delta m\right)+\mathbb{P}\left(W=n\Delta m\right)\cdot\frac{z-n\Delta m}{\Delta m}\tag6$$

Substitute $(4)$ and use the geometric series to write:

$$\mathbb{P}\left(Z\le z\right)=1-e^{-\beta n\Delta m}+\left(1-e^{-\beta\Delta m}\right)e^{-\beta n\Delta m}\cdot\frac{z-n\Delta m}{\Delta m}\tag7$$

with $z\in\left[n\Delta m,(n+1)\Delta m\right)$. This CDF agrees with $1-e^{-\beta z}$ at the bin edges $z=n\Delta m$ but is linear in $z$ within each bin, so $Z$ is not exponentially distributed: adding uniform noise does not recover the exponential distribution, which confirms your conclusion.
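
Plotting $(7)$ against the exponential CDF makes the discrepancy visible; a minimal Mathematica sketch with assumed values $\beta=1$, $\Delta m=1/2$:

(* Sketch with assumed β = 1, Δm = 1/2: the CDF (7) of Z = W + Y is
   piecewise linear, visibly different from the exponential CDF *)
β = 1; Δm = 1/2;
cdfZ[z_] := With[{n = Floor[z/Δm]},
   1 - Exp[-β n Δm] + (1 - Exp[-β Δm]) Exp[-β n Δm] (z - n Δm)/Δm];
Plot[{cdfZ[z], 1 - Exp[-β z]}, {z, 0, 4}]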

Jan Eerland

Adding noise seems crazy to me. "Wrong" is not quite the right word, but adding unrelated noise is not necessary and likely only increases the uncertainty in the estimation process. (But I believe you when you say it is a common practice in your field. Also, if you don't have to bin: Don't!)

If you know that the underlying distribution of some binned data (however you bin it) is an exponential distribution, then the standard approach is to get a maximum likelihood estimate. (Note there are also Bayesian approaches you might consider.)

Such an approach also gets you an estimate of the precision of the estimate.

If one has $k$ frequency counts $f_1, f_2,\ldots,f_k$ and $k+1$ corresponding histogram bin borders $b_0<b_1<\cdots<b_k$, then the log of the likelihood is given by

$$\log L=\sum_{i=1}^k f_i \log(F(b_i)-F(b_{i-1}))$$

where $F(x)$ is the cumulative distribution function for the distribution from which the sample is taken. In this case $F(x)=1-e^{-\lambda x}$ when $x\geq0$ and $F(x)=0$ otherwise. We have

$$\log L=\sum_{i=1}^k f_i\log(e^{-\lambda b_{i-1}}-e^{-\lambda b_i})$$

We choose for the estimator of $\lambda$ the value that maximizes the likelihood. Most of the time this requires an iterative procedure, but for an exponential distribution there is a closed-form solution when the bin borders are consecutive non-negative integers $0, 1, 2, 3,\ldots$. In that case the maximum likelihood estimator is

$$\log \left(\sum _{i=1}^k i f_i\right)-\log \left(\sum _{i=2}^k (i-1) f_i\right)$$

Here's how to do it using Mathematica.

(* Bin boundaries and frequencies *)
SeedRandom[12345];
h = HistogramList[RandomVariate[ExponentialDistribution[1/2], 100]]
(* {{0,1,2,3,4,5,6,7,8},{35,31,14,6,6,5,2,1}} *)

(* Log of the likelihood *)
logL = Sum[h[[2, i]] Log[CDF[ExponentialDistribution[λ], h[[1, i + 1]]] - CDF[ExponentialDistribution[λ], h[[1, i]]]], {i, 1, Length[h[[2]]]}]

$$\log \left(e^{-7 \lambda }-e^{-8 \lambda }\right)+2 \log \left(e^{-6 \lambda }-e^{-7 \lambda }\right)+5 \log \left(e^{-5 \lambda }-e^{-6 \lambda }\right)+6 \log \left(e^{-4 \lambda }-e^{-5 \lambda }\right)+6 \log \left(e^{-3 \lambda }-e^{-4 \lambda }\right)+14 \log \left(e^{-2 \lambda }-e^{-3 \lambda }\right)+35 \log \left(1-e^{-\lambda }\right)+31 \log \left(e^{-\lambda }-e^{-2 \lambda }\right)$$

(* Initial guess at maximum likelihood estimate: reciprocal of the mean *)
λ0 = 1/((h[[1, 1 ;; Length[h[[1]]] - 1]] + h[[1, 2 ;;]]) . h[[2]]/(2 Total[h[[2]]]))
(* 20/39 *)

(* Find maximum likelihood estimate *)
mle = FindMaximum[{logL, λ > 0}, {{λ, λ0}}]
(* {-165.66485033366567, {λ -> 0.524524468750641}} *)

(* Approximate standard error of the maximum likelihood estimator *)
stdErr = Sqrt[-1/D[logL, {λ, 2}]] /. mle[[2]]
(* 0.05305581097254485 *)

Because the bin borders are consecutive non-negative integers starting at 0, we can apply the closed-form solution:

Log[Sum[i h[[2, i]], {i, 1, 8}]] - Log[Sum[(i - 1) h[[2, i]], {i, 2, 8}]] // N
(* 0.524524 *)

There's no need to involve a geometric distribution.

JimB