The Gaussian density $\mu(dx)=e^{-x^2/2}\,dx$ is fundamental in probability theory. Does anyone have a (non-computational) heuristic for why this function should be special? (By non-computational, I mean without using combinatorial approximations and Stirling's asymptotics.)
- do you mean a derivation rather than why it is "special"? – Chinny84 Jun 22 '15 at 20:57
- @pre-kidney Have you found an answer to this question? If so, I'd really appreciate some guidance: https://math.stackexchange.com/questions/2766802/intuition-for-n-mu-sigma2-in-terms-of-its-infinite-expansion or https://math.stackexchange.com/questions/2823278/intuition-for-the-normal-distribution – jaslibra Aug 09 '18 at 14:04
- @jaslibra since I asked this question, I gained a deeper appreciation for how closely the family of Gaussian measures is associated with inner product spaces. This manifests in various ways, for example the characterization of the Gaussian measure via the property that iid sequences are rotationally invariant. If you are looking for more along these lines, you may be interested in the book "Gaussian Hilbert Spaces" by Svante Janson. – pre-kidney Aug 16 '18 at 06:10
- @pre-kidney was this enlightening toward some natural process which leads to a derivation of the measure? – jaslibra Aug 16 '18 at 19:13
- @pre-kidney would you like to give an answer to the question for the 200 point bounty? https://math.stackexchange.com/questions/2766802/intuition-for-n-mu-sigma2-in-terms-of-its-infinite-expansion – jaslibra Aug 16 '18 at 19:15
- I might take a look – pre-kidney Aug 17 '18 at 03:03
2 Answers
The Gaussian can be viewed as the "best guess" of a distribution when all we know about it is that it is a distribution with a given mean and variance.
For instance, suppose I have a deck of 52 cards, and I tell you to pick a card "at random". If you had no prior knowledge as to how I would choose my card, what probability of selection would you assign to any given card? I'd say $\mathbb{P}(\text{any card}) = \frac{1}{52}$ is a reasonable guess. This is an example of a "maximum entropy" distribution on the discrete set $\{1,...,52\}$. Mathematically, the solution to the optimisation problem $$\begin{cases} \text{maximise} & \left\{-\sum_{i=1}^{52} p_i \log p_i\right\} \\ \text{subject to}& \sum_{i=1}^{52} p_i = 1 \\ & p_i \geq 0\end{cases}$$ is $p_i = 1/52$, as checked numerically below.
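A quick numerical sanity check in Python (a minimal sketch; the random-perturbation test is illustrative, not a proof):

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy(p):
    """Discrete Shannon entropy -sum p_i log p_i, with 0 log 0 := 0."""
    p = p[p > 0]
    return -(p * np.log(p)).sum()

uniform = np.full(52, 1 / 52)
print(entropy(uniform))  # log(52) ~ 3.951, the maximum

# Random alternative distributions on {1, ..., 52} never beat the uniform one.
for _ in range(5):
    q = rng.dirichlet(np.ones(52))  # a random point on the probability simplex
    print(entropy(q) <= entropy(uniform))  # True every time
```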
Next, suppose I tell you to pick a number "at random" from the interval $[0,1]$. Having no prior knowledge of my predispositions, you might assign equal likelihood to each number, giving a uniform distribution. Here you are solving the optimisation problem $$\begin{cases} \text{maximise} & \left\{ -\int_0^1 f(x) \ \log f(x)\ dx\right\} \\ \text{subject to} & \int_0^1 f(x)\ dx = 1 \\ & f \text{ continuous and } f \geq 0.\end{cases}$$
Now suppose I tell you to pick a number "at random" from $\mathbb{R}$. I want your selection to have a mean of $0$ and a variance of $1$. What is the distribution of the number selected? The analogous "maximum entropy" distribution is the Gaussian with density $\frac{1}{\sqrt{2\pi}}\exp(-x^2/2)$. Here, you are solving the optimisation problem $$\begin{cases} \text{maximise} & \left\{ -\int_\mathbb{R} f(x) \ \log f(x)\ dx\right\} \\ \text{subject to} & \int_\mathbb{R} f(x)\ dx = 1 \\ & \int_\mathbb{R} x \;f(x)\ dx = 0 \\ & \int_\mathbb{R} x^2 \;f(x)\ dx = 1 \\ & f \text{ continuous and } f \geq 0.\end{cases}$$
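To see the maximum-entropy property concretely, here is a small Python sketch that compares the differential entropy of the standard Gaussian with two other mean-$0$, variance-$1$ densities (the comparison densities are my illustrative choices, not from the answer above):

```python
import numpy as np
from scipy.integrate import quad

def diff_entropy(f, lo, hi):
    """Differential entropy -int f log f over [lo, hi]."""
    return quad(lambda x: -f(x) * np.log(f(x)), lo, hi)[0]

gauss = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
laplace = lambda x: np.exp(-np.sqrt(2) * abs(x)) / np.sqrt(2)  # scale 1/sqrt(2), variance 1
flat = lambda x: 1 / (2 * np.sqrt(3))  # uniform on [-sqrt(3), sqrt(3)], variance 1

print(diff_entropy(gauss, -30, 30))               # 0.5*log(2*pi*e) ~ 1.4189
print(diff_entropy(laplace, -30, 30))             # 1 + log(sqrt(2)) ~ 1.3466
print(diff_entropy(flat, -np.sqrt(3), np.sqrt(3)))  # log(2*sqrt(3)) ~ 1.2425
```

The Gaussian attains the largest value, as the optimisation problem predicts.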
- Somehow this answer doesn't really answer the question I was going for. You have just shifted the reason elsewhere: Why is maximizing the entropy a natural thing to do? Why is it natural to define entropy as $-\int f\log f$? Why stop at the first two moments when optimizing? For instance, the distribution you get by fixing the first 3 moments and optimizing over the entropy is much less natural than the Gaussian. – pre-kidney Jun 28 '15 at 04:27
- Basically, I am looking for something that's a little more self-contained. – pre-kidney Jun 28 '15 at 04:28
- @pre-kidney I think those are fair points. I can only tell you why I chose entropy. You know from the central limit theorem that the Gaussian is the "universal" limiting probability distribution. This fits our intuitive view of how quantifiable traits in everyday life should be distributed: normally. Perhaps we think that with every piece of data, we incorporate an increasing amount of "randomness" into our estimate. One way of formalising this intuition might be to say that the entropy of the sequence $n^{-1/2} S_n$ is increasing. This is in fact true (see the sketch after these comments). – snar Jun 28 '15 at 04:50
- Actually, that is the whole point of this question: why is the Gaussian the universal limiting probability distribution? – pre-kidney Jun 28 '15 at 13:59
- Revisiting this thread a while later, I want to point out that the Gaussian is just one universality class, for sequences with finite variance. There are others ($\alpha$-stable laws) for sequences that only have finite moments of order $\alpha < 2$ (the case $\alpha = 2$ is Gaussian). – pre-kidney Aug 16 '18 at 06:15
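A crude numerical illustration of the claim above that the entropy of $n^{-1/2}S_n$ increases with $n$ (a Python sketch; the histogram entropy estimator and sample sizes are ad hoc choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def hist_entropy(samples, bins=200):
    """Crude histogram estimate of the differential entropy -int f log f."""
    density, edges = np.histogram(samples, bins=bins, density=True)
    dx = edges[1] - edges[0]
    density = density[density > 0]
    return -(density * np.log(density) * dx).sum()

# Entropy of n^{-1/2} S_n for sums of centered uniforms with variance 1.
for n in (1, 2, 4, 16, 64):
    x = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(100_000, n))
    s = x.sum(axis=1) / np.sqrt(n)
    print(n, round(hist_entropy(s), 3))  # increases toward 0.5*log(2*pi*e) ~ 1.419
```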
There are certain properties of Gaussian/normal distributions that make them appealing, beyond the simple stuff like the central limit theorem. For example, the Gaussian/normal has the maximum entropy for a given mean and variance. This says that the Gaussian/normal distribution provides the maximum overall "variation" in the entropy sense, given standard measures of mean and variance, which is appealing if you don't know exactly what distribution your samples follow and don't want to restrict the distribution too much.
Also, when you assume independent Gaussian/normal noise terms in something like a linear regression problem, the maximum likelihood solution has a simple closed-form matrix formula. The likelihood is a product of Gaussians, so its logarithm is a negative sum of squares (plus a constant); maximizing the Gaussian likelihood is therefore equivalent to minimizing the sum-of-squared errors, and the minimizer of a sum of squares is essentially always given by a "mean" in some shape or form (see the sketch below). Perhaps this is a better answer to your question, I'm not sure.
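A minimal Python sketch of this equivalence (the synthetic data, coefficients, and noise level are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: y = X @ w_true + Gaussian noise.
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=n)

# Maximizing the Gaussian log-likelihood is minimizing ||y - X w||^2,
# whose minimizer solves the normal equations (X^T X) w = X^T y.
w_mle = np.linalg.solve(X.T @ X, X.T @ y)
w_lsq = np.linalg.lstsq(X, y, rcond=None)[0]  # least-squares solver agrees
print(np.allclose(w_mle, w_lsq))  # True
print(w_mle)  # close to w_true
```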
- The "simple stuff", namely the CLT, is already sufficient to make the Gaussian extremely important. Most proofs of the CLT don't explain the reason that the Gaussian arises in a transparent manner. That is essentially what my question is about. It is non-obvious, given the form of the CLT, that a function like $\exp(-x^2/2)$ should arise! – pre-kidney Jun 28 '15 at 04:31
- There is one sneaky detail, hidden in the form of the CLT, that sheds more light on where the $\exp(-x^2/2)$ comes from: the finite variance assumption in some sense gives rise to the quadratic form of the exponent. – pre-kidney Aug 16 '18 at 06:17
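One way to make this last comment concrete: for iid mean-$0$, variance-$1$ summands with characteristic function $\varphi$, the finite second moment gives the Taylor expansion $\varphi(t/\sqrt{n})^n = \left(1 - \tfrac{t^2}{2n} + o(1/n)\right)^n \to e^{-t^2/2}$, so the quadratic exponent comes exactly from the variance term. A Monte Carlo check of this limit (a Python sketch, using $\pm 1$ coin flips as the summands):

```python
import numpy as np

rng = np.random.default_rng(0)

# Characteristic function of S_n / sqrt(n) for +-1 coin flips (mean 0, variance 1).
n, m = 100, 50_000
x = rng.choice([-1.0, 1.0], size=(m, n))
s = x.sum(axis=1) / np.sqrt(n)

for t in (0.5, 1.0, 2.0):
    phi = np.exp(1j * t * s).mean().real  # Monte Carlo estimate of E[exp(i t S_n / sqrt(n))]
    print(t, round(phi, 4), round(np.exp(-t**2 / 2), 4))  # nearly equal
```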