Occam's razor states that shorter explanations (formally speaking, hypotheses) are more likely to be correct. Indeed this can be formalized: for a hypothesis class $\mathcal H$ one may assign each hypothesis a string representation and take its literal length as a string. The shorter ones are more likely correct. But the problem is I can just rearrange the language to make perverse hypotheses shorter, at which point they will be more likely. This seems contradictory. What gives?
1 Answer
There are several ways to formalize Occam's razor in learning theory, and they yield powerful theorems. However, these theorems do not obviate the need for inductive bias, nor do they create it; that would be like asking a computer program to create entropy. They do not create the ghost in the machine. What they do is formalize inductive bias and provide useful theorems about what one may conclude from a set of hypotheses weighted by that bias.
If you get "lucky" and the inductive bias you encode happens to match the real world - which you believe is prone to happen if you subscribe to Occam's razor out-of-universe - then the theorems will do a very good job of teasing out the matching hypothesis and will even know how to make tradeoffs between good, short hypotheses and excellent, less-short hypotheses. They will then provide a bound on the error.
What if you are living through a Kuhnian paradigm shift and you are not sure which of two sets of inductive biases is better? Will Occam's razor help you? Yes and no: if you have twice the data you can study both just as before, but "twice the data" is of course asking for more information to enter the system. So the theorems cannot magically decide between Newton and Einstein.
One presentation of Occam's razor, Theorem 7.7 of Understanding Machine Learning, is the following. It is a weaker presentation in that it is designed for nonuniform agnostic PAC learnability, a weak setting in which no perfect hypothesis can ever be guaranteed, so the conclusions will also be weak.
Let $\mathcal H$ be a hypothesis class and let $d:\mathcal H \rightarrow \{0,1\}^*$ be a prefix-free description language for $\mathcal H$. Then, for every sample size $m$, every confidence parameter $\delta > 0$, and every probability distribution $\mathcal D$, with probability greater than $1 - \delta$ over the choice of $S \sim \mathcal D^m$, we have that $$\forall h \in \mathcal H, \quad L_{\mathcal D}(h) \leq L_S(h) + \sqrt{\frac{|h| + \ln(2/\delta)}{2m}}$$ where $|h|$ is the length of $d(h)$.
This gives rise to the Minimum Description Length learning paradigm: For $\mathcal H, \mathcal D, d$ as above and a loss function, output the hypothesis $h$ that minimizes $L_S(h) + \sqrt{\frac{|h| + \ln(2/\delta)}{2m}}$, i.e. the empirical loss plus the estimation error given by the theorem.
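To make the paradigm concrete, here is a minimal sketch of an MDL learner in Python. The interface is illustrative and my own assumption (a list of predictor/description-length pairs and a 0-1 loss, none of which come from the book); it simply implements the minimization above.

```python
import math

def mdl_select(hypotheses, sample, delta):
    """Return the hypothesis minimizing empirical loss plus the
    estimation-error term from the theorem above.

    `hypotheses` is a list of (predict_fn, description_length) pairs;
    this interface is illustrative, not from the book."""
    m = len(sample)

    def empirical_loss(predict):
        # 0-1 loss of `predict` on the labeled sample
        return sum(predict(x) != y for x, y in sample) / m

    def objective(hypothesis):
        predict, length = hypothesis
        penalty = math.sqrt((length + math.log(2 / delta)) / (2 * m))
        return empirical_loss(predict) + penalty

    return min(hypotheses, key=objective)
```

Note what the penalty does: a hypothesis that is one bit longer must buy that bit back with lower empirical loss, and the exchange rate gets cheaper as $m$ grows.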
So by itself the theorem isn't making any claim about which hypothesis is best. It's actually making a much more boring claim: that short hypotheses are easy to estimate well. It is possible to estimate the error of a hypothesis $h$ up to an additive term of $O(|h|^{1/2})$ (where the $O$ holds $m,\delta$ constant), and moreover, it is possible to do this for all hypotheses simultaneously. You get each and every error estimate, and the quality of an estimate degrades only as $O(|h|^{1/2})$.
Now I will convince you this is not trivial. Bear in mind that at the other extreme, if you have a single hypothesis $h$ and you want to estimate its error, that's actually really easy to do on a small number of examples. It's the question of determining the bias of a coin by flipping it a few times, which is answered by the Hoeffding/Chernoff bounds. Spoiler alert: you can achieve an estimation error of $\sqrt{\frac{\ln(2/\delta)}{2m}}$ with probability $1-\delta$. In the above notation, that's $O(1)$ in $|h|$. All you were trying to do in the Theorem was estimate the error for every $h$ equipped with some length function; there you achieved $O(|h|^{1/2})$, but here you achieved $O(1)$. What gives?
Let's try estimating $1000$ hypotheses, aiming for an estimation error of $10\%$ with probability $99\%$, just by using a simple Hoeffding bound. We calculate that we need $265$ examples to achieve this. Not bad. So the first hypothesis we estimate within $10\%$ with probability $99\%$. The second hypothesis we also estimate within $10\%$ with probability $99\%$. But wait, that means the probability that either may fail could exceed $1\%$. Even if the two estimates were independent, the probability that both succeed would only be $0.99^2 = 98.01\%$, and in general the union bound guarantees just $98\%$.
And that's not good, because we have 1000 hypotheses: $0.99^{1000} \approx 0.004\%$. Not good. And in the context of machine learning, if one hypothesis is estimated wrong, very often that hypothesis is an overfitter, so one failure is enough to ruin the whole model.
So if you do want to estimate 1000 hypotheses simultaneously, you will need more examples: run each estimate at confidence $\delta' = \delta/1000$ and take a union bound. That's not too bad, since $\ln(2/\delta')$ grows only logarithmically in the number of hypotheses.
And if you have infinitely many, you simply cannot do it by naive application of the Hoeffding bound.
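Here is the arithmetic of the last few paragraphs as a short sketch (the function name and interface are mine, for illustration):

```python
import math

def hoeffding_sample_size(epsilon, delta, num_hypotheses=1):
    # Smallest m with sqrt(ln(2k/delta) / (2m)) <= epsilon, i.e. by
    # Hoeffding plus a union bound over k hypotheses, every estimate
    # is within epsilon with probability at least 1 - delta.
    return math.ceil(math.log(2 * num_hypotheses / delta) / (2 * epsilon ** 2))

print(hoeffding_sample_size(0.10, 0.01))        # 265: a single hypothesis
print(hoeffding_sample_size(0.10, 0.01, 1000))  # 611: all 1000 simultaneously
```

Going from 1 hypothesis to 1000 costs only about $2.3\times$ more examples because the dependence is logarithmic, but as the number of hypotheses goes to infinity the bound diverges, which is exactly the failure mode just described.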
So the Theorem lets you achieve an $O(|h|^{1/2})$ estimation error for all hypotheses simultaneously. This is powerful.
What if you try to rig the game and assign a complex hypothesis a description of length just $1$? Then you will get a good estimate for this hypothesis and worse estimates for many others: the estimation-error power is conserved. If you were only interested in that hypothesis, you may as well have just drawn 265 (or however many) examples and run a Hoeffding bound on it. This is no different.
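This conservation can be made precise by a standard fact about prefix-free codes (not stated explicitly above): Kraft's inequality, which says any prefix-free description language must satisfy $$\sum_{h \in \mathcal H} 2^{-|h|} \leq 1.$$ Spending length $1$ on one hypothesis consumes $2^{-1}$ of this budget, so it can only come at the cost of making other hypotheses longer, and hence worsening their estimation-error terms.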
And if you want to try two totally different description schemes, you may do exactly that, but you will need a tighter $\delta$ (say, $\delta/2$ for each) to ensure that both succeed. Forgetting to do so is a serious mistake you can make when running a machine learning algorithm!
The MDL learner makes no guarantee about returning the best hypothesis. It guarantees to return one that satisfies a particular set of bounds. That set of bounds incorporates "out-of-universe" Occam's razor, i.e. your own inductive bias as the learner, programmer, or source of entropy; it formalizes that inductive bias into bounds that focus more attention on "simpler" hypotheses, and if your inductive bias is on track, they will yield aggressive error bounds on the returned hypothesis. I reiterate that this is the nonuniform PAC model, so it is too much to ask for an absolute guarantee on the resulting hypothesis. In that sense MDL is weak, as it must be.
A finite PAC formulation of Occam's razor is:
Fix $\mathcal H$ and fix $\epsilon, \delta$. Suppose there exists a finite $\mathcal L \subset \mathcal H$ with the following property: take $m \geq \frac1\epsilon\left(\ln|\mathcal L| + \ln\frac1\delta\right)$; if for every $S \sim \mathcal D^m$ there is an $h \in \mathcal L$ consistent with $S$, then that $h$ is $\epsilon$-accurate with confidence $1-\delta$, so $\mathcal H$ is PAC learnable with this algorithm.
i.e. if a small hypothesis class can consistently represent pretty large sets of examples, then it is PAC learnable. This is another version of Occam's razor that, being in a PAC setting, is stronger than the nonuniform PAC discussion above: if the hypothesis class is surprisingly rich for how small it is, then it actually describes the world accurately. But it demands a lot: a real-world machine learner would be hard-pressed to find a data set that clean and noiseless, and I'm not sure whether this particular "Occam's razor" theorem generalizes to harsher terrain.
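For a feel of the numbers, here is the sample-size bound from the theorem as a one-liner (the specific numbers are illustrative, not from any source):

```python
import math

def pac_sample_size(class_size, epsilon, delta):
    # m >= (1/epsilon) * (ln|L| + ln(1/delta)), the finite realizable
    # PAC bound from the theorem above.
    return math.ceil((math.log(class_size) + math.log(1 / delta)) / epsilon)

# A class of 2**20 hypotheses, 5% error, 99% confidence:
print(pac_sample_size(2 ** 20, 0.05, 0.01))  # 370
```

Notice there is no square root: in the realizable (consistent-hypothesis) setting the sample size scales like $1/\epsilon$ rather than $1/\epsilon^2$, one way to see that this formulation is stronger than the agnostic one above.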