
From Understanding Machine Learning: From Theory to Algorithms:

What does the phrase in the red box below mean in terms of set theory?

I see that it means for every $h \in H$ we have $D(|L_S(h) - L_D(h)| \le \epsilon) \ge 1 - \delta$.

But how is $|L_S(h) - L_D(h)| \le \epsilon$ a random variable?

If it's a random variable, then it should be of the form $\{a \in A : X(a) \le \epsilon\}$, where $X = |L_S(h) - L_D(h)|$ and $A$ is the sample space.

But what in this definition is $A$?
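One reading I can imagine (this is my guess, not something the excerpt states explicitly): the only source of randomness is the draw of the training set itself, $S \sim D^m$, so that $A = Z^m$ and the statement would read

$$D^m\big(\{\, S \in Z^m : \forall h \in H,\ |L_S(h) - L_D(h)| \le \epsilon \,\}\big) \ge 1 - \delta,$$

but I don't see this spelled out in the definition.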

I know that $L_D(h) = \mathbb{E}_{z \sim D}[l(h,z)] = \sum_{z \in Z} l(h,z)\, D(z)$ and $L_S(h) = \frac{1}{m}\sum_{i=1}^{m} l(h,z_i)$ where:

$l(h,z)$ is a loss function, $D$ is the distribution on $Z$, and $S = (z_1, \dots, z_m)$ is a training set.
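To make the two definitions above concrete, here is a minimal sketch in Python (a toy setup of my own, not from the book): a finite domain $Z$ of labeled points, an explicit distribution $D$, the 0-1 loss, and one fixed hypothesis $h$.

```python
# Minimal toy setup (my own, not from the book): finite Z, explicit D,
# 0-1 loss, and a single fixed hypothesis h.
import random

Z = [((0,), 0), ((1,), 1), ((2,), 1)]            # finite domain of examples z = (x, y)
D = {Z[0]: 0.5, Z[1]: 0.3, Z[2]: 0.2}            # distribution D on Z

def l(h, z):                                      # 0-1 loss l(h, z)
    x, y = z
    return 0 if h(x) == y else 1

def L_D(h):                                       # true risk: sum over z in Z of l(h, z) D(z)
    return sum(l(h, z) * D[z] for z in Z)

def L_S(h, S):                                    # empirical risk: (1/m) sum over i of l(h, z_i)
    return sum(l(h, z) for z in S) / len(S)

h = lambda x: 1 if x[0] >= 2 else 0               # fixed hypothesis; misclassifies z = ((1,), 1)

S = random.choices(Z, weights=[D[z] for z in Z], k=50)   # training set S ~ D^m with m = 50
print(L_D(h))                                     # always 0.3: the mass of the misclassified point
print(L_S(h, S))                                  # fluctuates around 0.3 from draw to draw
```

Note that once $h$ is fixed, $L_D(h)$ is a fixed number, while $L_S(h)$ changes with each redraw of $S$.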


[Image: excerpt from the book with the phrase in the red box, stating that the probability of $S$ being $\epsilon$-representative is at least $1-\delta$.]

Oliver G
  • $|L_S(h)-L_D(h)|\leq \epsilon$ is an event, not a random variable. Further, one of the interesting aspects of probability theory is that often the sample space can be shoved under the rug; we really just care about probability measures and distributions, but this is probably not what you want to hear. – Nap D. Lover Jan 19 '19 at 18:20
  • Ah, I see. So what would be the set-theoretic notation for the event $|L_S(h) - L_D(h)| \le \epsilon$? – Oliver G Jan 19 '19 at 18:54
  • Exactly as you wrote: $X=|L_S(h)-L_D(h)|$ is the random variable, and the event is denoted $\{\omega \in \Omega : X(\omega) \leq \epsilon\}$ where $\Omega$ is the sample space, or often in shorthand as just $\{X \leq \epsilon\}$, taking into account the convention about sample spaces I allude to in my first comment. I just wanted to (perhaps at the risk of pedantry) correct the terminology. One seldom needs to know what $\Omega$ is specifically to measure the set $\{X \leq \epsilon\}$. All you need is the law/distribution of $X$. – Nap D. Lover Jan 19 '19 at 19:01
  • For example, consider a $\mathcal{U}(a,b)$ random variable on some sample space $\Omega$, i.e. $U: \Omega\to \mathbb{R}$. The distribution is known, $\mathbb{P}(U\leq u)=\frac{u-a}{b-a}$ for $u \in [a,b]$, and so for any Borel set of the real line, $B$, you can compute $\mathbb{P}(U\in B)$, in practice without having to deal with $\Omega$. See https://math.stackexchange.com/questions/2531810/why-does-probability-theory-insist-on-sample-spaces for a more detailed exposition of how sample spaces are treated. – Nap D. Lover Jan 19 '19 at 19:06
  • But that's exactly what I'm trying to understand. If $X$ is that function, what is its domain? What, specifically, am I measuring the error of with this function? – Oliver G Jan 19 '19 at 19:10
  • I am sorry I did not get the point across clearly enough. I shall link two more answers that I think are excellent in terms of discussing probability theory's seemingly strange convention of handling sample spaces: https://math.stackexchange.com/a/91380/291100 – Nap D. Lover Jan 19 '19 at 19:40
  • And one more: https://math.stackexchange.com/questions/536553/how-can-it-be-meaningful-to-add-a-discrete-random-variable-to-a-continuous-rando/536662#536662 – Nap D. Lover Jan 19 '19 at 19:40
  • It seems those posts claim that it suffices to just accept that a sample space exists on which the random variables are defined. But in this particular case I'm making a measure of error of something. I'm putting something into my random variable and claiming that the error on that thing is less than $\epsilon$. Why does it suffice to just assume that that thing exists? Can't it fail to exist in some cases? Doesn't it matter what that thing is that I'm claiming has this measurable error? – Oliver G Jan 19 '19 at 20:10
  • I think it's because the RV is indexed by the hypothesis $h$ which you are confounding with the sample space. Analogous to measuring the loss $L(U,x)$ of an insurance policy with deductible $x$ and incoming uniformly distributed claims $U$. I don't care about the specific outcome $\omega$ that incurs claim $U(\omega)$ but I do care about the deductible level $x$ when computing $\mathbb{E}(L)$ or any relevant probability. Here $L:\Omega \times S\to \mathbb{R}$ where $S=(0,\infty)$ is the set of possible deductibles—$x$ is not random, nor is the hypothesis $h$ here (right?), so not in $\Omega$. – Nap D. Lover Jan 19 '19 at 20:18
  • I'll be more specific about what I'm asking:

    It says the probability of $S$ being $\epsilon$-representative is $1-\delta$, which means $P(X \le \epsilon) \ge 1 - \delta$ where $X = |L_S(h) - L_D(h)|$. But $X \le \epsilon$ is an event, which means it's a subset of some sample space of things, so it's of the form $\{a \in A : X(a) \le \epsilon\}$. What are the things that you substitute into $X$, so to speak, whose error you measure? $X$ by itself is the gap between the error of $h$ on the training set $S$ and the mean error of $h$ on new data.

    – Oliver G Jan 19 '19 at 20:37
  • But since $X$ is also a random variable, I must be substituting something into it to get the loss. What am I putting into $X$ so that I can claim that for every $h$ in my hypothesis class, I can bound the training and testing loss from that $h$ by $\epsilon$? And if the previous posts still apply to this: Why does it not matter what the sample space is here? Why can I assume that it just exists if that is the case? – Oliver G Jan 19 '19 at 20:37
  • Okay, so I was wrong about the suggested confusion, I apologize, but now we are back to where we started. Let me ask this instead: do you require knowing every $\omega \in \Omega$ when someone tells you to compute $\mathbb{P}(X\leq x)$, where $X$ is a normally distributed RV representing, say, the magnitude of an earthquake or some other complicated real-world phenomenon? Outside of finite sample spaces that are amenable to combinatorics, we must use abstraction to model the real world rather than explicit sample spaces that list every possible event the phenomenon we are modeling allows. – Nap D. Lover Jan 19 '19 at 20:58
  • To "do I require knowing every $\omega \in \Omega...$": In the example you gave, $X$ represents the magnitude of an earthquake, where $x$ is some magnitude. Therefore I know that the sample space $\Omega$ is just the possible earthquakes in some random experiment and $P(X < x) = P({\omega \in \Omega : X(\omega) \le x})$. So all I would need to know to compute $P(X \ge x)$ is what $X$ is describing, which in this case is earthquakes. The specific $\omega$'s would be important in this example because you'd have to check to see if each one has magnitude $\le x$. – Oliver G Jan 19 '19 at 21:28
  • Concerning my question: I don't know what the RV I defined previously $X = |(...)|$ is describing. $X_{\text{earthquakes}} : \text{earthquakes} \rightarrow \text{magnitudes}$, $X_{\text{mine}}: \text{?} \rightarrow \text{error for each $h$}$. – Oliver G Jan 19 '19 at 21:28
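To make the discussion in the comments concrete, here is a minimal Monte Carlo sketch (same toy setup as above, again my own construction, not from the book) of the reading in which $X = |L_S(h) - L_D(h)|$ is a function of the random draw $S \sim D^m$: the probability $P(X \le \epsilon)$ is estimated by redrawing $S$ many times and counting how often the event occurs.

```python
# Monte Carlo sketch (toy setup as above, my own construction):
# treat X = |L_S(h) - L_D(h)| as a function of the random training set S ~ D^m
# and estimate P(X <= eps) by redrawing S many times.
import random

Z = [((0,), 0), ((1,), 1), ((2,), 1)]
D = {Z[0]: 0.5, Z[1]: 0.3, Z[2]: 0.2}
h = lambda x: 1 if x[0] >= 2 else 0
l = lambda h, z: 0 if h(z[0]) == z[1] else 1

true_risk = sum(l(h, z) * D[z] for z in Z)        # L_D(h): fixed once h is fixed

m, eps, trials = 100, 0.05, 10_000
hits = 0
for _ in range(trials):
    S = random.choices(Z, weights=[D[z] for z in Z], k=m)   # one outcome: S ~ D^m
    emp_risk = sum(l(h, z) for z in S) / m        # L_S(h): depends on the draw
    if abs(emp_risk - true_risk) <= eps:          # the event {|L_S(h) - L_D(h)| <= eps}
        hits += 1
print(hits / trials)                              # estimate of P(X <= eps)
```

If this reading is right, the "thing" being substituted into $X$ is the whole training set $S$, drawn once per loop iteration.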

0 Answers