4

The CFA Quantitative Methods book uses the following formula for finding the observation in a sorted list that corresponds to a given percentile $y$ in a set of observations of size $n$:

$(n + 1)\frac{y}{100}$

It defines percentile as follows: "Given a set of observations, the yth percentile is the value at or below which y percent of observations lie."

My question is, where does the $+ 1$ come from? I can see that if you wanted to ensure that all values are below a given percentile, it is useful. It also ensures the correct value for the median. But given the definition of percentile above, I would think it should be possible to have a hundredth percentile, which would be equal to the largest value. Is the "at or below" in conflict with the $+ 1$?

3 Answers

1

In our sorted list, the indices follow a discrete uniform distribution. As you mentioned, the median is satisfied if we calculate our percentile with $(n + 1)$.

The median (50th percentile) of a discrete uniform distribution is $\frac{a + b}{2}$, where $a$ and $b$ are the endpoints of the support (a proof is easy to find online). Here $a = 1$ and $b = n$.

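Now, suppose the position of the percentile at fraction $x \in [0, 1]$ is $xp$ for some constant $p$. Matching the median at $x = 0.5$ leads us to: $0.5p = \frac{n + 1}{2} \Rightarrow p = n + 1 \Rightarrow xp = x \cdot (n + 1)$.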
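To make the median check concrete, here is a minimal Python sketch. The helper names `percentile_position` and `value_at_position` are made up for illustration, and the linear interpolation between neighbouring observations is my own assumption about how to read off a value at a fractional position; the question only gives the position formula.

```python
import statistics

def percentile_position(n, y):
    """1-based (possibly fractional) position of the y-th percentile: (n + 1) * y / 100."""
    return (n + 1) * y / 100

def value_at_position(sorted_data, pos):
    """Value at a 1-based position; fractional positions are linearly
    interpolated between the two neighbouring observations (an assumption)."""
    n = len(sorted_data)
    if not 1 <= pos <= n:
        raise ValueError("position falls outside the data")
    lower = int(pos)       # 1-based index of the observation at or just below pos
    frac = pos - lower     # how far pos sits between two observations
    if frac == 0:
        return sorted_data[lower - 1]
    below = sorted_data[lower - 1]
    above = sorted_data[lower]
    return below + frac * (above - below)

odd = [3, 7, 8, 12, 20]        # n = 5
even = [3, 7, 8, 12, 20, 25]   # n = 6

for data in (odd, even):
    pos = percentile_position(len(data), 50)
    # position, value at that position, and the library median for comparison
    print(pos, value_at_position(data, pos), statistics.median(data))
```

For both the odd and the even sample, the value at position $(n + 1) \cdot 0.5$ agrees with `statistics.median`, which is the property the derivation relies on.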

0

You answered your own question:

"I can see that if I use the first formula to calculate the 50th percentile, the +1 ensures I get the same answer as when I calculate the median."

That's a really important property for percentiles! One you should want.

Also, if you are fitting empirical data to some parametric curve, adding +1 allows for a "tail". Many curves you would fit to have infinite support, so if you did not add +1, you would be saying your last data point is at the 100%-tile, which is usually a bad assumption.

0

These $+1$ terms show up a lot in counting problems because subtraction isn't quite the opposite of counting. But this is hard to see symbolically, so it is best to use an example.

Suppose you have $99$ people who took a test. Who is in the $99^\text{th}$ percentile? Well, precisely nobody comes above the best score, and only the best scorer comes above the second-best. But one person is a bit over $1$% of the group, so everyone but the top scorer is in the bottom $98.99$%, and you want the $99^\text{th}$ percentile to be at the last (highest-scoring) person. This is what the formula gives: $(99 + 1)\frac{99}{100} = 99$.

But $\frac{ny}{100}$ would also get this result, so what gives?

Well, who is in the $100^\text{th}$ percentile? Nobody, by definition. Every single person should be below this mark, which means it cannot begin at any person. This is what the $+1$ formula gives, because $\frac{(n+1)(100)}{100}>n$. If you don't have the $+1$, then you get $\frac{n(100)}{100}=n$, which would mean the best scorer did better than all the people. That would be okay for "all the other people", but E can't have done better than emself!
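A quick numerical check of the $100^\text{th}$-percentile argument, as a rough Python sketch. The function name `position` and its `plus_one` flag are hypothetical, introduced only to put the two formulas side by side.

```python
def position(n, y, plus_one=True):
    """1-based position of the y-th percentile.

    plus_one=True  uses (n + 1) * y / 100 (the formula from the question);
    plus_one=False uses n * y / 100 (the variant without the +1).
    """
    size = n + 1 if plus_one else n
    return size * y / 100

n = 99  # the 99 test takers from the example

with_plus_one = position(n, 100)                     # (n + 1) * 100 / 100
without_plus_one = position(n, 100, plus_one=False)  # n * 100 / 100

# With the +1 the mark lies past the last person; without it,
# the mark lands exactly on the best scorer.
print(with_plus_one > n, without_plus_one == n)
```

With the $+1$ the position is $n + 1$, past every observation; without it the position is exactly $n$, i.e. the best scorer.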

Eric Stucky
  • "Well, who is in the 100th percentile? Nobody, by definition." No, not if your data have finite support. You could well have empirical data in the 100th percentile. It's when you have infinite support that this assumption is important. – Oct 13 '13 at 23:54
  • Is it true that nobody is in the 100th percentile by definition? The CFA book defines percentile as follows: "Given a set of observations, the yth percentile is the value at or below which y percent of observations lie." I would think that according to this definition (i.e., given the "at"), if there is a unique highest observation, that would be the 100th percentile. – dumb question Oct 14 '13 at 00:08
  • @dumbquestion: Well, under that definition, no, it's not, but that definition appears to be inconsistent with this formula. This becomes more apparent with 100 people with even score distribution, since it would claim the $50^\text{th}$ percentile would begin with person 50, but the formula would say it begins with person 51. So if you want to keep the interpretation of the median as the $50^\text{th}$ percentile (which you do, in fact) then I cannot justify using "at or". – Eric Stucky Oct 14 '13 at 07:37
  • @trb456: Perhaps we are using different definitions (I admit to being ignorant of conventions in statistics). I am using "The kth percentile is the subset of the data which excludes the lowest k%". Hm, but this also does not calculate medians correctly. What definition is correct? – Eric Stucky Oct 14 '13 at 07:52
  • @Eric Yes, that's the thought process I went through myself. Are these inconsistent, then? – dumb question Oct 14 '13 at 20:18
  • @dumbquestion: I'll defer that to trb since I seem to have a misunderstanding about the definitions. – Eric Stucky Oct 14 '13 at 22:09