I recently began a study of Introductory Statistics by Sheldon Ross. In the book, he defined a percentile as follows:

The sample $100p$ percentile is that data value having the property that at least $100p$ percent of the data are less than or equal to it and at least $100(1 − p)$ percent of the data values are greater than or equal to it. If two data values satisfy this condition, then the sample $100p$ percentile is the arithmetic average of these values.

This definition appears to differ from the one on Wikipedia, which defines the $k$th percentile as the smallest data value that is greater than or equal to $k$% of the data set, and which I understand intuitively.

I suspect the textbook is giving the more general definition, but I cannot see how. In particular, once we have included the “smallest value” criterion, we can uniquely locate the $100p$ percentile by referring only to the values less than or equal to it, so why do we also need to refer to the values greater than or equal to the $100p$ percentile?

ryang
  • 44,428
Toba
  • 345

2 Answers

  • What is the $40$th percentile of 0 1 2 3 ... 99?

    1. This list has $\color{red}{100}$ values, so its $\color{blue}{40}$th percentile can be defined as the value at position index $1+\color{blue}{40}\%\,(\color{red}{100}-1)=40.6.$ Linear interpolation between the 40th and 41st values (39 and 40) then gives the answer 39.60.

    2. Introductory Statistics by Sheldon Ross:

      The sample $100p$ percentile is that data value having the property that at least $100p$ percent of the data are less than or equal to it and at least $100(1 − p)$ percent of the data values are greater than or equal to it. If two data values satisfy this condition, then the sample $100p$ percentile is the arithmetic average of these values.

      This alternative algorithm gives the answer 39.5.

    3. Wikipedia defines the $k$th percentile as the smallest data value that is greater than or equal to $k$% of the data set.

      This definition gives the answer 39.

Only the simplistic third algorithm has these two issues: its $\boldsymbol0$th percentile is undefined, and it returns this list's $\boldsymbol{50}\textbf{th}$ percentile as 49 even as its median is 49.5.
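The three algorithms above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are mine, the second function applies Ross's wording literally (it counts the values on each side), and the third reads the Wikipedia definition as the nearest-rank rule with rank $\lceil pn\rceil$. Exact fractions are used for $p$ to avoid floating-point noise.

```python
from fractions import Fraction
from math import ceil, floor


def interpolated(data, p):
    """Algorithm 1: linear interpolation at position 1 + p*(n - 1)."""
    xs = sorted(data)
    h = 1 + p * (len(xs) - 1)        # 1-based, possibly fractional position
    lo = floor(h)
    hi = min(lo + 1, len(xs))        # stay in range when p = 1
    return xs[lo - 1] + (h - lo) * (xs[hi - 1] - xs[lo - 1])


def ross(data, p):
    """Algorithm 2: Ross's definition, applied literally."""
    xs = sorted(data)
    n = len(xs)
    hits = [
        v for v in sorted(set(xs))
        if sum(x <= v for x in xs) >= p * n          # at least 100p% are <= v
        and sum(x >= v for x in xs) >= (1 - p) * n   # at least 100(1-p)% are >= v
    ]
    # One qualifying value: return it. Two: return their average.
    return (Fraction(hits[0]) + Fraction(hits[-1])) / 2


def nearest_rank(data, p):
    """Algorithm 3 (assumed reading): the value at rank ceil(p*n)."""
    xs = sorted(data)
    rank = ceil(p * len(xs))
    if rank == 0:
        return None                  # the 0th percentile is undefined here
    return xs[rank - 1]


data = range(100)
p = Fraction(40, 100)
print(float(interpolated(data, p)), float(ross(data, p)), nearest_rank(data, p))
# 39.6 39.5 39
```

Calling `nearest_rank(data, Fraction(0))` returns `None`, and `nearest_rank(data, Fraction(1, 2))` returns 49 while `ross(data, Fraction(1, 2))` returns 99/2, which is the pair of issues described above.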

  • For the values 0 1, the above algorithms return the $40\text{th}$ percentile as 0.40, 0 and 0, respectively, and the $45\text{th}$ percentile as 0.45, 0 and 0, respectively.
  • For the values 0 1 2 3 ... 12, the above algorithms return the $40\text{th}$ percentile as 4.80, 5 and 5, respectively, and the $45\text{th}$ percentile as 5.40, 5 and 5, respectively.

Notice that for these two lists, only the first algorithm returns distinct values for the $\boldsymbol{40}\textbf{th}$ and $\boldsymbol{45}\textbf{th}$ percentiles.

  • For the values 0 1 2 3 ... 50, all three algorithms agree that the $40\text{th}$ percentile is 20.

Reply to Rui's comment

R function 'quantile' has nine algorithms, the first three for discrete data. This computes all three percentiles: sapply(1:3, \(tt) quantile(0:99, probs = 0.4, type = tt)).

To be clear: all nine of R's 'quantile' algorithms apply to discrete data. Types 1–3 return only sample values (that is, values from the dataset) or midpoints of successive order statistics; Types 4–9 apply linear interpolation and produce continuous outputs. For example, with reference to the third list above, running sapply(1:9, function(tt) quantile(0:12, probs = 0.4, type = tt)) returns 5.000000 5.000000 4.000000 4.200000 4.700000 4.600000 4.800000 4.666667 4.675000.
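To see where the six interpolated values come from, here is a rough Python sketch of Types 4–9. It assumes the usual parametrisation in which each type interpolates at position $h = np + m$ and differs only in the offset $m$. The offsets in the table are written from my recollection of the R documentation for quantile, so treat them as assumptions to check against it; the names `OFFSETS` and `quantile_sketch` are mine.

```python
from fractions import Fraction
from math import floor

# Assumed offset m(p) for each interpolating type; position is h = n*p + m.
OFFSETS = {
    4: lambda p: Fraction(0),
    5: lambda p: Fraction(1, 2),
    6: lambda p: p,
    7: lambda p: 1 - p,
    8: lambda p: (p + 1) / 3,
    9: lambda p: p / 4 + Fraction(3, 8),
}


def quantile_sketch(data, p, kind):
    """Interpolate between order statistics at 1-based position h."""
    xs = sorted(data)
    n = len(xs)
    h = n * p + OFFSETS[kind](p)
    j = floor(h)
    if j < 1:                 # position falls before the first value
        return xs[0]
    if j >= n:                # position falls at or after the last value
        return xs[-1]
    return xs[j - 1] + (h - j) * (xs[j] - xs[j - 1])


for kind in range(4, 10):
    print(kind, float(quantile_sketch(range(13), Fraction(2, 5), kind)))
```

For the list 0 1 2 ... 12 with $p = 0.4$ the positions are 5.2, 5.7, 5.6, 5.8, 5.6667 and 5.675, so the sketch gives 4.2, 4.7, 4.6, 4.8, 14/3 and 4.675, in line with the R output quoted above.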

Regardless of whether the population is discrete or continuous, R defaults to Type $\boldsymbol7,$ which corresponds to the first algorithm above.

ryang
  • 44,428
  • R function quantile has 9 algorithms, the first 3 for discontinuous data. This computes all three percentiles: sapply(1:3, \(tt) quantile(0:99, probs = 0.4, type = tt)). (Upvote.) – Rui Barradas Apr 20 '25 at 17:52

I have not read Introductory Statistics by Sheldon Ross, so I cannot say for certain what the author is trying to convey. In general terms, as I am sure you already know, if $p$ is the probability of an event occurring, then the probability of that event not occurring is $1 - p$.

I learned that the percentile rank of a value is the percentage of values in the dataset that are lower than that value: $$PR = \frac{k}{n} \times 100,$$ where $k$ is the number of values below it, $n$ is the total number of values, and $PR$ is the percentile rank.

Going the other way, a desired percentile corresponds to a rank position in the ordered dataset: $$P = \frac{n \times p}{100},$$ where $n$ is the total number of values, $p$ is the desired percentile (say, the 40th in this case), and $P$ is the rank position in the dataset.
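As a quick check of the two formulas on the list 0 1 2 3 ... 99, here is a small Python sketch. The function names are only illustrative, and "lower than" is read as strictly less than.

```python
def percentile_rank(data, value):
    """PR = k / n * 100, with k the number of values strictly below `value`."""
    k = sum(x < value for x in data)
    return 100 * k / len(data)


def rank_position(n, p):
    """P = n * p / 100: rank position for the p-th percentile."""
    return n * p / 100


data = list(range(100))
print(percentile_rank(data, 39))      # 39.0
print(rank_position(len(data), 40))   # 40.0
```

So the value 39 has percentile rank 39, and the 40th percentile sits at rank position 40 in the ordered list.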

As for your closing question about why it is necessary to refer to values greater than the $100p$ percentile, one analogy is wanting to know, for example, which students scored at or above the 95th percentile.

Hope my logic is clear.

fjm
  • 1
  • To be clear: the algorithm in Sheldon Ross's textbook doesn't correspond to the rank-position formula that you gave. For instance, the former returns the $40$th percentile of 0 1 2 3 ... 99 as 39.5, whereas your formula returns an integer. – ryang Apr 21 '25 at 11:35