These two ranking functions implement two slightly different ways to compute a percentile. For each x_i in x:
- cume_dist(x) counts the total number of values less than or equal to x_i, and divides it by the number of observations.
- percent_rank(x) counts the total number of values less than x_i, and divides it by the number of observations minus 1.
In both cases, missing values are ignored when counting the number of observations.
Arguments
- x
A vector to rank.
By default, the smallest values will get the smallest ranks. Use desc() to reverse the direction so the largest values get the smallest ranks.
Missing values will be given rank NA. Use coalesce(x, Inf) or coalesce(x, -Inf) if you want to treat them as the largest or smallest values respectively.
To rank by multiple columns at once, supply a data frame.
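The two adjustments described for x can be tried out with a short sketch. It assumes dplyr is attached; the input vectors and the variable names rev_cd and na_last are made up for illustration.

```r
library(dplyr)

# Illustrative input, not taken from any real data set
x <- c(5, 1, 3, 2, 2)

# Reverse the direction: the largest value now gets the smallest rank
rev_cd <- cume_dist(desc(x))
rev_cd
#> [1] 0.2 1.0 0.4 0.8 0.8

# Treat a missing value as the largest value instead of giving it rank NA
y <- c(5, 1, NA, 2, 2)
na_last <- percent_rank(coalesce(y, Inf))
na_last
#> [1] 0.75 0.00 1.00 0.25 0.25
```

Using coalesce(y, -Inf) instead would place the missing value at the bottom of the ranking rather than the top.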
See also
Other ranking functions: ntile(), row_number()
Examples
x <- c(5, 1, 3, 2, 2)
cume_dist(x)
#> [1] 1.0 0.2 0.8 0.6 0.6
percent_rank(x)
#> [1] 1.00 0.00 0.75 0.25 0.25
# You can understand what's going on by computing it by hand
sapply(x, function(xi) sum(x <= xi) / length(x))
#> [1] 1.0 0.2 0.8 0.6 0.6
sapply(x, function(xi) sum(x < xi) / (length(x) - 1))
#> [1] 1.00 0.00 0.75 0.25 0.25
# The real computations are a little more complex in order to
# correctly deal with missing values
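
To make the note about missing values concrete, here is a minimal base R sketch of the rule stated in the description: missing values are left out of the observation count and receive rank NA themselves. The helper names cume_dist_by_hand() and percent_rank_by_hand() are hypothetical, and this is not necessarily how the package computes the result internally.

```r
# Illustrative input: the example vector with one missing value added
x <- c(5, 1, NA, 3, 2, 2)

# Hypothetical helper: share of non-missing values <= x_i
cume_dist_by_hand <- function(x) {
  n <- sum(!is.na(x))  # observations, ignoring missing values
  sapply(x, function(xi) {
    if (is.na(xi)) NA_real_ else sum(x <= xi, na.rm = TRUE) / n
  })
}

# Hypothetical helper: share of non-missing values < x_i, out of n - 1
percent_rank_by_hand <- function(x) {
  n <- sum(!is.na(x))
  sapply(x, function(xi) {
    if (is.na(xi)) NA_real_ else sum(x < xi, na.rm = TRUE) / (n - 1)
  })
}

cume_dist_by_hand(x)
#> [1] 1.0 0.2  NA 0.8 0.6 0.6
percent_rank_by_hand(x)
#> [1] 1.00 0.00   NA 0.75 0.25 0.25
```

The non-missing positions match the results for c(5, 1, 3, 2, 2) above, because the NA contributes neither to the counts nor to the number of observations.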