
Recoding columns and replacing values
Source:vignettes/recoding-replacing.Rmd
recoding-replacing.RmdIntroduction
dplyr provides a family of functions for recoding columns and replacing values within a column. These are extremely common operations, so mastering this family can be a big productivity boost!
Before we begin, it’ll be helpful to define exactly what we mean by recoding vs replacing:
Recoding a column creates an entirely new column using values from an existing column. The new column may have a different type from the original column.
Replacing values within a column partially updates an existing column with new values. The result has the same type as the original column.
The family of functions can be summarized by the following table:
| Recoding | Replacing | |
|---|---|---|
| Match with conditions | case_when() |
replace_when() |
| Match with values | recode_values() |
replace_values() |
This vignette walks through use cases for each of these functions, which should help you build some intuition about when to use them.
case_when()
case_when() is the most general function in the family.
It works by evaluating each case sequentially and using the first match
for each element to determine the corresponding value in the output. To
demonstrate, we’ll look at a dataset of some 5k times in minutes:
set.seed(123)
racers <- tibble(
id = seq_len(100),
time = round(sample(1200:2100, size = 100, replace = TRUE) / 60, 2)
)
racers
#> # A tibble: 100 × 2
#> id time
#> <int> <dbl>
#> 1 1 26.9
#> 2 2 27.7
#> 3 3 23.0
#> 4 4 28.8
#> 5 5 23.2
#> 6 6 33.6
#> 7 7 22.0
#> 8 8 25.0
#> 9 9 23.8
#> 10 10 24.0
#> # ℹ 90 more rowsWe can use case_when() to categorize these times into
tiers:
tiers <- racers |>
mutate(
tier = case_when(
time < 23 ~ "A",
time < 27 ~ "B",
time < 30 ~ "C",
time < 33 ~ "D"
)
)
tiers
#> # A tibble: 100 × 3
#> id time tier
#> <int> <dbl> <chr>
#> 1 1 26.9 B
#> 2 2 27.7 C
#> 3 3 23.0 A
#> 4 4 28.8 C
#> 5 5 23.2 B
#> 6 6 33.6 NA
#> 7 7 22.0 A
#> 8 8 25.0 B
#> 9 9 23.8 B
#> 10 10 24.0 B
#> # ℹ 90 more rowsThere’s a few things to note here:
The first condition that is
TRUEis used, i.e. a time of 21 minutes meets all of the conditions, but would be placed in tierAbecausetime < 23is listed first.Unmatched values fall through as
NA. We have some racers above 33 minutes that aren’t captured here!
There are a few options for dealing with unmatched locations. You can
leave them as NA if that makes sense for your use case, or
you can specify a .default value:
racers |>
mutate(
tier = case_when(
time < 23 ~ "A",
time < 27 ~ "B",
time < 30 ~ "C",
time < 33 ~ "D",
.default = "unknown"
)
)
#> # A tibble: 100 × 3
#> id time tier
#> <int> <dbl> <chr>
#> 1 1 26.9 B
#> 2 2 27.7 C
#> 3 3 23.0 A
#> 4 4 28.8 C
#> 5 5 23.2 B
#> 6 6 33.6 unknown
#> 7 7 22.0 A
#> 8 8 25.0 B
#> 9 9 23.8 B
#> 10 10 24.0 B
#> # ℹ 90 more rowsIf you are confident that you’ve captured every case, you can supply
.unmatched = "error" rather than .default and
case_when() will error if that assertion doesn’t hold. This
is great for defensive programming!
racers |>
mutate(
tier = case_when(
time < 23 ~ "A",
time < 27 ~ "B",
time < 30 ~ "C",
time < 33 ~ "D",
.unmatched = "error"
)
)
#> Error in `mutate()`:
#> ℹ In argument: `tier = case_when(...)`.
#> Caused by error in `case_when()`:
#> ! Each location must be matched.
#> ✖ Locations 6, 22, 32, 34, 40, 44, 55, 56, 73, 79, 84, and 93 are unmatched.Note that missing values must be explicitly handled when setting
.unmatched = "error", even if that’s just
is.na(time) ~ NA, otherwise they will trigger the unmatched
error.
replace_when()
Let’s assume that some of our racers used banned running shoes and are disqualified. Also, some racers had a false start and need to incur a 20 second (1/3 minute) penalty.
We could add this information in a few ways. With
case_when():
racers |>
mutate(
time = case_when(
id %in% id_banned_shoes ~ NA,
id %in% id_false_start ~ time + 1 / 3,
.default = time
)
)
#> # A tibble: 100 × 2
#> id time
#> <int> <dbl>
#> 1 1 27.2
#> 2 2 NA
#> 3 3 23.0
#> 4 4 28.8
#> 5 5 23.6
#> 6 6 33.6
#> 7 7 22.0
#> 8 8 25.0
#> 9 9 23.8
#> 10 10 NA
#> # ℹ 90 more rowsOr in two steps with if_else():
racers |>
mutate(time = if_else(id %in% id_banned_shoes, NA, time)) |>
mutate(time = if_else(id %in% id_false_start, time + 1 / 3, time))
#> # A tibble: 100 × 2
#> id time
#> <int> <dbl>
#> 1 1 27.2
#> 2 2 NA
#> 3 3 23.0
#> 4 4 28.8
#> 5 5 23.6
#> 6 6 33.6
#> 7 7 22.0
#> 8 8 25.0
#> 9 9 23.8
#> 10 10 NA
#> # ℹ 90 more rowsNeither of these feel particularly elegant at expressing the
intent of this operation. All you’re trying to do is replace a
few values of time! We like to think of time
as the primary input: time goes in, and
time comes out (slightly adjusted). But both
case_when() and if_else() have
time as their last input, making the intent a bit hard to
understand at first glance.
replace_when() lets you pull the primary input to the
front (which also makes it compatible with the pipe!), making the intent
more clear:
racers |>
mutate(
time = time |>
replace_when(
id %in% id_banned_shoes ~ NA,
id %in% id_false_start ~ time + 1 / 3
)
)
#> # A tibble: 100 × 2
#> id time
#> <int> <dbl>
#> 1 1 27.2
#> 2 2 NA
#> 3 3 23.0
#> 4 4 28.8
#> 5 5 23.6
#> 6 6 33.6
#> 7 7 22.0
#> 8 8 25.0
#> 9 9 23.8
#> 10 10 NA
#> # ℹ 90 more rowsAs a side note, you might have been tempted to reach for
base::replace() here, i.e. as:
racers |>
mutate(time = base::replace(time, id %in% id_banned_shoes, NA)) |>
mutate(time = base::replace(time, id %in% id_false_start, time + 1 / 3))This actually doesn’t work! Replacing with NA does work,
but replace() requires that the result of
time + 1 / 3 must be preemptively subset to the places
where the condition is true. You’d have to do something more complicated
to mimic replace_when():
racers |>
mutate(time = base::replace(time, id %in% id_banned_shoes, NA)) |>
mutate(time = {
loc <- id %in% id_false_start
base::replace(time, loc, time[loc] + 1 / 3)
})
#> # A tibble: 100 × 2
#> id time
#> <int> <dbl>
#> 1 1 27.2
#> 2 2 NA
#> 3 3 23.0
#> 4 4 28.8
#> 5 5 23.6
#> 6 6 33.6
#> 7 7 22.0
#> 8 8 25.0
#> 9 9 23.8
#> 10 10 NA
#> # ℹ 90 more rowsType stability
Beyond readability, an important benefit of
replace_when() (and replace_values(), which
we’ll see later) is that it is type stable on the column you
are modifying, which means that it can’t change types out from under
you.
Type stability is particularly useful with factors. Taking another
look at our tiers of race times, imagine that some of the
race times were discovered to be faulty due to malfunctioning timers,
and you need to replace a few ids with the
unknown level.
id_with_malfunction <- c(1, 5, 20, 50)
tiers <- racers |>
mutate(
tier = case_when(
time < 23 ~ "A",
time < 27 ~ "B",
time < 30 ~ "C",
time < 33 ~ "D",
.default = "unknown"
) |>
factor(levels = c("A", "B", "C", "D", "unknown"))
)
tiers
#> # A tibble: 100 × 3
#> id time tier
#> <int> <dbl> <fct>
#> 1 1 26.9 B
#> 2 2 27.7 C
#> 3 3 23.0 A
#> 4 4 28.8 C
#> 5 5 23.2 B
#> 6 6 33.6 unknown
#> 7 7 22.0 A
#> 8 8 25.0 B
#> 9 9 23.8 B
#> 10 10 24.0 B
#> # ℹ 90 more rowsNote that the following case_when() solution results in
tier becoming a character column, losing its
factor class. This is due to the fact that case_when() is a
recoding function, it creates an entirely new column and
doesn’t know that you’re trying to retain existing type information.
tiers |>
mutate(
tier = case_when(id %in% id_with_malfunction ~ "unknown", .default = tier)
)
#> # A tibble: 100 × 3
#> id time tier
#> <int> <dbl> <chr>
#> 1 1 26.9 unknown
#> 2 2 27.7 C
#> 3 3 23.0 A
#> 4 4 28.8 C
#> 5 5 23.2 unknown
#> 6 6 33.6 unknown
#> 7 7 22.0 A
#> 8 8 25.0 B
#> 9 9 23.8 B
#> 10 10 24.0 B
#> # ℹ 90 more rowsAs a replacing function, replace_when() knows
to be type stable on tier, and casts "unknown"
to tier’s factor type before performing the
replacement:
tiers |>
mutate(
tier = tier |> replace_when(id %in% id_with_malfunction ~ "unknown")
)
#> # A tibble: 100 × 3
#> id time tier
#> <int> <dbl> <fct>
#> 1 1 26.9 unknown
#> 2 2 27.7 C
#> 3 3 23.0 A
#> 4 4 28.8 C
#> 5 5 23.2 unknown
#> 6 6 33.6 unknown
#> 7 7 22.0 A
#> 8 8 25.0 B
#> 9 9 23.8 B
#> 10 10 24.0 B
#> # ℹ 90 more rows
recode_values()
case_when() and replace_when() both take
logical vectors on the left-hand side of the formula. This is
very flexible, but sometimes these functions require a large amount of
repetition. Consider the following Likert scale
scores. We’d like to recode these from their numeric values to their
character counterparts.
We could certainly use a case_when():
likert |>
mutate(
score = case_when(
score == 1 ~ "Strongly disagree",
score == 2 ~ "Disagree",
score == 3 ~ "Neutral",
score == 4 ~ "Agree",
score == 5 ~ "Strongly agree"
)
)
#> # A tibble: 9 × 1
#> score
#> <chr>
#> 1 Strongly disagree
#> 2 Disagree
#> 3 Neutral
#> 4 Agree
#> 5 Strongly agree
#> 6 Disagree
#> 7 Neutral
#> 8 Strongly disagree
#> 9 AgreeBut score == is repeated many times! If you find
yourself using == or %in% on the left-hand
side in this manner, you likely want to use recode_values()
instead. Rather than taking logical vectors,
recode_values() takes values on the left-hand side
to match against a single input that you’ll provide as the first
argument.
likert |>
mutate(
score = score |>
recode_values(
1 ~ "Strongly disagree",
2 ~ "Disagree",
3 ~ "Neutral",
4 ~ "Agree",
5 ~ "Strongly agree"
)
)
#> # A tibble: 9 × 1
#> score
#> <chr>
#> 1 Strongly disagree
#> 2 Disagree
#> 3 Neutral
#> 4 Agree
#> 5 Strongly agree
#> 6 Disagree
#> 7 Neutral
#> 8 Strongly disagree
#> 9 AgreeThis removes all of the repetition, allowing you to focus on the
mapping. If you squint, the mapping should look roughly like a lookup
table between the numeric value and the likert encoding.
recode_values() actually has a second interface that allows
us to make this lookup table representation even more explicit.
Using a tribble(), we can extract out the lookup table
into its own standalone data frame.
lookup <- tribble(
~from , ~to ,
1 , "Strongly disagree" ,
2 , "Disagree" ,
3 , "Neutral" ,
4 , "Agree" ,
5 , "Strongly agree"
)We can then utilize the alternative from and
to arguments of recode_values() rather than
supplying formulas to specify how the values should be recoded:
likert |>
mutate(score = recode_values(score, from = lookup$from, to = lookup$to))
#> # A tibble: 9 × 1
#> score
#> <chr>
#> 1 Strongly disagree
#> 2 Disagree
#> 3 Neutral
#> 4 Agree
#> 5 Strongly agree
#> 6 Disagree
#> 7 Neutral
#> 8 Strongly disagree
#> 9 AgreeLifting the lookup table to the top of the file is particularly nice when you have a long pipe chain. The details of the mapping get some room to breathe, and in the pipe chain you can focus on the actual manipulations you are trying to perform.
It’s also very common for your lookup table to exist in
a CSV file that you have to read in separately. In that case, you can
replace the tribble() call with:
lookup <- readr::read_csv("lookup.csv")But everything else works the same. This would be quite hard to specify with just the formula interface!
Like case_when(), recode_values() also has
default and unmatched arguments to handle
unmatched locations:
likert <- tibble(
score = c(0, 1, 2, 2, 4, 5, 2, 3, 1, 4)
)
# Missed the `0`
likert |>
mutate(
score = score |>
recode_values(
from = lookup$from,
to = lookup$to,
unmatched = "error"
)
)
#> Error in `mutate()`:
#> ℹ In argument: `score = recode_values(score, from =
#> lookup$from, to = lookup$to, unmatched = "error")`.
#> Caused by error in `recode_values()`:
#> ! Each location must be matched.
#> ✖ Location 1 is unmatched.
replace_values()
As seen above, when replacing a few locations in a column using
logical conditions, we reached for replace_when()
rather than case_when(). Similarly, when replacing a few
locations using values to match against, it’s best to use
replace_values() over recode_values().
Imagine we’d like to collapse some, but not all, of these school names into common buckets:
schools <- tibble(
name = c(
"UNC",
"Chapel Hill",
NA,
"Duke",
"Duke University",
"UNC",
"NC State",
"ECU"
)
)We could use recode_values():
schools |>
mutate(
name = recode_values(
name,
c("UNC", "Chapel Hill") ~ "UNC Chapel Hill",
c("Duke", "Duke University") ~ "Duke",
default = name
)
)
#> # A tibble: 8 × 1
#> name
#> <chr>
#> 1 UNC Chapel Hill
#> 2 UNC Chapel Hill
#> 3 NA
#> 4 Duke
#> 5 Duke
#> 6 UNC Chapel Hill
#> 7 NC State
#> 8 ECUBut this “partial update by value” is so common that it really
deserves its own name that doesn’t require you to specify
default. For that, we have
replace_values():
schools |>
mutate(
name = name |>
replace_values(
c("UNC", "Chapel Hill") ~ "UNC Chapel Hill",
c("Duke", "Duke University") ~ "Duke"
)
)
#> # A tibble: 8 × 1
#> name
#> <chr>
#> 1 UNC Chapel Hill
#> 2 UNC Chapel Hill
#> 3 NA
#> 4 Duke
#> 5 Duke
#> 6 UNC Chapel Hill
#> 7 NC State
#> 8 ECULike recode_values(), replace_values() has
an alternative from and to API that works well
with lookup tables and allows you to move your mapping out of the pipe
chain:
lookup <- tribble(
~from , ~to ,
"UNC" , "UNC Chapel Hill" ,
"Chapel Hill" , "UNC Chapel Hill" ,
"Duke" , "Duke" ,
"Duke University" , "Duke"
)
schools |>
mutate(name = replace_values(name, from = lookup$from, to = lookup$to))
#> # A tibble: 8 × 1
#> name
#> <chr>
#> 1 UNC Chapel Hill
#> 2 UNC Chapel Hill
#> 3 NA
#> 4 Duke
#> 5 Duke
#> 6 UNC Chapel Hill
#> 7 NC State
#> 8 ECUAn extremely neat feature of the from and
to API is that they also take lists of vectors
that describe the mapping, which has been designed to work elegantly
with the fact that tribble() can create list columns,
allowing you to further collapse this lookup table:
# Condensed lookup table with a `many:1` mapping per row
lookup <- tribble(
~from , ~to ,
c("UNC", "Chapel Hill") , "UNC Chapel Hill" ,
c("Duke", "Duke University") , "Duke"
)
# Note that `from` is a list column
lookup
#> # A tibble: 2 × 2
#> from to
#> <list> <chr>
#> 1 <chr [2]> UNC Chapel Hill
#> 2 <chr [2]> Duke
lookup$from
#> [[1]]
#> [1] "UNC" "Chapel Hill"
#>
#> [[2]]
#> [1] "Duke" "Duke University"
# Works the same as before
schools |>
mutate(name = replace_values(name, from = lookup$from, to = lookup$to))
#> # A tibble: 8 × 1
#> name
#> <chr>
#> 1 UNC Chapel Hill
#> 2 UNC Chapel Hill
#> 3 NA
#> 4 Duke
#> 5 Duke
#> 6 UNC Chapel Hill
#> 7 NC State
#> 8 ECUComparisons
We’ll end this vignette with some comparisons of the recoding and replacing family to other dplyr functions and to other technologies, like SQL.
if_else()
if_else() is a type of recoding function, as it creates
an entirely new column. In fact, it’s closely tied to
case_when():
if_else(condition, true, false, missing)
case_when(
condition ~ true,
!condition ~ false,
is.na(condition) ~ missing
)Similar to case_when(), if_else() doesn’t
offer type stability on any particular input. The output’s type is
computed as the common type of true, false,
and missing. If you find yourself writing an
if_else() where the purpose is to partially update an
existing column, consider using replace_when() instead for
clarity and type stability:
x <- if_else(x > 5, new, x)
# Type stable on `x`.
# Intent of "partially updating" `x` is clear.
# Pipe friendly.
x <- x |> replace_when(x > 5 ~ new)
coalesce()
For converting from NA to some other value, the most
common cases of coalesce() are often a
replace_values() call in disguise:
x <- c(1, 2, NA, 3, NA, 5)
y <- c(0, 3, 1, 4, 6, 7)
coalesce(x, 0)
#> [1] 1 2 0 3 0 5
replace_values(x, NA ~ 0)
#> [1] 1 2 0 3 0 5
coalesce(x, y)
#> [1] 1 2 1 3 6 5
replace_values(x, NA ~ y)
#> [1] 1 2 1 3 6 5And with replace_values() you can replace any value, not
just NA.
na_if()
For converting from a problematic value to NA,
replace_values() is a more flexible (and likely more
intuitive) alternative to na_if():
x <- c(1, 2, 0, -99, 12)
# To convert `0` and `-99` to `NA`, you have to do it in two calls
x |> na_if(0) |> na_if(-99)
#> [1] 1 2 NA NA 12
x |> replace_values(from = c(0, -99), to = NA)
#> [1] 1 2 NA NA 12SQL
case_when() is an R equivalent of SQL’s Searched
CASE statement:
case_when(
x < 100 ~ this,
x < 20 ~ that,
.default = default
)And dbplyr will translate a case_when() to this
form!
recode_values() is an R equivalent of SQL’s Simple
CASE statement:
recode_values(
x,
"E" ~ "East",
"W" ~ "West",
"N" ~ "North",
"S" ~ "South",
.default = "Unknown"
)CASE x
WHEN 'E' THEN 'East'
WHEN 'W' THEN 'West'
WHEN 'N' THEN 'North'
WHEN 'S' THEN 'South'
ELSE 'Unknown'
ENDAs of dbplyr 2.5.1, we don’t currently have a translation for
recode_values() since it is so new, but we expect to have
one soon.