recode_values()
and replace_values()
provide two ways to map old values
to new values. They work by matching values against x
and using the first
match to determine the corresponding value in the output vector. You can also
think of these functions as a way to use a lookup table to recode a vector.
Use
recode_values()
when creating an entirely new vector.Use
replace_values()
when partially updating an existing vector.
If you are just replacing a few values within an existing vector, then
replace_values()
is always a better choice because it is type stable and
better expresses intent.
A major difference between the two functions is what happens when no cases match:
recode_values()
falls through to adefault
.replace_values()
retains the original values fromx
.
These functions have two mutually exclusive ways to use them:
A formula-based approach, i.e.
recode_values(x, from1 ~ to1, from2 ~ to2)
, similar tocase_when()
, which is useful when you have a small number of cases.A vector-based approach, i.e.
recode_values(x, from = from, to = to)
, which is useful when you have a pre-built lookup table (which may come from an external source, like a CSV file).
Usage
recode_values(
x,
...,
from = NULL,
to = NULL,
default = NULL,
unmatched = "default",
ptype = NULL
)
replace_values(x, ..., from = NULL, to = NULL)
Arguments
- x
A vector.
- ...
<
dynamic-dots
> A sequence of two-sided formulas. The left hand side (LHS) determines which values match this case. The right hand side (RHS) provides the replacement value.The LHS inputs can be any size, but will be cast to the type of
x
.The RHS inputs will be recycled to the same size as
x
. Forrecode_values()
they will be cast to their common type, and forreplace_values()
they will be cast to the type ofx
.
NULL
inputs are ignored.Mutually exclusive with
from
andto
.- from
Values to look up in
x
and map to values into
.Typically this is a single vector of any size that is cast to the type of
x
. For more advanced usage, this can be a list of vectors of any size each of which are cast to the type ofx
.Mutually exclusive with
...
.- to
Values that
from
map to.Typically this is a single vector that is recycled to the size of
from
. For more advanced usage, this can be a list of vectors each of which are recycled to the size ofx
.Mutually exclusive with
...
.- default
Default value to use when there is a value present in
x
that is unmatched by a value infrom
.By default, a missing value is used as the default value.
If supplied, will be recycled to the size of
x
.Can only be set when
unmatched = "default"
.- unmatched
Handling of unmatched locations.
One of:
"default"
to usedefault
in unmatched locations."error"
to error when there are unmatched locations.
- ptype
An optional override for the output type, which is usually computed as the common type of
to
anddefault
.
Value
A vector the same size as x
.
For
recode_values()
, the type of the output is computed as the common type ofto
anddefault
, unless overridden byptype
. The names of the output come from the names ofto
anddefault
.For
replace_values()
, the type of the output will have the same type asx
. The names of the output will be the same as the names ofx
.
Examples
x <- c("NC", "NYC", "CA", NA, "NYC", "Unknown")
# `recode_values()` is useful for fully recoding from one set of values to
# another, creating an entirely new vector in the process. Note that any
# unmatched values result in `NA`, or a `default` value.
recode_values(
x,
"NC" ~ "North Carolina",
"NYC" ~ "New York",
"CA" ~ "California"
)
#> [1] "North Carolina" "New York" "California" NA
#> [5] "New York" NA
recode_values(
x,
"NC" ~ "North Carolina",
"NYC" ~ "New York",
"CA" ~ "California",
default = "<not recorded>"
)
#> [1] "North Carolina" "New York" "California" "<not recorded>"
#> [5] "New York" "<not recorded>"
# `replace_values()` is useful for updating an existing vector, tweaking a
# few values along the way
replace_values(x, "NYC" ~ "NY")
#> [1] "NC" "NY" "CA" NA "NY" "Unknown"
# `replace_values()` is particularly nice for replacing `NA`s with values...
replace_values(x, NA ~ "Unknown (NA)")
#> [1] "NC" "NYC" "CA" "Unknown (NA)"
#> [5] "NYC" "Unknown"
# ...or values with `NA`s
replace_values(x, "Unknown" ~ NA)
#> [1] "NC" "NYC" "CA" NA "NYC" NA
# Multiple values can be grouped within a single left-hand side to normalize
# all problematic values at once
replace_values(x, c(NA, "Unknown") ~ "<not recorded>")
#> [1] "NC" "NYC" "CA" "<not recorded>"
#> [5] "NYC" "<not recorded>"
# ---------------------------------------------------------------------------
# Lookup tables
# `recode_values()` works with more than just character vectors. Imagine you
# have this series of Likert Scale scores, which is a scoring system that is
# ordered from 1-5.
data <- tibble(
score = c(1, 2, 3, 4, 5, 2, 3, 1, 4)
)
# To recode each `score` to its corresponding Likert Score label, you may
# initially be inclined to reach for `case_when()`
data |>
mutate(
score = case_when(
score == 1 ~ "Strongly disagree",
score == 2 ~ "Disagree",
score == 3 ~ "Neutral",
score == 4 ~ "Agree",
score == 5 ~ "Strongly agree"
)
)
#> # A tibble: 9 × 1
#> score
#> <chr>
#> 1 Strongly disagree
#> 2 Disagree
#> 3 Neutral
#> 4 Agree
#> 5 Strongly agree
#> 6 Disagree
#> 7 Neutral
#> 8 Strongly disagree
#> 9 Agree
# While this works, it can be written more efficiently using
# `recode_values()`
data |>
mutate(
score = score |>
recode_values(
1 ~ "Strongly disagree",
2 ~ "Disagree",
3 ~ "Neutral",
4 ~ "Agree",
5 ~ "Strongly agree"
)
)
#> # A tibble: 9 × 1
#> score
#> <chr>
#> 1 Strongly disagree
#> 2 Disagree
#> 3 Neutral
#> 4 Agree
#> 5 Strongly agree
#> 6 Disagree
#> 7 Neutral
#> 8 Strongly disagree
#> 9 Agree
# `recode_values()` actually has two mutually exclusive APIs. The formula API
# used above, which is like `case_when()`, and a lookup style API that uses
# `from` and `to` arguments. The lookup API is even better suited for this
# problem, because we can move the mapping outside of the `mutate()` call
# into a standalone lookup table. You could even imagine reading this
# `likert` lookup table in from a separate CSV file.
likert <- tribble(
~from, ~to,
1, "Strongly disagree",
2, "Disagree",
3, "Neutral",
4, "Agree",
5, "Strongly agree"
)
data |>
mutate(score = recode_values(score, from = likert$from, to = likert$to))
#> # A tibble: 9 × 1
#> score
#> <chr>
#> 1 Strongly disagree
#> 2 Disagree
#> 3 Neutral
#> 4 Agree
#> 5 Strongly agree
#> 6 Disagree
#> 7 Neutral
#> 8 Strongly disagree
#> 9 Agree
# The `unmatched` argument allows you to assert that you believe that you've
# recoded all of the cases and will error if you've missed one, adding an
# extra layer of safety
data_with_zero <- add_row(data, score = 0)
try({
recode_values(
data_with_zero$score,
from = likert$from,
to = likert$to,
unmatched = "error"
)
})
#> Error in recode_values(data_with_zero$score, from = likert$from, to = likert$to, :
#> Each location must be matched.
#> ✖ Location 10 is unmatched.
# Note that missing values are considered unmatched. If you expect missing
# values, you'll need to handle them explicitly in your lookup table.
data_with_missing <- add_row(data, score = NA)
try({
recode_values(
data_with_missing$score,
from = likert$from,
to = likert$to,
unmatched = "error"
)
})
#> Error in recode_values(data_with_missing$score, from = likert$from, to = likert$to, :
#> Each location must be matched.
#> ✖ Location 10 is unmatched.
likert <- add_row(likert, from = NA, to = NA)
recode_values(
data_with_missing$score,
from = likert$from,
to = likert$to,
unmatched = "error"
)
#> [1] "Strongly disagree" "Disagree" "Neutral"
#> [4] "Agree" "Strongly agree" "Disagree"
#> [7] "Neutral" "Strongly disagree" "Agree"
#> [10] NA
# ------------------------------------------------------------------------------
# Lists of vectors
# In some cases, your mapping may collapse multiple groups together into a
# single value. For example, here we'd like to standardize the school names.
schools <- c(
"UNC",
"Chapel Hill",
NA,
"Duke",
"Duke University",
"UNC",
"NC State",
"ECU",
"East Carolina"
)
# This `tribble()` is more complex than it may appear, it actually
# creates a list column!
standardized <- tribble(
~from, ~to,
c("UNC", "Chapel Hill"), "UNC",
c("Duke", "Duke University"), "Duke",
c("NC State"), "NC State",
c("ECU", "East Carolina"), "ECU",
NA, NA
)
standardized
#> # A tibble: 5 × 2
#> from to
#> <list> <chr>
#> 1 <chr [2]> UNC
#> 2 <chr [2]> Duke
#> 3 <chr [1]> NC State
#> 4 <chr [2]> ECU
#> 5 <lgl [1]> NA
standardized$from
#> [[1]]
#> [1] "UNC" "Chapel Hill"
#>
#> [[2]]
#> [1] "Duke" "Duke University"
#>
#> [[3]]
#> [1] "NC State"
#>
#> [[4]]
#> [1] "ECU" "East Carolina"
#>
#> [[5]]
#> [1] NA
#>
# `recode_values()` treats a list `from` value as a list of vectors, where
# any match within one of the vectors is mapped to its corresponding `to`
# value
recode_values(
schools,
from = standardized$from,
to = standardized$to,
unmatched = "error"
)
#> [1] "UNC" "UNC" NA "Duke" "Duke" "UNC"
#> [7] "NC State" "ECU" "ECU"
# This formula based approach is equivalent, but the lookup based approach is
# nicer because the lookup table can be defined separately
recode_values(
schools,
c("UNC", "Chapel Hill") ~ "UNC",
c("Duke", "Duke University") ~ "Duke",
c("NC State") ~ "NC State",
c("ECU", "East Carolina") ~ "ECU",
NA ~ NA,
unmatched = "error"
)
#> [1] "UNC" "UNC" NA "Duke" "Duke" "UNC"
#> [7] "NC State" "ECU" "ECU"