It’s often useful to perform the same operation on multiple columns, but copying and pasting is both tedious and error prone.
(If you’re trying to compute mean(a, b, c, d) for each row, instead see vignette("rowwise").)
This vignette will introduce you to the across() function, which lets you express the same operation on many columns far more succinctly.
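To make the contrast concrete, here is a minimal sketch. The data frame df, its grouping columns g1 and g2, and its value columns a to d are hypothetical and chosen only for illustration.

```r
library(dplyr)

# Hypothetical data: two grouping columns and four numeric columns.
df <- tibble(
  g1 = c("a", "a", "b", "b"),
  g2 = c("x", "x", "y", "y"),
  a = c(1, 3, 5, 7),
  b = c(2, 4, 6, 8),
  c = c(10, 20, 30, 40),
  d = c(0, 0, 1, 1)
)

# Copy-and-paste version: one mean() call per column.
by_hand <- df %>%
  group_by(g1, g2) %>%
  summarise(a = mean(a), b = mean(b), c = mean(c), d = mean(d), .groups = "drop")

# across() version: a single call covers columns a through d.
with_across <- df %>%
  group_by(g1, g2) %>%
  summarise(across(a:d, mean), .groups = "drop")
```

Both pipelines should produce the same summary; the second one has no repeated column names to mistype.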
We’ll start by discussing the basic usage of across(), particularly as it applies to summarise(), and show how to use it with multiple functions. We’ll then show a few uses with other verbs. We’ll finish off with a bit of history, showing why we prefer across() to our last approach (the _if(), _at() and _all() functions) and how to translate your old code to the new syntax.
Basic usage
across() has two primary arguments:
The first argument, .cols, selects the columns you want to operate on. It uses tidy selection (like select()) so you can pick variables by position, name, and type.
The second argument, .fns, is a function or list of functions to apply to each column. This can also be a purrr style formula (or list of formulas) like ~ .x / 2. (This argument is optional, and you can omit it if you just want to get the underlying data; you’ll see that technique used in vignette("rowwise").)
Here are a couple of examples of across() in conjunction with its favourite verb, summarise(). But you can use across() with any dplyr verb, as you’ll see a little later.
starwars %>%
  summarise(across(where(is.character), n_distinct))
#> # A tibble: 1 × 8
#> name hair_color skin_color eye_color sex gender homeworld species
#> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 87 12 31 15 5 3 49 38
starwars %>%
  group_by(species) %>%
  filter(n() > 1) %>%
  summarise(across(c(sex, gender, homeworld), n_distinct))
#> # A tibble: 9 × 4
#> species sex gender homeworld
#> <chr> <int> <int> <int>
#> 1 Droid 1 2 3
#> 2 Gungan 1 1 1
#> 3 Human 2 2 15
#> 4 Kaminoan 2 2 1
#> # ℹ 5 more rows
starwars %>%
  group_by(homeworld) %>%
  filter(n() > 1) %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
#> # A tibble: 10 × 4
#> homeworld height mass birth_year
#> <chr> <dbl> <dbl> <dbl>
#> 1 Alderaan 176. 64 43
#> 2 Corellia 175 78.5 25
#> 3 Coruscant 174. 50 91
#> 4 Kamino 208. 83.1 31.5
#> # ℹ 6 more rows
Because across() is usually used in combination with summarise() and mutate(), it doesn’t select grouping variables in order to avoid accidentally modifying them:
df <- data.frame(g = c(1, 1, 2), x = c(-1, 1, 3), y = c(-1, -4, -9))
df %>%
  group_by(g) %>%
  summarise(across(where(is.numeric), sum))
#> # A tibble: 2 × 3
#> g x y
#> <dbl> <dbl> <dbl>
#> 1 1 0 -5
#> 2 2 3 -9
Multiple functions
You can transform each variable with more than one function by supplying a named list of functions or lambda functions in the second argument:
min_max <- list(
  min = ~min(.x, na.rm = TRUE),
  max = ~max(.x, na.rm = TRUE)
)
starwars %>% summarise(across(where(is.numeric), min_max))
#> # A tibble: 1 × 6
#> height_min height_max mass_min mass_max birth_year_min birth_year_max
#> <int> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 66 264 15 1358 8 896
starwars %>% summarise(across(c(height, mass, birth_year), min_max))
#> # A tibble: 1 × 6
#> height_min height_max mass_min mass_max birth_year_min birth_year_max
#> <int> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 66 264 15 1358 8 896
Control how the names are created with the .names argument, which takes a glue spec:
starwars %>% summarise(across(where(is.numeric), min_max, .names = "{.fn}.{.col}"))
#> # A tibble: 1 × 6
#> min.height max.height min.mass max.mass min.birth_year max.birth_year
#> <int> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 66 264 15 1358 8 896
starwars %>% summarise(across(c(height, mass, birth_year), min_max, .names = "{.fn}.{.col}"))
#> # A tibble: 1 × 6
#> min.height max.height min.mass max.mass min.birth_year max.birth_year
#> <int> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 66 264 15 1358 8 896
If you’d prefer all summaries with the same function to be grouped together, you’ll have to expand the calls yourself:
starwars %>% summarise(
  across(c(height, mass, birth_year), ~min(.x, na.rm = TRUE), .names = "min_{.col}"),
  across(c(height, mass, birth_year), ~max(.x, na.rm = TRUE), .names = "max_{.col}")
)
#> # A tibble: 1 × 6
#> min_height min_mass min_birth_year max_height max_mass max_birth_year
#> <int> <dbl> <dbl> <int> <dbl> <dbl>
#> 1 66 15 8 264 1358 896
(One day this might become an argument to across() but we’re not yet sure how it would work.)
We cannot however use where(is.numeric) in that last case because the second across() would pick up the variables that were newly created (“min_height”, “min_mass” and “min_birth_year”).
We can work around this by combining both calls to across() into a single expression that returns a tibble:
starwars %>% summarise(
  tibble(
    across(where(is.numeric), ~min(.x, na.rm = TRUE), .names = "min_{.col}"),
    across(where(is.numeric), ~max(.x, na.rm = TRUE), .names = "max_{.col}")
  )
)
#> # A tibble: 1 × 6
#> min_height min_mass min_birth_year max_height max_mass max_birth_year
#> <int> <dbl> <dbl> <int> <dbl> <dbl>
#> 1 66 15 8 264 1358 896
Alternatively we could reorganize results with relocate().
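A minimal sketch of the reordering approach, assuming the min_max list and the "{.fn}.{.col}" naming pattern used earlier; relocate() moves the selected columns to the front and leaves the rest in their existing order.

```r
library(dplyr)

min_max <- list(
  min = ~ min(.x, na.rm = TRUE),
  max = ~ max(.x, na.rm = TRUE)
)

# Compute every min/max pair, then move all "min" columns to the front.
res <- starwars %>%
  summarise(across(where(is.numeric), min_max, .names = "{.fn}.{.col}")) %>%
  relocate(starts_with("min"))

names(res)
```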
Current column
If you need to, you can access the name of the “current” column inside across() by calling cur_column(). This can be useful if you want to perform some sort of context dependent transformation that’s already encoded in a vector.
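As an illustration, the sketch below multiplies each column by a factor looked up by column name. The data frame df and the list mult are hypothetical names introduced only for this example.

```r
library(dplyr)

df <- tibble(x = 1:3, y = 3:5, z = 5:7)

# Hypothetical per-column multipliers, keyed by column name.
mult <- list(x = 1, y = 10, z = 100)

# cur_column() returns the name of the column currently being processed,
# so each column is scaled by its own entry in mult.
res <- df %>%
  mutate(across(all_of(names(mult)), ~ .x * mult[[cur_column()]]))
```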
Be careful when combining numeric summaries with where(is.numeric):
df <- data.frame(x = c(1, 2, 3), y = c(1, 4, 9))
df %>%
  summarise(n = n(), across(where(is.numeric), sd))
#> n x y
#> 1 NA 1 4.041452
Here n becomes NA because n is numeric, so the across() computes its standard deviation, and the standard deviation of 3 (a constant) is NA. You probably want to compute n() last to avoid this problem.
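A minimal sketch of that reordering, using the same illustrative df as above: because across() is evaluated first, the column n does not exist yet when the numeric columns are selected.

```r
library(dplyr)

df <- data.frame(x = c(1, 2, 3), y = c(1, 4, 9))

# across() runs before n is created, so only x and y are summarised.
res <- df %>%
  summarise(across(where(is.numeric), sd), n = n())
```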
Alternatively, you could explicitly exclude n from the columns to operate on.
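One way to sketch the exclusion, again with the illustrative df: the selection combines where(is.numeric) with a negated column name, so n is computed first but left out of the across() call.

```r
library(dplyr)

df <- data.frame(x = c(1, 2, 3), y = c(1, 4, 9))

# Select numeric columns, but drop n from the selection.
res <- df %>%
  summarise(n = n(), across(where(is.numeric) & !n, sd))
```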
Another approach is to combine both the call to n() and across() in a single expression that returns a tibble.
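A sketch of the single-expression approach with the same illustrative df: inside tibble(), across() still sees only the original columns, and the unnamed tibble is unpacked into separate output columns.

```r
library(dplyr)

df <- data.frame(x = c(1, 2, 3), y = c(1, 4, 9))

# n and the across() results are built together, so across() never sees n.
res <- df %>%
  summarise(tibble(n = n(), across(where(is.numeric), sd)))
```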
Other verbs
So far we’ve focused on the use of across() with summarise(), but it works with any other dplyr verb that uses data masking:
Rescale all numeric variables to range 0-1:
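A minimal sketch of such a rescaling with mutate(). The helper rescale01() and the example data are assumptions made for illustration, not functions provided by dplyr.

```r
library(dplyr)

# Hypothetical helper: linearly map a numeric vector onto [0, 1].
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

df <- tibble(x = 1:4, y = c(-2, 0, 1, 6))

# Apply the helper to every numeric column.
res <- df %>% mutate(across(where(is.numeric), rescale01))
```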
For some verbs, like group_by(), count() and distinct(), you don’t need to supply a summary function, but it can be useful to use tidy-selection to dynamically select a set of columns. In those cases, we recommend using the complement to across(), pick(), which works like across() but doesn’t apply any functions and instead returns a data frame containing the selected columns.
Find all distinct
Count all combinations of variables with a given pattern:
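Both uses can be sketched with pick(), which assumes dplyr 1.1.0 or later. The pattern "color" is an illustrative choice; any tidy-select helper would work in its place.

```r
library(dplyr)

# Distinct combinations of the columns whose names contain "color".
colors <- starwars %>% distinct(pick(contains("color")))

# Count each combination of those columns, most frequent first.
color_counts <- starwars %>% count(pick(contains("color")), sort = TRUE)
```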
across() doesn’t work with select() or rename() because they already use tidy select syntax; if you want to transform column names with a function, you can use rename_with().
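For instance, a small sketch that upper-cases the names of a few columns; the choice of toupper and of the ends_with("color") selection is purely illustrative.

```r
library(dplyr)

# Apply a function to the names (not the values) of the selected columns.
res <- starwars %>% rename_with(toupper, .cols = ends_with("color"))
```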
We cannot directly use across() in filter() because we need an extra step to combine the results. To that end, filter() has two special purpose companion functions:
if_any() keeps the rows where the predicate is true for at least one selected column:
starwars %>%
  filter(if_any(everything(), ~ !is.na(.x)))
#> # A tibble: 87 × 14
#> name height mass hair_color skin_color eye_color birth_year sex
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Luke S… 172 77 blond fair blue 19 male
#> 2 C-3PO 167 75 NA gold yellow 112 none
#> 3 R2-D2 96 32 NA white, bl… red 33 none
#> 4 Darth … 202 136 none white yellow 41.9 male
#> # ℹ 83 more rows
#> # ℹ 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
#> # films <list>, vehicles <list>, starships <list>
if_all() keeps the rows where the predicate is true for all selected columns:
starwars %>%
  filter(if_all(everything(), ~ !is.na(.x)))
#> # A tibble: 29 × 14
#> name height mass hair_color skin_color eye_color birth_year sex
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Luke S… 172 77 blond fair blue 19 male
#> 2 Darth … 202 136 none white yellow 41.9 male
#> 3 Leia O… 150 49 brown light brown 19 fema…
#> 4 Owen L… 178 120 brown, gr… light blue 52 male
#> # ℹ 25 more rows
#> # ℹ 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
#> # films <list>, vehicles <list>, starships <list>
_if, _at, _all
Prior versions of dplyr allowed you to apply a function to multiple columns in a different way: using functions with _if, _at, and _all() suffixes. These functions solved a pressing need and are used by many people, but are now superseded. That means that they’ll stay around, but won’t receive any new features and will only get critical bug fixes.
Why do we like across()?
Why did we decide to move away from these functions in favour of across()?
across() makes it possible to express useful summaries that were previously impossible, such as applying different functions to different types of columns alongside other summaries in a single summarise() call.
across() reduces the number of functions that dplyr needs to provide. This makes dplyr easier for you to use (because there are fewer functions to remember) and easier for us to implement new verbs (since we only need to implement one function, not four).
across() unifies _if and _at semantics so that you can select by position, name, and type, and you can now create compound selections that were previously impossible. For example, you can now transform all numeric columns whose name begins with “x”: across(where(is.numeric) & starts_with("x")).
across() doesn’t need to use vars(). The _at() functions are the only place in dplyr where you have to manually quote variable names, which makes them a little weird and hence harder to remember.
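Two of these points can be sketched in code. The data frame below is hypothetical: g is a grouping column, x1, x2 and y are numeric, and f is a factor.

```r
library(dplyr)

df <- tibble(
  g = c("a", "a", "b"),
  x1 = c(1, 2, 3),
  x2 = c(4, 5, 6),
  y = c(7, 8, 9),
  f = factor(c("u", "v", "u"))
)

# Different functions for different column types, plus a count, in one call.
res <- df %>%
  group_by(g) %>%
  summarise(
    across(where(is.numeric), mean),
    across(where(is.factor), nlevels),
    n = n()
  )

# Compound selection: only numeric columns whose names start with "x".
halved <- df %>%
  mutate(across(where(is.numeric) & starts_with("x"), ~ .x / 2))
```

In the second pipeline y is numeric but does not start with "x", so it is left unchanged.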
Why did it take so long to discover across()?
It’s disappointing that we didn’t discover across() earlier, and instead worked through several false starts (first not realising that it was a common problem, then with the _each() functions, and most recently with the _if()/_at()/_all() functions). But across() couldn’t work without three recent discoveries:
You can have a column of a data frame that is itself a data frame. This is something provided by base R, but it’s not very well documented, and it took a while to see that it was useful, not just a theoretical curiosity.
We can use data frames to allow summary functions to return multiple columns.
We can use the absence of an outer name as a convention that you want to unpack a data frame column into individual columns.
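The last two points can be seen in a small sketch; the column names lo, hi and rng are hypothetical. An unnamed data frame result is unpacked into individual columns, while a named one is kept as a single data frame column.

```r
library(dplyr)

df <- tibble(x = c(3, 1, 2))

# No outer name: the returned tibble is unpacked into columns lo and hi.
unpacked <- df %>% summarise(tibble(lo = min(x), hi = max(x)))

# With an outer name: the tibble is stored as one data frame column, rng.
packed <- df %>% summarise(rng = tibble(lo = min(x), hi = max(x)))
```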
How do you convert existing code?
Fortunately, it’s generally straightforward to translate your existing code to use across():
Strip the _if(), _at() and _all() suffix off the function.
Call across(). The first argument will be:
- For _if(), the old second argument wrapped in where().
- For _at(), the old second argument, with the call to vars() removed.
- For _all(), everything().
The subsequent arguments can be copied as is.
For example:
df %>% mutate_if(is.numeric, ~mean(.x, na.rm = TRUE))
# ->
df %>% mutate(across(where(is.numeric), ~mean(.x, na.rm = TRUE)))
df %>% mutate_at(vars(c(x, starts_with("y"))), mean)
# ->
df %>% mutate(across(c(x, starts_with("y")), mean))
df %>% mutate_all(mean)
# ->
df %>% mutate(across(everything(), mean))
There are a few exceptions to this rule:
rename_*() and select_*() follow a different pattern. They already have select semantics, so are generally used in a different way that doesn’t have a direct equivalent with across(); use the new rename_with() instead.
Previously, filter_*() were paired with the all_vars() and any_vars() helpers. The new helpers if_any() and if_all() can be used inside filter() to keep rows for which the predicate is true for at least one, or all selected columns:
df <- tibble(x = c("a", "b"), y = c(1, 1), z = c(-1, 1))
# Find all rows where EVERY numeric variable is greater than zero
df %>% filter(if_all(where(is.numeric), ~ .x > 0))
#> # A tibble: 1 × 3
#> x y z
#> <chr> <dbl> <dbl>
#> 1 b 1 1
# Find all rows where ANY numeric variable is greater than zero
df %>% filter(if_any(where(is.numeric), ~ .x > 0))
#> # A tibble: 2 × 3
#> x y z
#> <chr> <dbl> <dbl>
#> 1 a 1 -1
#> 2 b 1 1
When used in a mutate(), all transformations performed by an across() are applied at once. This is different to the behaviour of mutate_if(), mutate_at(), and mutate_all(), which apply the transformations one at a time. We expect that you’ll generally find the new behaviour less surprising:
df <- tibble(x = 2, y = 4, z = 8)
df %>% mutate_all(~ .x / y)
#> # A tibble: 1 × 3
#> x y z
#> <dbl> <dbl> <dbl>
#> 1 0.5 1 8
df %>% mutate(across(everything(), ~ .x / y))
#> # A tibble: 1 × 3
#> x y z
#> <dbl> <dbl> <dbl>
#> 1 0.5 1 2