Per-operation grouping with .by/by
There are two ways to group in dplyr:
Persistent grouping with group_by()
Per-operation grouping with .by/by
This help page is dedicated to explaining where and why you might want to use the latter.
Depending on the dplyr verb, the per-operation grouping argument may be named .by or by.
The Supported verbs section below outlines this on a case-by-case basis.
The remainder of this page will refer to .by for simplicity.
Grouping radically affects the computation of the dplyr verb you use it with, and one of the goals of .by is to allow you to place that grouping specification alongside the code that actually uses it.
As an added benefit, with .by you no longer need to remember to ungroup() after summarise(), and summarise() won't ever message you about how it's handling the groups!
This idea comes from data.table, which allows you to specify by alongside modifications in j, like: dt[, .(x = mean(x)), by = g].
Note that some dplyr verbs use by while others use .by.
This is a purely technical difference.
| .by | group_by() |
|---|---|
| Grouping only affects a single verb | Grouping is persistent across multiple verbs |
| Selects variables with tidy-select | Computes expressions with data-masking |
| Summaries use existing order of group keys | Summaries sort group keys in ascending order |
Let's take a look at the two grouping approaches using this expenses data set, which tracks costs accumulated across various ids and regions:
expenses <- tibble(
  id = c(1, 2, 1, 3, 1, 2, 3),
  region = c("A", "A", "A", "B", "B", "A", "A"),
  cost = c(25, 20, 19, 12, 9, 6, 6)
)
expenses
#> # A tibble: 7 x 3
#>      id region  cost
#>   <dbl> <chr>  <dbl>
#> 1     1 A         25
#> 2     2 A         20
#> 3     1 A         19
#> 4     3 B         12
#> 5     1 B          9
#> 6     2 A          6
#> 7     3 A          6
Imagine that you wanted to compute the average cost per region. You'd probably write something like this:
expenses %>%
  group_by(region) %>%
  summarise(cost = mean(cost))
#> # A tibble: 2 x 2
#>   region  cost
#>   <chr>  <dbl>
#> 1 A       15.2
#> 2 B       10.5
Instead, you can now specify the grouping inline within the verb:
expenses %>%
  summarise(cost = mean(cost), .by = region)
#> # A tibble: 2 x 2
#>   region  cost
#>   <chr>  <dbl>
#> 1 A       15.2
#> 2 B       10.5
.by applies to a single operation, meaning that since expenses was an ungrouped data frame, the result after applying .by will also always be an ungrouped data frame, regardless of the number of grouping columns.
expenses %>%
  summarise(cost = mean(cost), .by = c(id, region))
#> # A tibble: 5 x 3
#>      id region  cost
#>   <dbl> <chr>  <dbl>
#> 1     1 A         22
#> 2     2 A         13
#> 3     3 B         12
#> 4     1 B          9
#> 5     3 A          6
Compare that with group_by() %>% summarise(), where summarise() generally peels off 1 layer of grouping by default, typically with a message that it is doing so:
expenses %>%
  group_by(id, region) %>%
  summarise(cost = mean(cost))
#> `summarise()` has grouped output by 'id'. You can override using the `.groups`
#> argument.
#> # A tibble: 5 x 3
#> # Groups:   id [3]
#>      id region  cost
#>   <dbl> <chr>  <dbl>
#> 1     1 A         22
#> 2     1 B          9
#> 3     2 A         13
#> 4     3 A          6
#> 5     3 B         12
Because .by grouping applies to a single operation, you don't need to worry about ungrouping, and it never needs to emit a message to remind you what it is doing with the groups.
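The difference in the returned objects can be checked directly. The following is a minimal sketch, with via_by and via_group_by as illustrative names that do not appear elsewhere on this page; it inspects each result with is_grouped_df() and group_vars():

```r
library(dplyr)

expenses <- tibble(
  id = c(1, 2, 1, 3, 1, 2, 3),
  region = c("A", "A", "A", "B", "B", "A", "A"),
  cost = c(25, 20, 19, 12, 9, 6, 6)
)

# Per-operation grouping: the result carries no grouping at all.
via_by <- expenses %>%
  summarise(cost = mean(cost), .by = c(id, region))

# Persistent grouping: summarise() drops only the last grouping column,
# so the result is still grouped by `id`. The informational message is
# suppressed here to keep the output quiet.
via_group_by <- suppressMessages(
  expenses %>%
    group_by(id, region) %>%
    summarise(cost = mean(cost))
)

is_grouped_df(via_by)       # is the `.by` result grouped?
group_vars(via_group_by)    # which grouping columns remain?
```

With group_by(), the remaining grouping can also be removed by passing .groups = "drop" to summarise(), as the message suggests.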
Note that with .by we specified multiple columns to group by using the tidy-select syntax c(id, region).
If you have a character vector of column names you'd like to group by, you can do so with .by = all_of(my_cols).
It will group by the columns in the order they were provided.
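As a short sketch of the character-vector case, assuming my_cols holds the column names in the desired grouping order (the contents of my_cols and the name by_cols are chosen here purely for illustration):

```r
library(dplyr)

expenses <- tibble(
  id = c(1, 2, 1, 3, 1, 2, 3),
  region = c("A", "A", "A", "B", "B", "A", "A"),
  cost = c(25, 20, 19, 12, 9, 6, 6)
)

# Column names stored as strings, e.g. coming from user input or config.
my_cols <- c("region", "id")

# all_of() turns the character vector into a tidy-select specification.
# The grouping columns are used in the order given in `my_cols`.
by_cols <- expenses %>%
  summarise(cost = mean(cost), .by = all_of(my_cols))

names(by_cols)
```

Because all_of() is strict, a name in my_cols that does not exist in the data frame results in an error rather than being silently ignored.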
To prevent surprising results, you can't use .by on an existing grouped data frame:
expenses %>%
  group_by(id) %>%
  summarise(cost = mean(cost), .by = c(id, region))
#> Error in `summarise()`:
#> ! Can't supply `.by` when `.data` is a grouped data frame.
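If you receive a data frame that is already grouped, one possible way around this error is to drop the existing grouping with ungroup() before using .by. This is a sketch only; the names grouped, attempt, and fixed are illustrative:

```r
library(dplyr)

expenses <- tibble(
  id = c(1, 2, 1, 3, 1, 2, 3),
  region = c("A", "A", "A", "B", "B", "A", "A"),
  cost = c(25, 20, 19, 12, 9, 6, 6)
)

grouped <- expenses %>% group_by(id)

# Supplying `.by` to a grouped data frame is an error; capture the
# condition instead of stopping the script.
attempt <- tryCatch(
  grouped %>% summarise(cost = mean(cost), .by = c(id, region)),
  error = function(e) e
)
inherits(attempt, "error")

# Removing the persistent grouping first makes `.by` usable again.
fixed <- grouped %>%
  ungroup() %>%
  summarise(cost = mean(cost), .by = c(id, region))
```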
So far we've focused on the usage of .by with summarise(), but .by works with a number of other dplyr verbs.
For example, you could append the mean cost per region onto the original data frame as a new column rather than computing a summary:
expenses %>%
  mutate(cost_by_region = mean(cost), .by = region)
#> # A tibble: 7 x 4
#>      id region  cost cost_by_region
#>   <dbl> <chr>  <dbl>          <dbl>
#> 1     1 A         25           15.2
#> 2     2 A         20           15.2
#> 3     1 A         19           15.2
#> 4     3 B         12           10.5
#> 5     1 B          9           10.5
#> 6     2 A          6           15.2
#> 7     3 A          6           15.2
Or you could slice out the maximum cost per combination of id and region:
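One way to write this, as a sketch assuming slice_max() with n = 1 is the intended verb (the slice helpers name the argument by rather than .by, and top_costs is an illustrative name):

```r
library(dplyr)

expenses <- tibble(
  id = c(1, 2, 1, 3, 1, 2, 3),
  region = c("A", "A", "A", "B", "B", "A", "A"),
  cost = c(25, 20, 19, 12, 9, 6, 6)
)

# Keep the single highest-cost row within each (id, region) combination.
# Note the argument is `by`, not `.by`, for the slice_*() helpers.
top_costs <- expenses %>%
  slice_max(cost, n = 1, by = c(id, region))

top_costs
```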
When used with .by, summarise(), reframe(), and slice() all maintain the ordering of the existing data.
This is different from group_by(), which has always sorted the group keys in ascending order.
df <- tibble(
  month = c("jan", "jan", "feb", "feb", "mar"),
  temp = c(20, 25, 18, 20, 40)
)

# Uses ordering by "first appearance" in the original data
df %>%
  summarise(average_temp = mean(temp), .by = month)
#> # A tibble: 3 x 2
#>   month average_temp
#>   <chr>        <dbl>
#> 1 jan           22.5
#> 2 feb           19
#> 3 mar           40

# Sorts in ascending order
df %>%
  group_by(month) %>%
  summarise(average_temp = mean(temp))
#> # A tibble: 3 x 2
#>   month average_temp
#>   <chr>        <dbl>
#> 1 feb           19
#> 2 jan           22.5
#> 3 mar           40
If you need sorted group keys, we recommend that you explicitly use arrange() either before or after the call to summarise(), reframe(), or slice().
This also gives you full access to all of arrange()'s features, such as desc() and the .locale argument.
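A brief sketch of explicit sorting after a .by summary, reusing the df example from above (the names sorted_keys and hottest_first are illustrative):

```r
library(dplyr)

df <- tibble(
  month = c("jan", "jan", "feb", "feb", "mar"),
  temp = c(20, 25, 18, 20, 40)
)

# Sort the group keys in ascending order after summarising.
sorted_keys <- df %>%
  summarise(average_temp = mean(temp), .by = month) %>%
  arrange(month)

# Or sort by the computed summary instead, largest first.
hottest_first <- df %>%
  summarise(average_temp = mean(temp), .by = month) %>%
  arrange(desc(average_temp))
```

Keeping the sort as its own step makes the intended ordering visible in the pipeline instead of leaving it as a side effect of how the data was grouped.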
If a dplyr verb doesn't support .by, then that typically means that the verb isn't inherently affected by grouping.
For example, pull() and rename() don't support .by, because specifying columns to group by would not affect their implementations.
That said, there are a few exceptions where a dplyr verb doesn't support .by, but does have special support for grouped data frames created by group_by().
This is typically because the verbs are required to retain the grouping columns, for example:
select() always retains grouping columns, with a message if any aren't specified in the select() call.
distinct() and count() place unspecified grouping columns at the front of the data frame before computing their results.
arrange() has a .by_group argument to optionally order by grouping columns first.
If group_by() didn't exist, then these verbs would not have special support for grouped data frames.
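To illustrate two of these special cases on a grouped data frame, here is a small sketch using the expenses data from earlier (the names grouped, kept, and ordered are illustrative):

```r
library(dplyr)

expenses <- tibble(
  id = c(1, 2, 1, 3, 1, 2, 3),
  region = c("A", "A", "A", "B", "B", "A", "A"),
  cost = c(25, 20, 19, 12, 9, 6, 6)
)

grouped <- expenses %>% group_by(region)

# select() keeps the grouping column even though only `cost` was requested.
# It normally emits a message about this, which is suppressed here.
kept <- suppressMessages(grouped %>% select(cost))
names(kept)

# arrange() ignores groups by default; `.by_group = TRUE` orders by the
# grouping column first, then by `cost` within each region.
ordered <- grouped %>% arrange(cost, .by_group = TRUE)
ordered$region
```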