# dplyr

## Overview

dplyr is a grammar of data manipulation, providing a consistent set of
verbs that help you solve the most common data manipulation challenges:

- [`mutate()`](https://dplyr.tidyverse.org/reference/mutate.md) adds new
  variables that are functions of existing variables
- [`select()`](https://dplyr.tidyverse.org/reference/select.md) picks
  variables based on their names.
- [`filter()`](https://dplyr.tidyverse.org/reference/filter.md) picks
  cases based on their values.
- [`summarise()`](https://dplyr.tidyverse.org/reference/summarise.md)
  reduces multiple values down to a single summary.
- [`arrange()`](https://dplyr.tidyverse.org/reference/arrange.md)
  changes the ordering of the rows.

These all combine naturally with
[`group_by()`](https://dplyr.tidyverse.org/reference/group_by.md) which
allows you to perform any operation “by group”. You can learn more about
them in
[`vignette("dplyr")`](https://dplyr.tidyverse.org/articles/dplyr.md). As
well as these single-table verbs, dplyr also provides a variety of
two-table verbs, which you can learn about in
[`vignette("two-table")`](https://dplyr.tidyverse.org/articles/two-table.md).

If you are new to dplyr, the best place to start is the [data
transformation chapter](https://r4ds.hadley.nz/data-transform) in R for
Data Science.

## Backends

In addition to data frames/tibbles, dplyr makes working with other
computational backends accessible and efficient. Below is a list of
alternative backends:

- [arrow](https://arrow.apache.org/docs/r/) for larger-than-memory
  datasets, including on remote cloud storage like AWS S3, using the
  Apache Arrow C++ engine,
  [Acero](https://arrow.apache.org/docs/cpp/acero/overview.html).

- [dbplyr](https://dbplyr.tidyverse.org/) for data stored in a
  relational database. Translates your dplyr code to SQL.

- [dtplyr](https://dtplyr.tidyverse.org/) for large, in-memory datasets.
  Translates your dplyr code to high performance
  [data.table](https://rdatatable.gitlab.io/data.table/) code.

- [duckplyr](https://duckplyr.tidyverse.org/) for large, in-memory
  datasets. Translates your dplyr code to high performance
  [duckdb](https://duckdb.org) queries with zero extra copies and an
  automatic R fallback when translation isn’t possible.

- [sparklyr](https://spark.posit.co/) for very large datasets stored in
  [Apache Spark](https://spark.apache.org).

## Installation

``` r
# The easiest way to get dplyr is to install the whole tidyverse:
install.packages("tidyverse")

# Alternatively, install just dplyr:
install.packages("dplyr")
```

### Development version

To get a bug fix or to use a feature from the development version, you
can install the development version of dplyr from GitHub.

``` r
# install.packages("pak")
pak::pak("tidyverse/dplyr")
```

## Cheat Sheet

[![](https://raw.githubusercontent.com/rstudio/cheatsheets/main/pngs/thumbnails/data-transformation-cheatsheet-thumbs.png)](https://github.com/rstudio/cheatsheets/blob/main/data-transformation.pdf)

## Usage

``` r
library(dplyr)

starwars |>
  filter(species == "Droid")
#> # A tibble: 6 × 14
#>   name   height  mass hair_color skin_color  eye_color birth_year sex   gender  
#>   <chr>   <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr> <chr>   
#> 1 C-3PO     167    75 <NA>       gold        yellow           112 none  masculi…
#> 2 R2-D2      96    32 <NA>       white, blue red               33 none  masculi…
#> 3 R5-D4      97    32 <NA>       white, red  red               NA none  masculi…
#> 4 IG-88     200   140 none       metal       red               15 none  masculi…
#> 5 R4-P17     96    NA none       silver, red red, blue         NA none  feminine
#> # ℹ 1 more row
#> # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

starwars |>
  select(name, ends_with("color"))
#> # A tibble: 87 × 4
#>   name           hair_color skin_color  eye_color
#>   <chr>          <chr>      <chr>       <chr>    
#> 1 Luke Skywalker blond      fair        blue     
#> 2 C-3PO          <NA>       gold        yellow   
#> 3 R2-D2          <NA>       white, blue red      
#> 4 Darth Vader    none       white       yellow   
#> 5 Leia Organa    brown      light       brown    
#> # ℹ 82 more rows

starwars |>
  mutate(name, bmi = mass / ((height / 100)^2)) |>
  select(name:mass, bmi)
#> # A tibble: 87 × 4
#>   name           height  mass   bmi
#>   <chr>           <int> <dbl> <dbl>
#> 1 Luke Skywalker    172    77  26.0
#> 2 C-3PO             167    75  26.9
#> 3 R2-D2              96    32  34.7
#> 4 Darth Vader       202   136  33.3
#> 5 Leia Organa       150    49  21.8
#> # ℹ 82 more rows

starwars |>
  arrange(desc(mass))
#> # A tibble: 87 × 14
#>   name      height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Jabba De…    175  1358 <NA>       green-tan… orange         600   herm… mascu…
#> 2 Grievous     216   159 none       brown, wh… green, y…       NA   male  mascu…
#> 3 IG-88        200   140 none       metal      red             15   none  mascu…
#> 4 Darth Va…    202   136 none       white      yellow          41.9 male  mascu…
#> 5 Tarfful      234   136 brown      brown      blue            NA   male  mascu…
#> # ℹ 82 more rows
#> # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

starwars |>
  group_by(species) |>
  summarise(
    n = n(),
    mass = mean(mass, na.rm = TRUE)
  ) |>
  filter(
    n > 1,
    mass > 50
  )
#> # A tibble: 9 × 3
#>   species      n  mass
#>   <chr>    <int> <dbl>
#> 1 Droid        6  69.8
#> 2 Gungan       3  74  
#> 3 Human       35  81.3
#> 4 Kaminoan     2  88  
#> 5 Mirialan     2  53.1
#> # ℹ 4 more rows
```

## Getting help

If you encounter a clear bug, please file an issue with a minimal
reproducible example on
[GitHub](https://github.com/tidyverse/dplyr/issues). For questions and
other discussion, please use [forum.posit.co](https://forum.posit.co/).

## Code of conduct

Please note that this project is released with a [Contributor Code of
Conduct](https://dplyr.tidyverse.org/CODE_OF_CONDUCT). By participating
in this project you agree to abide by its terms.

# Package index

## Data frame verbs

### Rows

Verbs that principally operate on rows.

- [`arrange()`](https://dplyr.tidyverse.org/reference/arrange.md) :
  Order rows using column values
- [`distinct()`](https://dplyr.tidyverse.org/reference/distinct.md) :
  Keep distinct/unique rows
- [`filter()`](https://dplyr.tidyverse.org/reference/filter.md)
  [`filter_out()`](https://dplyr.tidyverse.org/reference/filter.md) :
  Keep or drop rows that match a condition
- [`slice()`](https://dplyr.tidyverse.org/reference/slice.md)
  [`slice_head()`](https://dplyr.tidyverse.org/reference/slice.md)
  [`slice_tail()`](https://dplyr.tidyverse.org/reference/slice.md)
  [`slice_min()`](https://dplyr.tidyverse.org/reference/slice.md)
  [`slice_max()`](https://dplyr.tidyverse.org/reference/slice.md)
  [`slice_sample()`](https://dplyr.tidyverse.org/reference/slice.md) :
  Subset rows using their positions

### Columns

Verbs that principally operate on columns.

- [`glimpse`](https://dplyr.tidyverse.org/reference/glimpse.md) : Get a
  glimpse of your data
- [`mutate()`](https://dplyr.tidyverse.org/reference/mutate.md) :
  Create, modify, and delete columns
- [`pull()`](https://dplyr.tidyverse.org/reference/pull.md) : Extract a
  single column
- [`relocate()`](https://dplyr.tidyverse.org/reference/relocate.md) :
  Change column order
- [`rename()`](https://dplyr.tidyverse.org/reference/rename.md)
  [`rename_with()`](https://dplyr.tidyverse.org/reference/rename.md) :
  Rename columns
- [`select()`](https://dplyr.tidyverse.org/reference/select.md) : Keep
  or drop columns using their names and types

### Groups

Verbs that principally operate on groups of rows.

- [`count()`](https://dplyr.tidyverse.org/reference/count.md)
  [`tally()`](https://dplyr.tidyverse.org/reference/count.md)
  [`add_count()`](https://dplyr.tidyverse.org/reference/count.md)
  [`add_tally()`](https://dplyr.tidyverse.org/reference/count.md) :
  Count the observations in each group

- [`group_by()`](https://dplyr.tidyverse.org/reference/group_by.md)
  [`ungroup()`](https://dplyr.tidyverse.org/reference/group_by.md) :
  Group by one or more variables

- [`dplyr_by`](https://dplyr.tidyverse.org/reference/dplyr_by.md) :

  Per-operation grouping with `.by`/`by`

- [`rowwise()`](https://dplyr.tidyverse.org/reference/rowwise.md) :
  Group input by rows

- [`summarise()`](https://dplyr.tidyverse.org/reference/summarise.md)
  [`summarize()`](https://dplyr.tidyverse.org/reference/summarise.md) :
  Summarise each group down to one row

- [`reframe()`](https://dplyr.tidyverse.org/reference/reframe.md) :
  Transform each group to an arbitrary number of rows

- [`n()`](https://dplyr.tidyverse.org/reference/context.md)
  [`cur_group()`](https://dplyr.tidyverse.org/reference/context.md)
  [`cur_group_id()`](https://dplyr.tidyverse.org/reference/context.md)
  [`cur_group_rows()`](https://dplyr.tidyverse.org/reference/context.md)
  [`cur_column()`](https://dplyr.tidyverse.org/reference/context.md) :
  Information about the "current" group or variable

### Data frames

Verbs that principally operate on pairs of data frames.

- [`bind_cols()`](https://dplyr.tidyverse.org/reference/bind_cols.md) :
  Bind multiple data frames by column
- [`bind_rows()`](https://dplyr.tidyverse.org/reference/bind_rows.md) :
  Bind multiple data frames by row
- [`intersect()`](https://dplyr.tidyverse.org/reference/setops.md)
  [`union()`](https://dplyr.tidyverse.org/reference/setops.md)
  [`union_all()`](https://dplyr.tidyverse.org/reference/setops.md)
  [`setdiff()`](https://dplyr.tidyverse.org/reference/setops.md)
  [`setequal()`](https://dplyr.tidyverse.org/reference/setops.md)
  [`symdiff()`](https://dplyr.tidyverse.org/reference/setops.md) : Set
  operations
- [`inner_join()`](https://dplyr.tidyverse.org/reference/mutate-joins.md)
  [`left_join()`](https://dplyr.tidyverse.org/reference/mutate-joins.md)
  [`right_join()`](https://dplyr.tidyverse.org/reference/mutate-joins.md)
  [`full_join()`](https://dplyr.tidyverse.org/reference/mutate-joins.md)
  : Mutating joins
- [`nest_join()`](https://dplyr.tidyverse.org/reference/nest_join.md) :
  Nest join
- [`semi_join()`](https://dplyr.tidyverse.org/reference/filter-joins.md)
  [`anti_join()`](https://dplyr.tidyverse.org/reference/filter-joins.md)
  : Filtering joins
- [`cross_join()`](https://dplyr.tidyverse.org/reference/cross_join.md)
  : Cross join
- [`join_by()`](https://dplyr.tidyverse.org/reference/join_by.md) : Join
  specifications
- [`rows_insert()`](https://dplyr.tidyverse.org/reference/rows.md)
  [`rows_append()`](https://dplyr.tidyverse.org/reference/rows.md)
  [`rows_update()`](https://dplyr.tidyverse.org/reference/rows.md)
  [`rows_patch()`](https://dplyr.tidyverse.org/reference/rows.md)
  [`rows_upsert()`](https://dplyr.tidyverse.org/reference/rows.md)
  [`rows_delete()`](https://dplyr.tidyverse.org/reference/rows.md) :
  Manipulate individual rows

### Multiple columns

Pair these functions with
[`mutate()`](https://dplyr.tidyverse.org/reference/mutate.md),
[`summarise()`](https://dplyr.tidyverse.org/reference/summarise.md),
[`filter()`](https://dplyr.tidyverse.org/reference/filter.md), and
[`group_by()`](https://dplyr.tidyverse.org/reference/group_by.md) to
operate on multiple columns simultaneously.

- [`across()`](https://dplyr.tidyverse.org/reference/across.md)
  [`if_any()`](https://dplyr.tidyverse.org/reference/across.md)
  [`if_all()`](https://dplyr.tidyverse.org/reference/across.md) : Apply
  a function (or functions) across multiple columns
- [`c_across()`](https://dplyr.tidyverse.org/reference/c_across.md) :
  Combine values from multiple columns
- [`pick()`](https://dplyr.tidyverse.org/reference/pick.md) : Select a
  subset of columns

## Vector functions

Unlike other dplyr functions, these functions work on individual
vectors, not data frames.

- [`between()`](https://dplyr.tidyverse.org/reference/between.md) :
  Detect where values fall in a specified range

- [`case_when()`](https://dplyr.tidyverse.org/reference/case-and-replace-when.md)
  [`replace_when()`](https://dplyr.tidyverse.org/reference/case-and-replace-when.md)
  : A general vectorised if-else

- [`coalesce()`](https://dplyr.tidyverse.org/reference/coalesce.md) :
  Find the first non-missing element

- [`consecutive_id()`](https://dplyr.tidyverse.org/reference/consecutive_id.md)
  : Generate a unique identifier for consecutive combinations

- [`cumall()`](https://dplyr.tidyverse.org/reference/cumall.md)
  [`cumany()`](https://dplyr.tidyverse.org/reference/cumall.md)
  [`cummean()`](https://dplyr.tidyverse.org/reference/cumall.md) :
  Cumulative versions of any, all, and mean

- [`desc()`](https://dplyr.tidyverse.org/reference/desc.md) : Descending
  order

- [`if_else()`](https://dplyr.tidyverse.org/reference/if_else.md) :
  Vectorised if-else

- [`lag()`](https://dplyr.tidyverse.org/reference/lead-lag.md)
  [`lead()`](https://dplyr.tidyverse.org/reference/lead-lag.md) :
  Compute lagged or leading values

- [`n_distinct()`](https://dplyr.tidyverse.org/reference/n_distinct.md)
  : Count unique combinations

- [`na_if()`](https://dplyr.tidyverse.org/reference/na_if.md) :

  Convert values to `NA`

- [`near()`](https://dplyr.tidyverse.org/reference/near.md) : Compare
  two numeric vectors

- [`nth()`](https://dplyr.tidyverse.org/reference/nth.md)
  [`first()`](https://dplyr.tidyverse.org/reference/nth.md)
  [`last()`](https://dplyr.tidyverse.org/reference/nth.md) : Extract the
  first, last, or nth value from a vector

- [`ntile()`](https://dplyr.tidyverse.org/reference/ntile.md) :

  Bucket a numeric vector into `n` groups

- [`order_by()`](https://dplyr.tidyverse.org/reference/order_by.md) : A
  helper function for ordering window function output

- [`percent_rank()`](https://dplyr.tidyverse.org/reference/percent_rank.md)
  [`cume_dist()`](https://dplyr.tidyverse.org/reference/percent_rank.md)
  : Proportional ranking functions

- [`recode_values()`](https://dplyr.tidyverse.org/reference/recode-and-replace-values.md)
  [`replace_values()`](https://dplyr.tidyverse.org/reference/recode-and-replace-values.md)
  : Recode and replace values

- [`row_number()`](https://dplyr.tidyverse.org/reference/row_number.md)
  [`min_rank()`](https://dplyr.tidyverse.org/reference/row_number.md)
  [`dense_rank()`](https://dplyr.tidyverse.org/reference/row_number.md)
  : Integer ranking functions

- [`when_any()`](https://dplyr.tidyverse.org/reference/when-any-all.md)
  [`when_all()`](https://dplyr.tidyverse.org/reference/when-any-all.md)
  :

  Elementwise [`any()`](https://rdrr.io/r/base/any.html) and
  [`all()`](https://rdrr.io/r/base/all.html)

## Built in datasets

- [`band_members`](https://dplyr.tidyverse.org/reference/band_members.md)
  [`band_instruments`](https://dplyr.tidyverse.org/reference/band_members.md)
  [`band_instruments2`](https://dplyr.tidyverse.org/reference/band_members.md)
  : Band membership
- [`starwars`](https://dplyr.tidyverse.org/reference/starwars.md) :
  Starwars characters
- [`storms`](https://dplyr.tidyverse.org/reference/storms.md) : Storm
  tracks data

## Grouping helpers

This (mostly) experimental family of functions are used to manipulate
groups in various ways.

- [`group_cols()`](https://dplyr.tidyverse.org/reference/group_cols.md)
  : Select grouping variables
- [`group_map()`](https://dplyr.tidyverse.org/reference/group_map.md)
  [`group_modify()`](https://dplyr.tidyverse.org/reference/group_map.md)
  [`group_walk()`](https://dplyr.tidyverse.org/reference/group_map.md)
  **\[experimental\]** : Apply a function to each group
- [`group_trim()`](https://dplyr.tidyverse.org/reference/group_trim.md)
  **\[experimental\]** : Trim grouping structure

## Superseded

Superseded functions have been replaced by new approaches that we
believe to be superior, but we don’t want to force you to change until
you’re ready, so the existing functions will stay around for several
years.

- [`all_vars()`](https://dplyr.tidyverse.org/reference/all_vars.md)
  [`any_vars()`](https://dplyr.tidyverse.org/reference/all_vars.md)
  **\[superseded\]** : Apply predicate to all variables
- [`recode()`](https://dplyr.tidyverse.org/reference/recode.md)
  [`recode_factor()`](https://dplyr.tidyverse.org/reference/recode.md)
  **\[superseded\]** : Recode values
- [`sample_n()`](https://dplyr.tidyverse.org/reference/sample_n.md)
  [`sample_frac()`](https://dplyr.tidyverse.org/reference/sample_n.md)
  **\[superseded\]** : Sample n rows from a table
- [`scoped`](https://dplyr.tidyverse.org/reference/scoped.md)
  **\[superseded\]** : Operate on a selection of variables
- [`top_n()`](https://dplyr.tidyverse.org/reference/top_n.md)
  [`top_frac()`](https://dplyr.tidyverse.org/reference/top_n.md)
  **\[superseded\]** : Select top (or bottom) n rows (by value)
- [`vars()`](https://dplyr.tidyverse.org/reference/vars.md)
  **\[superseded\]** : Select variables
- [`with_groups()`](https://dplyr.tidyverse.org/reference/with_groups.md)
  **\[superseded\]** : Perform an operation with temporary groups

## Remote tables

- [`auto_copy()`](https://dplyr.tidyverse.org/reference/auto_copy.md) :
  Copy tables to same source, if necessary
- [`compute()`](https://dplyr.tidyverse.org/reference/compute.md)
  [`collect()`](https://dplyr.tidyverse.org/reference/compute.md)
  [`collapse()`](https://dplyr.tidyverse.org/reference/compute.md) :
  Force computation of a database query
- [`copy_to()`](https://dplyr.tidyverse.org/reference/copy_to.md) : Copy
  a local data frame to a remote src
- [`ident()`](https://dplyr.tidyverse.org/reference/ident.md) : Flag a
  character vector as SQL identifiers
- [`explain()`](https://dplyr.tidyverse.org/reference/explain.md)
  [`show_query()`](https://dplyr.tidyverse.org/reference/explain.md) :
  Explain details of a tbl
- [`tbl()`](https://dplyr.tidyverse.org/reference/tbl.md)
  [`is.tbl()`](https://dplyr.tidyverse.org/reference/tbl.md) : Create a
  table from a data source
- [`sql()`](https://dplyr.tidyverse.org/reference/sql.md) : SQL
  escaping.

# Articles

### Get started

- [Introduction to
  dplyr](https://dplyr.tidyverse.org/articles/dplyr.md):

  Start here if this is your first time using dplyr. You’ll learn the
  basic philosophy, the most important data manipulation verbs, and the
  pipe, `|>`, which allows you to combine multiple verbs together to
  solve real problems.

- [Grouped data](https://dplyr.tidyverse.org/articles/grouping.md):

  To unlock the full potential of dplyr, you need to understand how each
  verb interacts with grouping. This vignette shows you how to
  manipulate grouping, how each verb changes its behaviour when working
  with grouped data, and how you can access data about the “current”
  group from within a verb.

- [Two-table verbs](https://dplyr.tidyverse.org/articles/two-table.md):

  Most dplyr verbs work with a single data set, but most data analyses
  involve multiple datasets. This vignette introduces you to the dplyr
  verbs that work with more one than data set, and introduces to the
  mutating joins, filtering joins, and the set operations.

- [Recoding columns and replacing
  values](https://dplyr.tidyverse.org/articles/recoding-replacing.md):

- [dplyr \<-\> base R](https://dplyr.tidyverse.org/articles/base.md):

  How does dplyr compare to base R? This vignette describes the main
  differences in philosophy, and shows the base R code most closely
  equivalent to each dplyr verb.

### Automate

- [Column-wise
  operations](https://dplyr.tidyverse.org/articles/colwise.md):

  Learn how to easily repeat the same operation across multiple columns
  using [`across()`](https://dplyr.tidyverse.org/reference/across.md).

- [Row-wise
  operations](https://dplyr.tidyverse.org/articles/rowwise.md):

  In R, it’s usually easier to do something for each column than for
  each row. In this vignette you will learn how to use the
  [`rowwise()`](https://dplyr.tidyverse.org/reference/rowwise.md)
  function to perform operations by row. Along the way, you’ll learn
  about list-columns, and see how you might perform simulations and
  modelling within dplyr verbs.

- [Programming with
  dplyr](https://dplyr.tidyverse.org/articles/programming.md):

  Most dplyr verbs use “tidy evaluation”, a special type of non-standard
  evaluation. In this vignette, you’ll learn the two basic forms, data
  masking and tidy selection, and how you can program with them using
  either functions or for loops.

### Other

- [Window
  functions](https://dplyr.tidyverse.org/articles/window-functions.md):

  Window functions are a useful family of functions that work with
  vectors (returning an output the same size as the input), and combine
  naturally with
  [`mutate()`](https://dplyr.tidyverse.org/reference/mutate.md) and
  [`filter()`](https://dplyr.tidyverse.org/reference/filter.md).

- [Using dplyr in
  packages](https://dplyr.tidyverse.org/articles/in-packages.md):

  A guide for package authors who use dplyr.