# dplyr ## Overview dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges: - [`mutate()`](https://dplyr.tidyverse.org/reference/mutate.md) adds new variables that are functions of existing variables - [`select()`](https://dplyr.tidyverse.org/reference/select.md) picks variables based on their names. - [`filter()`](https://dplyr.tidyverse.org/reference/filter.md) picks cases based on their values. - [`summarise()`](https://dplyr.tidyverse.org/reference/summarise.md) reduces multiple values down to a single summary. - [`arrange()`](https://dplyr.tidyverse.org/reference/arrange.md) changes the ordering of the rows. These all combine naturally with [`group_by()`](https://dplyr.tidyverse.org/reference/group_by.md) which allows you to perform any operation “by group”. You can learn more about them in [`vignette("dplyr")`](https://dplyr.tidyverse.org/articles/dplyr.md). As well as these single-table verbs, dplyr also provides a variety of two-table verbs, which you can learn about in [`vignette("two-table")`](https://dplyr.tidyverse.org/articles/two-table.md). If you are new to dplyr, the best place to start is the [data transformation chapter](https://r4ds.hadley.nz/data-transform) in R for Data Science. ## Backends In addition to data frames/tibbles, dplyr makes working with other computational backends accessible and efficient. Below is a list of alternative backends: - [arrow](https://arrow.apache.org/docs/r/) for larger-than-memory datasets, including on remote cloud storage like AWS S3, using the Apache Arrow C++ engine, [Acero](https://arrow.apache.org/docs/cpp/acero/overview.html). - [dbplyr](https://dbplyr.tidyverse.org/) for data stored in a relational database. Translates your dplyr code to SQL. - [dtplyr](https://dtplyr.tidyverse.org/) for large, in-memory datasets. Translates your dplyr code to high performance [data.table](https://rdatatable.gitlab.io/data.table/) code. - [duckplyr](https://duckplyr.tidyverse.org/) for large, in-memory datasets. Translates your dplyr code to high performance [duckdb](https://duckdb.org) queries with zero extra copies and an automatic R fallback when translation isn’t possible. - [sparklyr](https://spark.posit.co/) for very large datasets stored in [Apache Spark](https://spark.apache.org). ## Installation ``` r # The easiest way to get dplyr is to install the whole tidyverse: install.packages("tidyverse") # Alternatively, install just dplyr: install.packages("dplyr") ``` ### Development version To get a bug fix or to use a feature from the development version, you can install the development version of dplyr from GitHub. ``` r # install.packages("pak") pak::pak("tidyverse/dplyr") ``` ## Cheat Sheet [![](https://raw.githubusercontent.com/rstudio/cheatsheets/main/pngs/thumbnails/data-transformation-cheatsheet-thumbs.png)](https://github.com/rstudio/cheatsheets/blob/main/data-transformation.pdf) ## Usage ``` r library(dplyr) starwars |> filter(species == "Droid") #> # A tibble: 6 × 14 #> name height mass hair_color skin_color eye_color birth_year sex gender #> #> 1 C-3PO 167 75 gold yellow 112 none masculi… #> 2 R2-D2 96 32 white, blue red 33 none masculi… #> 3 R5-D4 97 32 white, red red NA none masculi… #> 4 IG-88 200 140 none metal red 15 none masculi… #> 5 R4-P17 96 NA none silver, red red, blue NA none feminine #> # ℹ 1 more row #> # ℹ 5 more variables: homeworld , species , films , #> # vehicles , starships starwars |> select(name, ends_with("color")) #> # A tibble: 87 × 4 #> name hair_color skin_color eye_color #> #> 1 Luke Skywalker blond fair blue #> 2 C-3PO gold yellow #> 3 R2-D2 white, blue red #> 4 Darth Vader none white yellow #> 5 Leia Organa brown light brown #> # ℹ 82 more rows starwars |> mutate(name, bmi = mass / ((height / 100)^2)) |> select(name:mass, bmi) #> # A tibble: 87 × 4 #> name height mass bmi #> #> 1 Luke Skywalker 172 77 26.0 #> 2 C-3PO 167 75 26.9 #> 3 R2-D2 96 32 34.7 #> 4 Darth Vader 202 136 33.3 #> 5 Leia Organa 150 49 21.8 #> # ℹ 82 more rows starwars |> arrange(desc(mass)) #> # A tibble: 87 × 14 #> name height mass hair_color skin_color eye_color birth_year sex gender #> #> 1 Jabba De… 175 1358 green-tan… orange 600 herm… mascu… #> 2 Grievous 216 159 none brown, wh… green, y… NA male mascu… #> 3 IG-88 200 140 none metal red 15 none mascu… #> 4 Darth Va… 202 136 none white yellow 41.9 male mascu… #> 5 Tarfful 234 136 brown brown blue NA male mascu… #> # ℹ 82 more rows #> # ℹ 5 more variables: homeworld , species , films , #> # vehicles , starships starwars |> group_by(species) |> summarise( n = n(), mass = mean(mass, na.rm = TRUE) ) |> filter( n > 1, mass > 50 ) #> # A tibble: 9 × 3 #> species n mass #> #> 1 Droid 6 69.8 #> 2 Gungan 3 74 #> 3 Human 35 81.3 #> 4 Kaminoan 2 88 #> 5 Mirialan 2 53.1 #> # ℹ 4 more rows ``` ## Getting help If you encounter a clear bug, please file an issue with a minimal reproducible example on [GitHub](https://github.com/tidyverse/dplyr/issues). For questions and other discussion, please use [forum.posit.co](https://forum.posit.co/). ## Code of conduct Please note that this project is released with a [Contributor Code of Conduct](https://dplyr.tidyverse.org/CODE_OF_CONDUCT). By participating in this project you agree to abide by its terms. # Package index ## Data frame verbs ### Rows Verbs that principally operate on rows. - [`arrange()`](https://dplyr.tidyverse.org/reference/arrange.md) : Order rows using column values - [`distinct()`](https://dplyr.tidyverse.org/reference/distinct.md) : Keep distinct/unique rows - [`filter()`](https://dplyr.tidyverse.org/reference/filter.md) [`filter_out()`](https://dplyr.tidyverse.org/reference/filter.md) : Keep or drop rows that match a condition - [`slice()`](https://dplyr.tidyverse.org/reference/slice.md) [`slice_head()`](https://dplyr.tidyverse.org/reference/slice.md) [`slice_tail()`](https://dplyr.tidyverse.org/reference/slice.md) [`slice_min()`](https://dplyr.tidyverse.org/reference/slice.md) [`slice_max()`](https://dplyr.tidyverse.org/reference/slice.md) [`slice_sample()`](https://dplyr.tidyverse.org/reference/slice.md) : Subset rows using their positions ### Columns Verbs that principally operate on columns. - [`glimpse`](https://dplyr.tidyverse.org/reference/glimpse.md) : Get a glimpse of your data - [`mutate()`](https://dplyr.tidyverse.org/reference/mutate.md) : Create, modify, and delete columns - [`pull()`](https://dplyr.tidyverse.org/reference/pull.md) : Extract a single column - [`relocate()`](https://dplyr.tidyverse.org/reference/relocate.md) : Change column order - [`rename()`](https://dplyr.tidyverse.org/reference/rename.md) [`rename_with()`](https://dplyr.tidyverse.org/reference/rename.md) : Rename columns - [`select()`](https://dplyr.tidyverse.org/reference/select.md) : Keep or drop columns using their names and types ### Groups Verbs that principally operate on groups of rows. - [`count()`](https://dplyr.tidyverse.org/reference/count.md) [`tally()`](https://dplyr.tidyverse.org/reference/count.md) [`add_count()`](https://dplyr.tidyverse.org/reference/count.md) [`add_tally()`](https://dplyr.tidyverse.org/reference/count.md) : Count the observations in each group - [`group_by()`](https://dplyr.tidyverse.org/reference/group_by.md) [`ungroup()`](https://dplyr.tidyverse.org/reference/group_by.md) : Group by one or more variables - [`dplyr_by`](https://dplyr.tidyverse.org/reference/dplyr_by.md) : Per-operation grouping with `.by`/`by` - [`rowwise()`](https://dplyr.tidyverse.org/reference/rowwise.md) : Group input by rows - [`summarise()`](https://dplyr.tidyverse.org/reference/summarise.md) [`summarize()`](https://dplyr.tidyverse.org/reference/summarise.md) : Summarise each group down to one row - [`reframe()`](https://dplyr.tidyverse.org/reference/reframe.md) : Transform each group to an arbitrary number of rows - [`n()`](https://dplyr.tidyverse.org/reference/context.md) [`cur_group()`](https://dplyr.tidyverse.org/reference/context.md) [`cur_group_id()`](https://dplyr.tidyverse.org/reference/context.md) [`cur_group_rows()`](https://dplyr.tidyverse.org/reference/context.md) [`cur_column()`](https://dplyr.tidyverse.org/reference/context.md) : Information about the "current" group or variable ### Data frames Verbs that principally operate on pairs of data frames. - [`bind_cols()`](https://dplyr.tidyverse.org/reference/bind_cols.md) : Bind multiple data frames by column - [`bind_rows()`](https://dplyr.tidyverse.org/reference/bind_rows.md) : Bind multiple data frames by row - [`intersect()`](https://dplyr.tidyverse.org/reference/setops.md) [`union()`](https://dplyr.tidyverse.org/reference/setops.md) [`union_all()`](https://dplyr.tidyverse.org/reference/setops.md) [`setdiff()`](https://dplyr.tidyverse.org/reference/setops.md) [`setequal()`](https://dplyr.tidyverse.org/reference/setops.md) [`symdiff()`](https://dplyr.tidyverse.org/reference/setops.md) : Set operations - [`inner_join()`](https://dplyr.tidyverse.org/reference/mutate-joins.md) [`left_join()`](https://dplyr.tidyverse.org/reference/mutate-joins.md) [`right_join()`](https://dplyr.tidyverse.org/reference/mutate-joins.md) [`full_join()`](https://dplyr.tidyverse.org/reference/mutate-joins.md) : Mutating joins - [`nest_join()`](https://dplyr.tidyverse.org/reference/nest_join.md) : Nest join - [`semi_join()`](https://dplyr.tidyverse.org/reference/filter-joins.md) [`anti_join()`](https://dplyr.tidyverse.org/reference/filter-joins.md) : Filtering joins - [`cross_join()`](https://dplyr.tidyverse.org/reference/cross_join.md) : Cross join - [`join_by()`](https://dplyr.tidyverse.org/reference/join_by.md) : Join specifications - [`rows_insert()`](https://dplyr.tidyverse.org/reference/rows.md) [`rows_append()`](https://dplyr.tidyverse.org/reference/rows.md) [`rows_update()`](https://dplyr.tidyverse.org/reference/rows.md) [`rows_patch()`](https://dplyr.tidyverse.org/reference/rows.md) [`rows_upsert()`](https://dplyr.tidyverse.org/reference/rows.md) [`rows_delete()`](https://dplyr.tidyverse.org/reference/rows.md) : Manipulate individual rows ### Multiple columns Pair these functions with [`mutate()`](https://dplyr.tidyverse.org/reference/mutate.md), [`summarise()`](https://dplyr.tidyverse.org/reference/summarise.md), [`filter()`](https://dplyr.tidyverse.org/reference/filter.md), and [`group_by()`](https://dplyr.tidyverse.org/reference/group_by.md) to operate on multiple columns simultaneously. - [`across()`](https://dplyr.tidyverse.org/reference/across.md) [`if_any()`](https://dplyr.tidyverse.org/reference/across.md) [`if_all()`](https://dplyr.tidyverse.org/reference/across.md) : Apply a function (or functions) across multiple columns - [`c_across()`](https://dplyr.tidyverse.org/reference/c_across.md) : Combine values from multiple columns - [`pick()`](https://dplyr.tidyverse.org/reference/pick.md) : Select a subset of columns ## Vector functions Unlike other dplyr functions, these functions work on individual vectors, not data frames. - [`between()`](https://dplyr.tidyverse.org/reference/between.md) : Detect where values fall in a specified range - [`case_when()`](https://dplyr.tidyverse.org/reference/case-and-replace-when.md) [`replace_when()`](https://dplyr.tidyverse.org/reference/case-and-replace-when.md) : A general vectorised if-else - [`coalesce()`](https://dplyr.tidyverse.org/reference/coalesce.md) : Find the first non-missing element - [`consecutive_id()`](https://dplyr.tidyverse.org/reference/consecutive_id.md) : Generate a unique identifier for consecutive combinations - [`cumall()`](https://dplyr.tidyverse.org/reference/cumall.md) [`cumany()`](https://dplyr.tidyverse.org/reference/cumall.md) [`cummean()`](https://dplyr.tidyverse.org/reference/cumall.md) : Cumulative versions of any, all, and mean - [`desc()`](https://dplyr.tidyverse.org/reference/desc.md) : Descending order - [`if_else()`](https://dplyr.tidyverse.org/reference/if_else.md) : Vectorised if-else - [`lag()`](https://dplyr.tidyverse.org/reference/lead-lag.md) [`lead()`](https://dplyr.tidyverse.org/reference/lead-lag.md) : Compute lagged or leading values - [`n_distinct()`](https://dplyr.tidyverse.org/reference/n_distinct.md) : Count unique combinations - [`na_if()`](https://dplyr.tidyverse.org/reference/na_if.md) : Convert values to `NA` - [`near()`](https://dplyr.tidyverse.org/reference/near.md) : Compare two numeric vectors - [`nth()`](https://dplyr.tidyverse.org/reference/nth.md) [`first()`](https://dplyr.tidyverse.org/reference/nth.md) [`last()`](https://dplyr.tidyverse.org/reference/nth.md) : Extract the first, last, or nth value from a vector - [`ntile()`](https://dplyr.tidyverse.org/reference/ntile.md) : Bucket a numeric vector into `n` groups - [`order_by()`](https://dplyr.tidyverse.org/reference/order_by.md) : A helper function for ordering window function output - [`percent_rank()`](https://dplyr.tidyverse.org/reference/percent_rank.md) [`cume_dist()`](https://dplyr.tidyverse.org/reference/percent_rank.md) : Proportional ranking functions - [`recode_values()`](https://dplyr.tidyverse.org/reference/recode-and-replace-values.md) [`replace_values()`](https://dplyr.tidyverse.org/reference/recode-and-replace-values.md) : Recode and replace values - [`row_number()`](https://dplyr.tidyverse.org/reference/row_number.md) [`min_rank()`](https://dplyr.tidyverse.org/reference/row_number.md) [`dense_rank()`](https://dplyr.tidyverse.org/reference/row_number.md) : Integer ranking functions - [`when_any()`](https://dplyr.tidyverse.org/reference/when-any-all.md) [`when_all()`](https://dplyr.tidyverse.org/reference/when-any-all.md) : Elementwise [`any()`](https://rdrr.io/r/base/any.html) and [`all()`](https://rdrr.io/r/base/all.html) ## Built in datasets - [`band_members`](https://dplyr.tidyverse.org/reference/band_members.md) [`band_instruments`](https://dplyr.tidyverse.org/reference/band_members.md) [`band_instruments2`](https://dplyr.tidyverse.org/reference/band_members.md) : Band membership - [`starwars`](https://dplyr.tidyverse.org/reference/starwars.md) : Starwars characters - [`storms`](https://dplyr.tidyverse.org/reference/storms.md) : Storm tracks data ## Grouping helpers This (mostly) experimental family of functions are used to manipulate groups in various ways. - [`group_cols()`](https://dplyr.tidyverse.org/reference/group_cols.md) : Select grouping variables - [`group_map()`](https://dplyr.tidyverse.org/reference/group_map.md) [`group_modify()`](https://dplyr.tidyverse.org/reference/group_map.md) [`group_walk()`](https://dplyr.tidyverse.org/reference/group_map.md) **\[experimental\]** : Apply a function to each group - [`group_trim()`](https://dplyr.tidyverse.org/reference/group_trim.md) **\[experimental\]** : Trim grouping structure ## Superseded Superseded functions have been replaced by new approaches that we believe to be superior, but we don’t want to force you to change until you’re ready, so the existing functions will stay around for several years. - [`all_vars()`](https://dplyr.tidyverse.org/reference/all_vars.md) [`any_vars()`](https://dplyr.tidyverse.org/reference/all_vars.md) **\[superseded\]** : Apply predicate to all variables - [`recode()`](https://dplyr.tidyverse.org/reference/recode.md) [`recode_factor()`](https://dplyr.tidyverse.org/reference/recode.md) **\[superseded\]** : Recode values - [`sample_n()`](https://dplyr.tidyverse.org/reference/sample_n.md) [`sample_frac()`](https://dplyr.tidyverse.org/reference/sample_n.md) **\[superseded\]** : Sample n rows from a table - [`scoped`](https://dplyr.tidyverse.org/reference/scoped.md) **\[superseded\]** : Operate on a selection of variables - [`top_n()`](https://dplyr.tidyverse.org/reference/top_n.md) [`top_frac()`](https://dplyr.tidyverse.org/reference/top_n.md) **\[superseded\]** : Select top (or bottom) n rows (by value) - [`vars()`](https://dplyr.tidyverse.org/reference/vars.md) **\[superseded\]** : Select variables - [`with_groups()`](https://dplyr.tidyverse.org/reference/with_groups.md) **\[superseded\]** : Perform an operation with temporary groups ## Remote tables - [`auto_copy()`](https://dplyr.tidyverse.org/reference/auto_copy.md) : Copy tables to same source, if necessary - [`compute()`](https://dplyr.tidyverse.org/reference/compute.md) [`collect()`](https://dplyr.tidyverse.org/reference/compute.md) [`collapse()`](https://dplyr.tidyverse.org/reference/compute.md) : Force computation of a database query - [`copy_to()`](https://dplyr.tidyverse.org/reference/copy_to.md) : Copy a local data frame to a remote src - [`ident()`](https://dplyr.tidyverse.org/reference/ident.md) : Flag a character vector as SQL identifiers - [`explain()`](https://dplyr.tidyverse.org/reference/explain.md) [`show_query()`](https://dplyr.tidyverse.org/reference/explain.md) : Explain details of a tbl - [`tbl()`](https://dplyr.tidyverse.org/reference/tbl.md) [`is.tbl()`](https://dplyr.tidyverse.org/reference/tbl.md) : Create a table from a data source - [`sql()`](https://dplyr.tidyverse.org/reference/sql.md) : SQL escaping. # Articles ### Get started - [Introduction to dplyr](https://dplyr.tidyverse.org/articles/dplyr.md): Start here if this is your first time using dplyr. You’ll learn the basic philosophy, the most important data manipulation verbs, and the pipe, `|>`, which allows you to combine multiple verbs together to solve real problems. - [Grouped data](https://dplyr.tidyverse.org/articles/grouping.md): To unlock the full potential of dplyr, you need to understand how each verb interacts with grouping. This vignette shows you how to manipulate grouping, how each verb changes its behaviour when working with grouped data, and how you can access data about the “current” group from within a verb. - [Two-table verbs](https://dplyr.tidyverse.org/articles/two-table.md): Most dplyr verbs work with a single data set, but most data analyses involve multiple datasets. This vignette introduces you to the dplyr verbs that work with more one than data set, and introduces to the mutating joins, filtering joins, and the set operations. - [Recoding columns and replacing values](https://dplyr.tidyverse.org/articles/recoding-replacing.md): - [dplyr \<-\> base R](https://dplyr.tidyverse.org/articles/base.md): How does dplyr compare to base R? This vignette describes the main differences in philosophy, and shows the base R code most closely equivalent to each dplyr verb. ### Automate - [Column-wise operations](https://dplyr.tidyverse.org/articles/colwise.md): Learn how to easily repeat the same operation across multiple columns using [`across()`](https://dplyr.tidyverse.org/reference/across.md). - [Row-wise operations](https://dplyr.tidyverse.org/articles/rowwise.md): In R, it’s usually easier to do something for each column than for each row. In this vignette you will learn how to use the [`rowwise()`](https://dplyr.tidyverse.org/reference/rowwise.md) function to perform operations by row. Along the way, you’ll learn about list-columns, and see how you might perform simulations and modelling within dplyr verbs. - [Programming with dplyr](https://dplyr.tidyverse.org/articles/programming.md): Most dplyr verbs use “tidy evaluation”, a special type of non-standard evaluation. In this vignette, you’ll learn the two basic forms, data masking and tidy selection, and how you can program with them using either functions or for loops. ### Other - [Window functions](https://dplyr.tidyverse.org/articles/window-functions.md): Window functions are a useful family of functions that work with vectors (returning an output the same size as the input), and combine naturally with [`mutate()`](https://dplyr.tidyverse.org/reference/mutate.md) and [`filter()`](https://dplyr.tidyverse.org/reference/filter.md). - [Using dplyr in packages](https://dplyr.tidyverse.org/articles/in-packages.md): A guide for package authors who use dplyr.