CRAN release: 2023-09-03
CRAN release: 2023-04-20
CRAN release: 2023-03-22
Mutating joins now warn about multiple matches much less often. At a high level, a warning was previously being thrown when a one-to-many or many-to-many relationship was detected between the keys of x and y, but is now only thrown for a many-to-many relationship, which is much rarer and much more dangerous than one-to-many because it can result in a Cartesian explosion in the number of rows returned from the join (#6731, #6717).
We’ve accomplished this in two steps:
multiple now defaults to "all", and the options of "error" and "warning" are now deprecated in favor of using relationship (see below). We are using an accelerated deprecation process for these two options because they’ve only been available for a few weeks, and relationship is a clearly superior alternative.
The mutating joins gain a new relationship argument, allowing you to optionally enforce one of the following relationship constraints between the keys of x and y:
"many-to-one" enforces that each row in x can match at most 1 row in y. If a row in x matches more than 1 row in y, an error is thrown. This option serves as the replacement for multiple = "error".
The default behavior of relationship doesn’t assume that there is any relationship between x and y. However, for equality joins it will check for the presence of a many-to-many relationship, and will warn if it detects one.
This change unfortunately does mean that if you have set multiple = "all" to avoid a warning and you happened to be doing a many-to-many style join, then you will need to replace multiple = "all" with relationship = "many-to-many" to silence the new warning, but we believe this should be rare since many-to-many relationships are fairly uncommon.
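As a rough illustration of the relationship argument described above, the following sketch uses two small made-up tibbles (the names x, y, key, x_val and y_val are hypothetical) and assumes dplyr 1.1.1 or later is installed.

```r
library(dplyr)

# Hypothetical data: key 2 is duplicated on both sides (many-to-many).
x <- tibble(key = c(1, 2, 2), x_val = c("a", "b", "c"))
y <- tibble(key = c(1, 2, 2), y_val = c("p", "q", "r"))

# Declaring the relationship explicitly silences the many-to-many warning.
many <- left_join(x, y, by = join_by(key), relationship = "many-to-many")

# "many-to-one" requires each row of x to match at most one row of y,
# so the duplicated key in y triggers an error here.
strict <- tryCatch(
  left_join(x, y, by = join_by(key), relationship = "many-to-one"),
  error = function(e) "error"
)
```

Declaring the expected relationship up front turns a silent row explosion into an explicit failure, which is usually easier to debug.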
Fixed an issue where expressions involving infix operators had an abnormally large amount of overhead (#6681).
Joins now throw a more informative error when y doesn’t have the same source as x.
All major dplyr verbs now throw an informative error message if the input data frame contains a column named
R >= 3.5.0 is now explicitly required. This is in line with the tidyverse policy of supporting the 5 most recent versions of R.
CRAN release: 2023-01-29
You can now write:
The main reason to do this is that .by only affects a single operation. In the example above, an ungrouped data frame went into the summarise() call, so an ungrouped data frame will come out; with .by, you never need to remember to ungroup() afterwards and you never need to use the
.by will never sort the results by the group key, unlike with group_by(). Instead, the results are returned using the existing ordering of the groups from the original data. We feel this is more predictable, better maintains any ordering you might have already applied with a previous call to arrange(), and provides a way to maintain the current ordering without having to resort to factors.
This feature was inspired by data.table, where the equivalent syntax looks like:
starwars[, .(mean_height = mean(height)), by = .(species, homeworld)]
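To make the contrast between per-operation and persistent grouping concrete, here is a minimal sketch in dplyr. The tibble df and its columns g and v are made up for illustration, and dplyr 1.1.0 or later is assumed.

```r
library(dplyr)

# Hypothetical data in which group "b" appears before group "a".
df <- tibble(g = c("b", "a", "b"), v = c(1, 2, 3))

# Per-operation grouping: the result is ungrouped and keeps the order
# in which the groups first appear in the data ("b", then "a").
by_result <- df %>% summarise(total = sum(v), .by = g)

# Persistent grouping: the result is sorted by the group key ("a", then "b").
grouped_result <- df %>% group_by(g) %>% summarise(total = sum(v))
```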
reframe() has been added in response to valid concern from the community that allowing summarise() to return any number of rows per group increases the chance for accidental bugs. We still feel that this is a powerful technique, and is a principled replacement for do(), so we have moved these features to reframe().
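A small sketch of returning more than one row per group with reframe(), assuming dplyr 1.1.0 or later. The data and column names are hypothetical.

```r
library(dplyr)

# Hypothetical data: two groups with different numbers of rows.
df <- tibble(g = c("a", "a", "b"), v = c(1, 5, 3))

# range() returns two values, so each group contributes two rows.
ranges <- df %>% reframe(v = range(v), .by = g)
```

Here reframe() yields four rows (min and max for each of the two groups) and always returns an ungrouped data frame.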
group_by()now uses a new algorithm for computing groups. It is often faster than the previous approach (especially when there are many groups), and in most cases there should be no changes. The one exception is with character vectors, see the C locale news bullet below for more details (#4406, #6297).
Joins have been completely overhauled to enable more flexible join operations and provide more tools for quality control. Many of these changes are inspired by data.table’s join syntax (#5914, #5661, #5413, #2240).
A join specification can now be created through join_by(). This allows you to specify both the left and right hand side of a join using unquoted column names, such as join_by(sale_date == commercial_date). Join specifications can be supplied to any *_join() function as the
Join specifications allow for new types of joins:
Equality joins: The most common join, specified by ==. For example, join_by(sale_date == commercial_date).
Inequality joins: For joining on inequalities, e.g. <=. For example, use join_by(sale_date >= commercial_date) to find every commercial that aired before a particular sale.
Rolling joins: For “rolling” the closest match forward or backwards when there isn’t an exact match, specified by using the rolling helper, closest(). For example, join_by(closest(sale_date >= commercial_date)) to find only the most recent commercial that aired before a particular sale.
Overlap joins: For detecting overlaps between sets of columns, specified by using one of the overlap helpers: overlaps(). For example, use join_by(between(commercial_date, sale_date_lower, sale_date)) to find commercials that aired before a particular sale, as long as they occurred after some lower bound, such as 40 days before the sale was made.
Note that you cannot use arbitrary expressions in the join conditions, like join_by(sale_date - 40 >= commercial_date). Instead, use mutate() to create a new column containing the result of sale_date - 40 and refer to that by name in join_by().
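The join types above can be sketched with two tiny made-up tables. The tibbles sales and commercials, their id columns and the dates are hypothetical, and dplyr 1.1.0 or later is assumed. The last join shows the mutate() workaround for a computed lower bound, expressed here as two inequality conditions rather than an overlap helper.

```r
library(dplyr)

# Hypothetical data: two sales and three commercials.
sales <- tibble(
  sale_id = 1:2,
  sale_date = as.Date(c("2023-01-10", "2023-02-20"))
)
commercials <- tibble(
  commercial_id = 1:3,
  commercial_date = as.Date(c("2023-01-01", "2023-01-05", "2023-02-01"))
)

# Inequality join: every commercial that aired on or before each sale.
before <- left_join(
  sales, commercials,
  by = join_by(sale_date >= commercial_date)
)

# Rolling join: only the most recent commercial on or before each sale.
rolling <- left_join(
  sales, commercials,
  by = join_by(closest(sale_date >= commercial_date))
)

# Computed bounds must be materialised as a column first.
windowed <- sales %>%
  mutate(sale_date_lower = sale_date - 40) %>%
  left_join(
    commercials,
    by = join_by(sale_date_lower <= commercial_date, sale_date >= commercial_date)
  )
```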
multiple is a new argument for controlling what happens when a row in x matches multiple rows in y. For equality joins and rolling joins, where this is usually surprising, this defaults to signalling a "warning", but still returns all of the matches. For inequality joins, where multiple matches are usually expected, this defaults to returning "all" of the matches. You can also return only the
"any" of the matches, or you can
keep now defaults to keep = FALSE for equality conditions, but keep = TRUE for inequality conditions, since you generally want to preserve both sides of an inequality join.
unmatched is a new argument for controlling what happens when a row would be dropped because it doesn’t have a match. For backwards compatibility, the default is "drop", but you can also choose to "error" if dropped rows would be surprising.
case_match() is a “vectorised switch” variant of case_when() that matches on values rather than logical expressions. It is like a SQL “simple” CASE WHEN statement, whereas case_when() is like a SQL “searched” CASE WHEN statement (#6328).
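A minimal sketch of matching on values with case_match(), assuming dplyr 1.1.0 or later. The input vector and labels are made up.

```r
library(dplyr)

# Hypothetical input; NA matches none of the listed values.
x <- c("a", "b", "c", NA)

# Each left-hand side is a set of values to match, not a logical condition.
out <- case_match(
  x,
  c("a", "b") ~ "early",
  "c" ~ "late",
  .default = "unknown"
)
```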
pick() makes it easy to access a subset of columns from the current group. pick() is intended as a replacement for across(.fns = NULL), cur_data_all(). We feel that pick() is a much more evocative name when you are just trying to select a subset of columns from your data (#6204).
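One way pick() might be used is to hand a subset of columns to a function that expects a data frame. This is only a sketch with hypothetical column names, assuming dplyr 1.1.0 or later.

```r
library(dplyr)

# Hypothetical data with two numeric columns and one character column.
df <- tibble(x = c(3, 1, 2), y = c(1, 1, 2), z = c("a", "b", "c"))

# pick(x, y) returns a data frame of just those columns, which rowSums()
# can consume directly.
out <- df %>% mutate(total = rowSums(pick(x, y)))
```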
group_by() now use the C locale, not the system locale, when ordering or grouping character vectors. This brings substantial performance improvements, increases reproducibility across R sessions, makes dplyr more consistent with data.table, and we believe it should affect little existing code. If it does affect your code, you can use options(dplyr.legacy_locale = TRUE) to quickly revert to the previous behavior. However, in general, we instead recommend that you use the new .locale argument to precisely specify the desired locale. For a full explanation please read the associated grouping and ordering tidyups.
if_all() now require the .fns arguments. In general, we now recommend that you use pick() instead of an empty across(c(x, y)) (#6523).
Relying on the previous default of .cols = everything() is deprecated. We have skipped the soft-deprecation stage in this case, because indirect usage of across() and friends in this way is rare.
Relying on the previous default of .fns = NULL is not yet formally soft-deprecated, because there was no good alternative until now, but it is discouraged and will be soft-deprecated in the next minor release.
Passing extra arguments through ... to across() is soft-deprecated because it’s ambiguous when those arguments are evaluated. Now, instead of (e.g.) across(a:b, mean, na.rm = TRUE) you should write across(a:b, ~ mean(.x, na.rm = TRUE)) (#6073).
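The recommended lambda form can be sketched as follows, using a made-up tibble with columns a and b.

```r
library(dplyr)

# Hypothetical data with one missing value per column.
df <- tibble(a = c(1, NA, 3), b = c(2, 4, NA))

# The extra argument lives inside the lambda, so it is unambiguous
# when and where na.rm = TRUE is evaluated.
out <- df %>% summarise(across(a:b, ~ mean(.x, na.rm = TRUE)))
```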
progress_estimate() is deprecated for all uses (#6387).
All functions deprecated in 1.0.0 (released April 2020) and earlier now warn every time you use them (#6387). This includes tbl_df(), and a handful of older arguments. They are likely to be made defunct in the next major version (but not before mid 2024).
slice()ing with a 1-column matrix is deprecated.
recode_factor() is superseded. We don’t have a direct replacement for it yet, but we plan to add one to forcats. In the meantime you can often use case_match(.ptype = factor(levels = )) instead (#6433).
mutate() have moved from experimental to stable.
rows_*() family of functions have moved from experimental to stable.
Many of dplyr’s vector functions have been rewritten to make use of the vctrs package, bringing greater consistency and improved performance.
between() can now work with all vector types, not just numeric and date-time. Additionally, right can now also be vectors (with the same length as
right are cast to the common type before the comparison is made (#6183, #6260, #6478).
Has a new .default argument that is intended to replace usage of TRUE ~ default_value as a more explicit and readable way to specify a default value. In the future, we will deprecate the unsafe recycling of the LHS inputs that allows TRUE ~ to work, so we encourage you to switch to using .default.
No longer requires exact matching of the types of RHS values. For example, the following no longer requires you to use
Supports a larger variety of RHS value types. For example, you can use a data frame to create multiple columns at once.
.size arguments which allow you to enforce a particular output type and size.
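The .default argument and the relaxed RHS typing described in this list can be sketched together. The input values are made up and dplyr 1.1.0 or later is assumed.

```r
library(dplyr)

# Hypothetical input, including a missing value.
x <- c(5, 15, NA)

# A bare NA on the right-hand side is combined with the character values
# via their common type; rows that match no condition (including the
# missing input) fall through to .default.
out <- case_when(
  x < 10 ~ "low",
  x >= 10 ~ NA,
  .default = "other"
)
```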
NULL inputs up front.
No longer iterates over the columns of data frame input. Instead, a row is now only coalesced if it is entirely missing, which is consistent with vctrs::vec_detect_missing() and greatly simplifies the implementation.
.size arguments which allow you to enforce a particular output type and size.
When used on a data frame, these functions now return a single row rather than a single column. This is more consistent with the vctrs principle that a data frame is generally treated as a vector of rows.
default is no longer “guessed”, and will always automatically be set to a missing value appropriate for the type of
n is not an integer. nth(x, n = 2) is fine, but nth(x, n = 2.5) is now an error.
No longer support indexing into scalar objects, like <lm> or scalar S4 objects (#6670).
if_else() gains most of the same benefits as case_when(). In particular, if_else() now takes the common type of
missing to determine the output type, meaning that you can now reliably use NA, rather than NA_character_ and friends (#6243).
if_else() also no longer allows you to supply
false, which was an undocumented usage that we consider to be off-label, because
false are intended to be (and documented to be) vector inputs (#6730).
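A short sketch of the common-type behaviour in if_else(), using made-up input values and assuming dplyr 1.1.0 or later.

```r
library(dplyr)

# Hypothetical input, including a missing value.
x <- c(1, -1, NA)

# A bare NA is accepted for the false branch; the output type is the
# common type of the branches, here character.
out <- if_else(x > 0, "positive", NA, missing = "unknown")
```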
na_if() (#6329) now casts y to the type of x before comparison, which makes it clearer that this function is type and size stable on x. In particular, this means that you can no longer do na_if(<tibble>, 0), which previously accidentally allowed you to replace any instance of 0 across every column of the tibble with NA. na_if() was never intended to work this way, and this is considered off-label usage.
You can also now replace
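Since na_if() no longer accepts a whole tibble, one possible column-wise alternative is to combine it with across(). This is a sketch under that assumption, with hypothetical column names. It restricts itself to numeric columns so that 0 can be cast to each column's type.

```r
library(dplyr)

# Vector usage: y is cast to the type of x before comparison.
v <- na_if(c(1, 0, 2), 0)

# Hypothetical data frame usage, applied column by column.
df <- tibble(a = c(0, 1), b = c(2, 0), label = c("u", "v"))
cleaned <- df %>% mutate(across(where(is.numeric), ~ na_if(.x, 0)))
```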
Fixed an issue with latest rlang that caused internal tools (such as mask$eval_all_summarise()) to be mentioned in error messages (#6308).
Joins now reference the correct column in y when a type error is thrown while joining on two columns with different names (#6465).
Joins on very wide tables are no longer bottlenecked by the application of
*_join() now error if you supply them with additional arguments that aren’t used (#6228).
Anonymous functions supplied with \() are now inlined by across() if possible, which slightly improves performance and makes possible further optimisations in the future.
dplyr no longer provides
tbl_sql. These methods have been accidentally overriding the tbl_lazy methods that dbplyr provides, which has resulted in issues with the grouping structure of the output (#6338, tidyverse/dbplyr#940).
Warnings emitted inside mutate() and variants are now collected and stashed away. Run the new last_dplyr_warnings() function to see the warnings emitted within dplyr verbs during the last top-level command.
nest_join() has gained the na_matches argument that all other joins have.
n to be a single positive integer.
n to be an integer.
slice_*() generics now perform argument validation. This should make methods more consistent and simpler to implement (#6361).
slice_max() now consistently include missing values in the result if necessary (i.e. there aren’t enough non-missing values to reach the
prop you have selected). If you don’t want missing values to be included at all, set na_rm = TRUE (#6177).
slice_sample() returns a data frame or group with the same number of rows as the input when replace = FALSE and n is larger than the number of rows or prop is larger than 1. This reverts a change made in 1.0.8, returning to the behavior of 1.0.7 (#6185).
CRAN release: 2022-04-28
rows_append() works like rows_insert() but ignores keys and allows you to insert arbitrary rows with a guarantee that the type of x won’t change (#6249, thanks to @krlmlr for the implementation and @mgirlich for the idea).
rows_*() functions no longer require that the key values in x uniquely identify each row. Additionally, rows_delete() no longer require that the key values in y uniquely identify each row. Relaxing this restriction should make these functions more practically useful for data frames, and alternative backends can enforce this in other ways as needed (i.e. through primary keys) (#5553).
rows_insert() gained a new conflict argument allowing you greater control over rows in y with keys that conflict with keys in x. A conflict arises if a key in y already exists in x. By default, a conflict results in an error, but you can now also
y rows. This is very similar to the ON CONFLICT DO NOTHING command from SQL (#5588, with helpful additions from @mgirlich and @krlmlr).
rows_delete() gained a new unmatched argument allowing you greater control over rows in y with keys that are unmatched by the keys in x. By default, an unmatched key results in an error, but you can now also
y rows (#5984, #5699).
rows_delete() no longer requires that the columns of y be a strict subset of x. Only the columns specified through by will be utilized from y; all others will be dropped with a message.
rows_*() functions now always retain the column types of x. This behavior was documented, but previously wasn’t being applied correctly (#6240).
rows_*() functions now fail elegantly if y is a zero column data frame and by isn’t specified (#6179).
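The conflict and unmatched arguments and rows_append() from this release can be sketched as follows. The tibbles x and y and their columns are hypothetical, and dplyr 1.0.9 or later is assumed.

```r
library(dplyr)

# Hypothetical data: id 3 exists in both tables, id 4 only in y.
x <- tibble(id = 1:3, val = c("a", "b", "c"))
y <- tibble(id = 3:4, val = c("z", "d"))

# Skip rows of y whose key already exists in x instead of erroring.
inserted <- rows_insert(x, y, by = "id", conflict = "ignore")

# Skip rows of y whose key has no match in x instead of erroring.
updated <- rows_update(x, y, by = "id", unmatched = "ignore")

# Ignore keys entirely and append every row of y.
appended <- rows_append(x, y)
```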
CRAN release: 2022-02-08
Better display of error messages thanks to rlang 1.0.0.
mutate(.keep = "none") is no longer identical to transmute(). transmute() has not been changed, and completely ignores the column ordering of the existing data, instead relying on the ordering of expressions supplied through
mutate(.keep = "none") has been changed to ensure that pre-existing columns are never moved, which aligns more closely with the other
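The ordering difference can be seen in a small sketch. The tibble and column names are made up, and dplyr 1.0.8 or later is assumed.

```r
library(dplyr)

# Hypothetical data with two existing columns.
df <- tibble(x = 1, y = 2)

# Pre-existing column x stays in its original position; z is appended.
kept <- mutate(df, z = x + y, x = x * 10, .keep = "none")

# transmute() follows the order of the supplied expressions instead.
transmuted <- transmute(df, z = x + y, x = x * 10)
```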
dplyr now uses rlang::check_installed() to prompt you whether to install required packages that are missing.
CRAN release: 2021-06-18
CRAN release: 2021-05-05
across()now inlines lambda-formulas. This is slightly more performant and will allow more optimisations in the future.
Fixed quosure handling in dplyr::group_by() that caused issues with extra arguments (tidyverse/lubridate#959).
Row-wise data frames of 0 rows and list columns are supported again (#5804).
CRAN release: 2021-03-05
Using testthat 3rd edition.
CRAN release: 2021-02-02
CRAN release: 2021-01-15
Removed default fallbacks to lazyeval methods; this will yield better error messages when you call a dplyr function with the wrong input, and is part of our long term plan to remove the deprecated lazyeval interface.
Improved performance with many columns, with a dynamic data mask using active bindings and lazy chops (#5017).
dplyr now depends on R 3.3.0
CRAN release: 2020-08-18
CRAN release: 2020-07-31
tally() no longer automatically weights by column n if present (#5298). dplyr 1.0.0 introduced this behaviour because of Hadley’s faulty memory. Historically tally() automatically weighted and count() did not, but this behaviour was accidentally changed in 0.8.2 (#4408) so that neither automatically weighted by n. Since 0.8.2 is almost a year old, and the automatic weighting behaviour was a little confusing anyway, we’ve removed it from both tally() and count().
wt = n() is now deprecated; now just omit the wt argument.
CRAN release: 2020-05-29
bind_cols() no longer converts to a tibble; it returns a data frame if the input is a data frame.
Combining factor and character vectors silently creates a character vector; previously it created a character vector with a warning.
Combining multiple factors creates a factor with combined levels; previously it created a character vector with a warning.
Data frames, tibbles and grouped data frames are no longer considered equal, even if the data is the same.
Equality checks for data frames no longer ignore row order or groupings.
all.equal() internally. When comparing data frames, tests that used to pass may now fail.
distinct() keeps the original column order.
distinct() on missing columns now raises an error; it has been a compatibility warning for a long time.
group_modify() puts the grouping variable to the front.
Fix by prefixing with
dplyr::mutate(mtcars, x = dplyr::n())
The old data format for grouped_df is no longer supported. This may affect you if you have serialized grouped data frames to disk, e.g. with saveRDS() or when using knitr caching.
Extending data frames requires that the extra class or classes are added first, not last. Having the extra class at the end causes some vctrs operations to fail with a message like:
Input must be a vector, not a `<data.frame/...>` object
right_join() no longer sorts the rows of the resulting tibble according to the order of the RHS by argument in tibble
cur_group_rows()) provide a full set of options for accessing information about the “current” group in dplyr verbs. They are inspired by data.table’s
rows_delete()) provide a new API to insert and delete rows from a second data frame or table. Support for updating mutable backends is planned (#4654).
rename() use the latest version of the tidyselect interface. Practically, this means that you can now combine selections using Boolean logic (i.e.
|), and use predicate functions with
where(is.character)) to select variables by type (#4680). It also makes it possible to use rename() to repair data frames with duplicated names (#4615) and prevents you from accidentally introducing duplicate names (#4643). This also means that dplyr now re-exports
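A brief sketch of Boolean selection logic and predicate functions in select(), using a made-up tibble and assuming dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical data with integer, character and double columns.
df <- tibble(id = 1:2, name = c("a", "b"), score = c(1.5, 2.5))

# Numeric columns, excluding id.
numeric_not_id <- select(df, where(is.numeric) & !id)

# Character columns, or any column whose name starts with "sc".
chr_or_sc <- select(df, where(is.character) | starts_with("sc"))
```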
slice() gains a new set of helpers:
summarise() can create summaries of greater than length 1 if you use a summary function that returns multiple values.
.groups= argument to control the grouping structure.
mutate() (for data frames only), gains an experimental new argument called .keep that allows you to control which variables are kept from the input
.keep = "all" is the default; it keeps all variables.
.keep = "none" retains no input variables (except for grouping keys), so behaves like
.keep = "unused" keeps only variables not used to make new columns.
.keep = "used" keeps only the input variables used to create new columns; it’s useful for double checking your work (#3721).
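The "used" and "unused" options can be compared in a small sketch with hypothetical columns a, b and c.

```r
library(dplyr)

# Hypothetical data with three input columns.
df <- tibble(a = 1, b = 2, c = 3)

# Keeps only the inputs that fed the new column, plus the new column.
used <- mutate(df, d = a + b, .keep = "used")

# Keeps only the inputs that were not used, plus the new column.
unused <- mutate(df, d = a + b, .keep = "unused")
```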
c_across() that can be used inside mutate() in row-wise data frames to easily (e.g.) compute a row-wise mean of all numeric variables. See vignette("rowwise") for more details.
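A minimal sketch of a row-wise mean with c_across(). The tibble and its columns are made up for illustration.

```r
library(dplyr)

# Hypothetical data: one id column and two numeric columns.
df <- tibble(id = 1:2, a = c(1, 4), b = c(3, 6))

# Within each row, c_across() combines the selected columns into one vector.
out <- df %>%
  rowwise() %>%
  mutate(m = mean(c_across(a:b))) %>%
  ungroup()
```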
rowwise() is no longer questioning; we now understand that it’s an important tool when you don’t have vectorised code. It now also allows you to specify additional variables that should be preserved in the output when summarising (#4723). The rowwise-ness is preserved by all operations; you need to explicitly drop it with
nest_by(). It has the same interface as group_by(), but returns a rowwise data frame of grouping keys, supplemented with a list-column of data frames containing the rest of the data.
The implementation of all dplyr verbs has been changed to use primitives provided by the vctrs package. This makes it easier to add support for new types of vector, radically simplifies the implementation, and makes all dplyr verbs more consistent.
The place where you are most likely to be impacted by the coercion changes is when working with factors in joins or grouped mutates: now when combining factors with different levels, dplyr creates a new factor with the union of the levels. This matches base R more closely, and while perhaps strictly less correct, is much more convenient.
dplyr dropped its two heaviest dependencies: Rcpp and BH. This should make it considerably easier and faster to build from source.
The implementation of all verbs has been carefully thought through. This mostly makes implementation simpler but should hopefully increase consistency, and also makes it easier to adapt dplyr to new data structures in the near future. Pragmatically, the biggest difference for most people will be that each verb documents its return value in terms of rows, columns, groups, and data frame attributes.
Row names are now preserved when working with data frames.
group_by() uses hashing from the
Grouped data frames now have $<- methods that re-generate the underlying grouping. Note that modifying grouping variables in multiple steps (i.e. df$grp1 <- 1; df$grp2 <- 1) will be inefficient since the data frame will be regrouped after each modification.
[.grouped_df now regroups to respect any grouping columns that have been removed (#4708).
- All deprecations now use the lifecycle package; that means that by default you’ll only see a deprecation warning once per session, and you can control this with options(lifecycle_verbosity = x) where x is one of NULL, “quiet”, “warning”, and “error”.
id(), deprecated in dplyr 0.5.0, is now defunct.
failwith(), deprecated in dplyr 0.7.0, is now defunct.
nasa have been pulled out into a separate cubelyr package (#4429).
Use of pkgconfig for setting na_matches argument to join functions is now deprecated (#4914). This was rarely used, and I’m now confident that the default is correct for R.
drop argument has been deprecated because it didn’t actually affect the output.
eval_tbls2() are now deprecated. These were only used in a handful of packages, and we now believe that you’re better off performing comparisons more directly (#4675).
group_by(add = ): please use
group_by(.dots = )/
group_by_prepare(.dots = ): please use
src_local() has been deprecated; it was part of an approach to testing dplyr backends that didn’t pan out.
The scoped helpers (all functions ending in
_all) have been superseded by across(). This dramatically reduces the API surface for dplyr, while at the same time providing a more flexible and less error-prone interface (#4769).
select_*() have been superseded by
all_equal() is questioning; it solves a problem that no longer seems important.
rowwise() is no longer questioning.
vignette("programming") has been completely rewritten to reflect our latest vocabulary, the most recent rlang features, and our current recommendations. It should now be substantially easier to program with dplyr.
dplyr now has a rudimentary, experimental, and stop-gap, extension mechanism documented in
dplyr no longer provides an all.equal.tbl_df() method. It never should have done so in the first place because it owns neither the generic nor the class. It also provided a problematic implementation because, by default, it ignored the order of the rows and the columns which is usually important. This is likely to cause new test failures in downstream packages; but on the whole we believe those failures to either reflect unexpected behaviour or tests that need to be strengthened (#2751).
keep argument so that you can optionally choose to keep both sets of join keys (#4589). This is useful when you want to figure out which rows were missing from either side.
Join functions can now perform a cross-join by specifying by = character() (#4206).
group_keys.rowwise_df() gives a 0 column data frame with
group_by(..., .add = TRUE) replaces group_by(..., add = TRUE), with a deprecation message. The old argument name was a mistake because it prevents you from creating a new grouping var called add and it violates our naming conventions (#4137).
count() now message if the default output name (n), already exists in the data frame. To quiet the message, you’ll need to supply an explicit name (#4284). You can override the default weighting to use a constant by setting wt = 1.
starwars dataset now does a better job of separating biological sex from gender identity. The previous gender column has been renamed to sex, since it actually describes the individual’s biological sex. A new gender column encodes the actual gender identity using other information about the Star Wars universe (@MeganBeckett, #4456).
Better performance for extracting slices of factors and ordered factors (#4501).
CRAN release: 2020-03-07
- Maintenance release for compatibility with R-devel.
CRAN release: 2019-07-04
- Fixed performance regression introduced in version 0.8.2 (#4458).
CRAN release: 2019-06-29
top_frac(data, proportion) is a shorthand for top_n(data, proportion * n()) (#4017).
Using quosures in colwise verbs is deprecated (#4330).
*_if() functions correctly handle columns with special names (#4380).
colwise functions support constants in formulas (#4374).
group_split() always sets the ptype attribute, which makes it more robust in the case where there are 0 groups.
CRAN release: 2019-05-14
Lists of formulas passed to colwise verbs are now automatically named.
Fixed handling of bare formulas in colwise verbs (#4183).
Support for R 3.1.* has been dropped. The minimal R version supported is now 3.2.0. https://www.tidyverse.org/articles/2019/04/r-version-support/
CRAN release: 2019-02-15
- Fixed integer C/C++ division, forced release by CRAN (#4185).
CRAN release: 2019-02-14
could not find function "n" or the warning
Calling `n()` without importing or prefixing it is deprecated, use `dplyr::n()`
The easiest fix is to import dplyr with #' @import dplyr in a roxygen comment; alternatively such functions can be imported selectively as any other function with importFrom(dplyr, n) in the
#' @importFrom dplyr n in a roxygen comment. The third option is to prefix them, i.e. use
If you see checking S3 generic/method consistency in R CMD check for your package, note that:
Error: `.data` is a corrupt grouped_df, ... signals code that makes wrong assumptions about the internals of a grouped data frame.
group_trim() drops unused levels of factors that are used as grouping variables.
group_walk() are purrr-like functions to iterate on groups of a grouped data frame, jointly identified by the data subset (exposed as .x) and the data key (a one row tibble, exposed as
group_map() returns a grouped data frame that combines the results of the function; group_walk() is only used for side effects and returns its input invisibly.
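The following sketch illustrates iterating over groups. It is written against later dplyr releases (1.0.0 and on is assumed), where the data-frame-returning role described above is played by group_modify() and group_map() returns a list. The tibble and column names are hypothetical.

```r
library(dplyr)

# Hypothetical grouped data: two rows in group "a", one in group "b".
gdf <- tibble(g = c("a", "a", "b"), v = 1:3) %>% group_by(g)

# group_map(): one list element per group; .x is the data subset.
sizes <- group_map(gdf, ~ nrow(.x))

# group_modify(): each call returns a data frame; results are recombined
# into a grouped data frame alongside the group keys.
counted <- group_modify(gdf, ~ tibble(n = nrow(.x)))

# group_walk(): called for side effects only; returns its input invisibly.
returned <- group_walk(gdf, ~ print(nrow(.x)))
```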
distinct_prepare(), previously known as distinct_vars() is exported. This is mostly useful for alternative backends (e.g.
# 3 groups
tibble(
  x = 1:2,
  f = factor(c("a", "b"), levels = c("a", "b", "c"))
) %>%
  group_by(f, .drop = FALSE)

# the order of the grouping variables matters
df <- tibble(
  x = c(1, 2, 1, 2),
  f = factor(c("a", "b", "a", "b"), levels = c("a", "b", "c"))
)
df %>% group_by(f, x, .drop = FALSE)
df %>% group_by(x, f, .drop = FALSE)
The default behaviour drops the empty groups as in the previous versions.
.preserve argument to control which groups it should keep. The default filter(.preserve = FALSE) recalculates the grouping structure based on the resulting data, otherwise it is kept as is.
The notion of lazily grouped data frames has disappeared. All dplyr verbs now immediately recalculate the grouping structure, and respect the levels of factors.
Subsets of columns now properly dispatch to the [[ method when the column is an object (a vector with a class) instead of making assumptions on how the column should be handled. The [ method must handle integer indices, including
x[NA_integer_] should produce a vector of the same class as x with whatever represents a missing value.
transmute_if() with grouped tibbles now informs you that the grouping variables are ignored. In the case of the _all() verbs, the message invites you to use mutate_at(df, vars(-group_cols())) (or the equivalent transmute_at() call) instead if you’d like to make it explicit in your code that the operation is not applied on the grouping variables.
Grouped data frames support [, drop = TRUE] (#3714).
Scoped filter variants now support functions and purrr-like lambdas:
R expressions that cannot be handled with native code are now evaluated with unwind-protection when available (on R 3.5 and later). This improves the performance of dplyr on data frames with many groups (and hence many expressions to evaluate). We benchmarked that computing a grouped average is consistently twice as fast with unwind-protection enabled.
Unwind-protection also makes dplyr more robust in corner cases because it ensures the C++ destructors are correctly called in all circumstances (debugger exit, captured condition, restart invocation).
Improved performance for wide tibbles (#3335).
Hybrid version of sum(na.rm = FALSE) exits early when there are missing values. This considerably improves performance when there are missing values early in the vector (#3288).
The grouping metadata of grouped data frame has been reorganized in a single tidy tibble, that can be accessed with the new group_data() function. The grouping tibble consists of one column per grouping variable, followed by a list column of the (1-based) indices of the groups. The new group_rows() function retrieves that list of indices (#3489).
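The grouping tibble can be inspected as in the sketch below, using a made-up grouped data frame and assuming dplyr 0.8.0 or later.

```r
library(dplyr)

# Hypothetical grouped data: group "b" covers rows 1 and 3, "a" covers row 2.
gdf <- tibble(g = c("b", "a", "b"), v = 1:3) %>% group_by(g)

# One column per grouping variable, followed by a list column of row indices.
meta <- group_data(gdf)

# Just the list of (1-based) row indices, one element per group.
rows <- group_rows(gdf)
```

The groups in meta are ordered by key ("a", then "b"), so the first element of rows holds the index for group "a".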
Hybrid evaluation has been completely redesigned for better performance and stability.
CRAN release: 2018-06-29
exprs() is no longer exported to avoid conflicts with
The MASS package is explicitly suggested to fix CRAN warnings on R-devel (#3657).
Fix rchk errors (#3693).
CRAN release: 2018-05-19
The major change in this version is that dplyr now depends on the selecting backend of the tidyselect package. If you have been linking to dplyr::select_helpers documentation topic, you should update the link to point to
Another change that causes warnings in packages is that dplyr now exports the exprs() function. This causes a collision with Biobase::exprs(). Either import functions from dplyr selectively rather than in bulk, or do not import Biobase::exprs() and refer to it with a namespace qualifier.
distinct(data, "string") now returns a one-row data frame again. (The previous behavior was to return the data unchanged.)
Fixed rare column name clash in ..._join() with non-join columns of the same name in both tables (#3266).
row_number() ordering to use the locale-dependent ordering functions in R when dealing with character vectors, rather than always using the C-locale ordering function in C (#2792, @foo-bar-baz-qux).
Summaries of summaries (such as summarise(b = sum(a), c = sum(b))) are now computed using standard evaluation for simplicity and correctness, but slightly slower (#3233).
syms() are now exported.
syms() construct symbols from strings or character vectors. The expr() variants are equivalent to enquo() but return simple expressions rather than quosures. They support quasiquotation.
dplyr now depends on the new tidyselect package to power
pull()and their variants (#2896). Consequently
rename_vars()are soft-deprecated and will start issuing warnings in a future version.
Note that this only works in selecting functions because in other contexts strings and character vectors are ambiguous. For instance strings are a valid input in mutating operations and
mutate(df, "foo")creates a new column by recycling “foo” to the number of rows.
Hybrid evaluation simplifies
foo() (#3309). Hybrid functions can now be masked by regular R functions to turn off hybrid evaluation (#3255). The hybrid evaluator finds functions from dplyr even if dplyr is not attached (#3456).
Scoped select and rename functions (
rename_if() etc.) now work with grouped data frames, adapting the grouping as necessary (#2947, #3410).
group_by_at() can group by an existing grouping variable (#3351).
arrange_at() can use grouping variables (#3332).
transmute() no longer prints a message when including a group variable.
Better error message if dbplyr is not installed when accessing database backends (#3225).
Better error message in
..._join() when joining data frames with duplicate or
NA column names. Joining such data frames with a semi- or anti-join now gives a warning, which may be converted to an error in future versions (#3243, #3417).
Dedicated error message when trying to use columns of the
Compute variable names for joins in R (#3430).
Bumped Rcpp dependency to 0.12.15 to avoid imperfect detection of
NAvalues in hybrid evaluation fixed in RcppCore/Rcpp#790 (#2919).
Avoid cleaning the data mask, a temporary environment used to evaluate expressions. If the environment, in which e.g. a
mutate()expression is evaluated, is preserved until after the operation, accessing variables from that environment now gives a warning but still returns
CRAN release: 2017-09-28
CRAN release: 2017-09-09
nth(default = var),
first(default = var) and
last(default = var) fall back to standard evaluation in a grouped operation instead of triggering an error (#3045).
Semi- and anti-joins now preserve the order of left-hand-side data frame (#3089).
Grouping by character vectors is now faster (#2204).
CRAN release: 2017-07-20
- Move build-time vs. run-time checks out of
CRAN release: 2017-06-22
Fix C++ error that caused compilation to fail on mac cran (#2862)
Quosured symbols do not prevent hybrid handling anymore. This should fix many performance issues introduced with tidyeval (#2822).
CRAN release: 2017-06-09
Five new datasets provide some interesting built-in datasets to demonstrate dplyr verbs (#2094):
starwars dataset about starwars characters; has list columns
storms has the trajectories of ~200 tropical storms
band_instruments2 has some simple data to demonstrate joins.
as_tibble() is re-exported from tibble. This is the recommended way to create tibbles from existing data frames.
tbl_df() has been softly deprecated.
tribble() is now imported from tibble (#2336, @chrMongeau); this is now preferred to
dplyr no longer messages that you need dtplyr to work with data.table (#2489).
summarise_each_q() functions have been removed.
failwith(). I’m not even sure why it was here.
This version of dplyr includes some major changes to how database connections work. By and large, you should be able to continue using your existing dplyr database code without modification, but there are two big changes that you should be aware of:
Almost all database related code has been moved out of dplyr and into a new package, dbplyr. This makes dplyr simpler, and will make it easier to release fixes for bugs that only affect databases.
src_sqlite() will still live in dplyr so your existing code continues to work.
It is no longer necessary to create a remote “src”. Instead you can work directly with the database connection returned by DBI. This reflects the maturity of the DBI ecosystem. Thanks largely to the work of Kirill Muller (funded by the R Consortium) DBI backends are now much more consistent, comprehensive, and easier to use. That means that there’s no longer a need for a layer in between you and DBI.
If you’ve implemented a database backend for dplyr, please read the backend news to see what’s changed from your perspective (not much). If you want to ensure your package works with both the current and previous version of dplyr, see
wrap_dbplyr_obj() for helpers.
Error messages and explanations of data frame inequality are now encoded in UTF-8, also on Windows (#2441).
Joins now always reencode character columns to UTF-8 if necessary. This gives a nice speedup, because now pointer comparison can be used instead of string comparison, but relies on a proper encoding tag for all strings (#2514).
group_vars() generic that returns the grouping as character vector, to avoid the potentially lossy conversion to language symbols. The list returned by
group_by_prepare() now has a new
group_names component (#1950, #2384).
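A short sketch of group_vars() on a grouped copy of mtcars (the choice of grouping columns is arbitrary):

```r
library(dplyr)

grouped <- group_by(mtcars, cyl, am)

# The grouping comes back as a plain character vector rather than
# a list of symbols.
grouping <- group_vars(grouped)
```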
transmute() now have scoped variants (verbs suffixed with
summarise_if(), etc, these variants apply an operation to a selection of variables.
The scoped verbs taking predicates (
summarise_if(), etc) now support S3 objects and lazy tables. S3 objects should implement methods for
tbl_vars(). For lazy tables, the first 100 rows are collected and the predicate is applied on this subset of the data. This is robust for the common case of checking the type of a column (#2129).
Summarise and mutate colwise functions pass
... on to the manipulation functions.
dplyr has a new approach to non-standard evaluation (NSE) called tidyeval. It is described in detail in
vignette("programming") but, in brief, gives you the ability to interpolate values in contexts where dplyr usually works with expressions:
my_var <- quo(homeworld)
starwars %>% group_by(!!my_var) %>% summarise_at(vars(height:mass), mean, na.rm = TRUE)
This means that the underscored version of each main verb is no longer needed, and so these functions have been deprecated (but remain around for backward compatibility).
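The same interpolation is what makes it possible to wrap verbs in your own functions. The helper below, `mean_by()`, is a hypothetical name introduced only for this sketch; it captures its arguments with enquo() and unquotes them with !!:

```r
library(dplyr)

# Hypothetical helper: group by one column and average another.
mean_by <- function(data, group_var, value_var) {
  group_var <- enquo(group_var)
  value_var <- enquo(value_var)
  data %>%
    group_by(!!group_var) %>%
    summarise(mean = mean(!!value_var, na.rm = TRUE))
}

mpg_by_cyl <- mean_by(mtcars, cyl, mpg)
```

Callers pass bare column names, just as they would to the verbs themselves.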
sample_frac() now use tidyeval to capture their arguments by expression. This makes it possible to use unquoting idioms (see
vignette("programming")) and fixes scoping issues (#2297).
Most verbs taking dots now ignore the last argument if empty. This makes it easier to copy lines of code without having to worry about deleting trailing commas (#1039).
[API] The new
.env environments can be used inside all verbs that operate on data:
.data$column_name accesses the column
.env$var accesses the external variable
var. Columns or external variables named
.env are shadowed, use
.env$... to access them. (
.data implements strict matching also for the
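A minimal sketch of the two pronouns, assuming an external variable that happens to share its name with a column. The value 6 is arbitrary:

```r
library(dplyr)

# An external variable with the same name as a column of mtcars.
cyl <- 6

# .data$cyl is the column; .env$cyl is the variable defined above.
six_cyl <- mtcars %>% filter(.data$cyl == .env$cyl)
```

Without the pronouns, `cyl == cyl` would compare the column with itself and keep every row.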
global() functions have been removed. They were never documented officially. Use the new
Expressions in verbs are now interpreted correctly in many cases that failed before (e.g., use of
case_when(), nonstandard evaluation, …). These expressions are now evaluated in a specially constructed temporary environment that retrieves column data on demand with the help of the
bindrcpp package (#2190). This temporary environment poses restrictions on assignments using
<- inside verbs. To prevent leaking of broken bindings, the temporary environment is cleared after the evaluation (#2435).
xxx_join.tbl_df(na_matches = "never") treats all
NA values as different from each other (and from any other value), so that they never match. This corresponds to the behavior of joins for database sources, and of database joins in general. To match
na_matches = "na" to the join verbs; this is only supported for data frames. The default is
na_matches = "na", kept for the sake of compatibility to v0.5.0. It can be tweaked by calling
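The effect of the two settings can be sketched with a pair of small, made-up tables in which both keys contain a missing value. This assumes data frame inputs and passes `na_matches` directly to the join:

```r
library(dplyr)

x <- tibble(key = c(1, NA), x_val = c("x1", "x2"))
y <- tibble(key = c(1, NA), y_val = c("y1", "y2"))

# Default ("na"): the NA key in x matches the NA key in y.
na_matched <- left_join(x, y, by = "key")

# "never": NA keys match nothing, as in a database join.
na_unmatched <- left_join(x, y, by = "key", na_matches = "never")
```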
Anti- and semi-joins warn if factor levels are inconsistent (#2741).
Warnings about join column inconsistencies now contain the column names (#2728).
For selecting variables, the first selector decides if it’s an inclusive selection (i.e., the initial column list is empty), or an exclusive selection (i.e., the initial column list contains all columns). This means that
select(mtcars, contains("am"), contains("FOO"), contains("vs")) now returns again both
vs columns like in dplyr 0.4.3 (#2275, #2289, @r2evans).
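A sketch of the rule on mtcars: a positive first selector starts from an empty column list, while a negative first selector starts from all columns. The second call is an added illustration, not taken from the note above:

```r
library(dplyr)

# Inclusive: starts empty, so a selector that matches nothing
# ("FOO") does not wipe out the other matches.
inclusive <- select(mtcars, contains("am"), contains("FOO"), contains("vs"))

# Exclusive: starts from all columns and removes mpg.
exclusive <- select(mtcars, -mpg)
```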
Select helpers now throw an error if called when no variables have been set (#2452)
copy_to() now returns its output invisibly (since you’re often just calling for the side-effect).
combine() are more strict when coercing. Logical values are no longer coerced to integer and numeric. Date, POSIXct and other integer or double-based classes are no longer coerced to integer or double as there is a chance of attributes or information being lost (#2209, @zeehio).
bind_cols() now accept vectors. They are treated as rows by the former and columns by the latter. Rows require inner names like
c(col1 = 1, col2 = 2), while columns require outer names:
col1 = c(1, 2). Lists are still treated as data frames but can be spliced explicitly with
%in% gets new hybrid handler (#126).
Fixed segmentation faults in hybrid evaluation of
lag(). These functions now always fall back to the R implementation if called with arguments that the hybrid evaluator cannot handle (#948, #1980).
Many error messages are more helpful by referring to a column name or a position in the argument list (#2448).
tbl_vars() now has a
group_vars argument set to
TRUE by default. If
FALSE, group variables are not returned.
strict argument to control if an error is thrown when you try and rename a variable that doesn’t exist.
Fixed very rare case of false match during join (#2515).
dplyr now warns on load when the version of R or Rcpp during installation is different to the currently installed version (#2514).
Fixed rare error that could lead to a segmentation fault in
all_equal(ignore_col_order = FALSE)(#2502).
All operations that return tibbles now include the
"tbl"class. This is important for correct printing with tibble 1.3.1 (#2789).
Makeflags uses PKG_CPPFLAGS for defining preprocessor macros.
Update RStudio project settings to install tests (#1952).
Rcpp::interfaces()to register C callable interfaces, and registering all native exported functions via
useDynLib(.registration = TRUE)(#2146).
Formatting of grouped data frames now works by overriding the
tbl_sum() generic instead of
print(). This means that the output is more consistent with tibble, and that
format() is now supported also for SQL sources (#2781).
CRAN release: 2016-06-24
distinct() now only keeps the distinct variables. If you want to return all variables (using the first row for non-distinct values) use
.keep_all = TRUE (#1110). For SQL sources,
.keep_all = FALSE is implemented using
GROUP BY, and
.keep_all = TRUE raises an error (#1937, #1942, @krlmlr). (The default behaviour of using all variables when none are specified remains - this note only applies if you select some variables).
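For a local data frame, the difference looks roughly like this (mtcars and the `cyl` column are chosen only for illustration):

```r
library(dplyr)

# Only the variable used for distinctness is kept.
cyl_only <- distinct(mtcars, cyl)

# .keep_all = TRUE keeps every column, taking the first row
# found for each distinct value of cyl.
cyl_all <- distinct(mtcars, cyl, .keep_all = TRUE)
```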
The select helper functions
ends_with() etc are now real exported functions. This means that you’ll need to import those functions if you’re using them from a package where dplyr is not attached, i.e.
dplyr::select(mtcars, starts_with("m")) used to work, but now you’ll need
The long deprecated
%.% have been removed. Please use
Outdated benchmarking demos have been removed (#1487).
Code related to starting and signalling clusters has been moved out to multidplyr.
near(x, y) is a helper for
abs(x - y) < tol (#1607).
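A small sketch of why the helper is useful: floating point arithmetic makes exact equality unreliable, while near() compares within a tolerance.

```r
library(dplyr)

x <- sqrt(2)^2

# Exact comparison fails because of floating point rounding.
exact <- x == 2

# near() compares within a small tolerance instead.
approx <- near(x, 2)
```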
A new family of functions replace
mutate_each() (which will thus be deprecated in a future release).
mutate_all() apply a function to all columns while
mutate_at() operate on a subset of columns. These columns are selected with either a character vector of column names, a numeric vector of column positions, or a column specification with
select() semantics generated by the new
columns() helper. In addition,
mutate_if() take a predicate function or a logical vector (these verbs currently require local sources). All these functions can now take ordinary functions instead of a list of functions generated by
funs() (though this is only useful for local sources). (#1845, @lionel-)
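The three flavours can be sketched as follows, assuming a dplyr version that still provides these scoped verbs. The data frame `df` is made up, columns are chosen by a character vector of names rather than a helper, and ordinary functions are passed directly:

```r
library(dplyr)

df <- tibble(a = c(1.26, 2.74), b = c(3.51, 4.49), label = c("u", "v"))

# _if: apply round() to every column that satisfies the predicate.
rounded_numeric <- mutate_if(df, is.numeric, round)

# _at: apply round() only to the columns named in the character vector.
rounded_a <- mutate_at(df, "a", round)

# _all: apply a function to every column of the input.
doubled <- mutate_all(select(df, a, b), function(x) x * 2)
```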
All data table related code has been separated out into a new dtplyr package. This decouples the development of the data.table interface from the development of the dplyr package. If both data.table and dplyr are loaded, you’ll get a message reminding you to load dtplyr.
[[ methods that never do partial matching (#1504), and throw an error if the variable does not exist.
all_equal() allows you to compare data frames ignoring row and column order, and optionally ignoring minor differences in type (e.g. int vs. double) (#821). The test handles the case where the df has 0 columns (#1506). The test fails when convert is
FALSE and types don’t match (#1484).
The internals of
as_data_frame() have been aligned, so
as_data_frame() will now automatically recycle length-1 vectors. Both functions give more informative error messages if you attempt to create an invalid data frame. You can no longer create a data frame with duplicated names (#820). Both check for
POSIXlt columns, and tell you to use
print.tbl_df() is considerably faster if you have very wide data frames. It will now also only list the first 100 additional variables not already on screen - control this with the new
print() (#1161). When printing a grouped data frame the number of groups is now printed with thousands separators (#1398). The type of list columns is correctly printed (#1379).
setOldClass(c("tbl_df", "tbl", "data.frame"))to help with S4 dispatch (#969).
tbl_dfautomatically generates column names (#1606).
tbl_cubes are now constructed correctly from data frames, duplicate dimension values are detected, missing dimension values are filled with
NA. The construction from data frames now guesses the measure variables by default, and allows specification of dimension and/or measure variables (#1568, @krlmlr).
Swap order of
matrix) for consistency with
as.tbl_cube.data.frame. Also, the
as.tbl_cube.tablenow defaults to
"Freq"for consistency with
The backend testing system has been improved. This led to the removal of
temp_srcs(). In the unlikely event that you were using this function, you can instead use
src_memdb()is a session-local in-memory SQLite database.
data_frame(), but creates a new table in that database.
filter.tbl_sql()now puts parens around each argument (#934).
-is better translated (#1002).
escape.POSIXt()method makes it easier to use date times. The date is rendered in ISO 8601 format in UTC, which should work in most databases (#857).
This version includes an almost total rewrite of how dplyr verbs are translated into SQL. Previously, I used a rather ad-hoc approach, which tried to guess when a new subquery was needed. Unfortunately this approach was fraught with bugs, so in this version I’ve implemented a much richer internal data model. Now there is a three step process:
When applied to a
tbl_lazy, each dplyr verb captures its inputs and stores in a
op(short for operation) object.
sql_build() iterates through the operations to build up an object that represents a SQL query. These objects are convenient for testing as they are lists, and are backend agnostic.
sql_render()iterates through the queries and generates the SQL, using generics (like
sql_select()) that can vary based on the backend.
In the short-term, this increased abstraction is likely to lead to some minor performance decreases, but the chance of dplyr generating correct SQL is much much higher. In the long-term, these abstractions will make it possible to write a query optimiser/compiler in dplyr, which would make it possible to generate much more succinct queries.
If you have written a dplyr backend, you’ll need to make some minor changes to your package:
sql_join()has been considerably simplified - it is now only responsible for generating the join query, not for generating the intermediate selects that rename the variable. Similarly for
sql_semi_join(). If you’ve provided new methods in your backend, you’ll need to rewrite.
select_query()gains a distinct argument which is used for generating queries for
distinct(). It loses the
offsetargument which was never used (and hence never tested).
src_translate_env()has been replaced by
sql_translate_env()which should have methods for the connection object.
There were two other tweaks to the exported API, but these are less likely to affect anyone.
partial_eval()got a new API: now use connection + variable names, rather than a
tbl. This makes testing considerably easier.
translate_sql_q()has been renamed to
Also note that the SQL generation generics now have a default method, instead of methods for DBIConnection and NULL.
select()now informs you that it adds missing grouping variables (#1511). It works even if the grouping variable has a non-syntactic name (#1138). Negating a failed match (e.g.
select(mtcars, -contains("x"))) returns all columns, instead of no columns (#1176)
The naming behaviour of
mutate_each()has been tweaked so that you can force inclusion of both the function and the variable name:
summarise_each(mtcars, funs(mean = mean), everything())(#442).
mutate()handles factors that are all
NA(#1645), or have different levels in different groups (#1414). It disambiguates
NaN(#1448), and silently promotes groups that only contain
NA(#1463). It deep copies data in list columns (#1643), and correctly fails on incompatible columns (#1641).
mutate() on grouped data no longer groups grouping attributes (#1120).
rowwise()mutate gives expected results (#1381).
bind_rows() handles 0-length named lists (#1515), promotes factors to characters (#1538), and warns when binding factor and character (#1485). bind_rows() is more flexible in the way it can accept data frames, lists, list of data frames, and list of lists (#1389).
Joins now use correct class when joining on
POSIXctcolumns (#1582, @joel23888), and consider time zones (#819). Joins handle a
by that is empty (#1496), or has duplicates (#1192). Suffixes grow progressively to avoid creating repeated column names (#1460). Joins on string columns should be substantially faster (#1386). Extra attributes are ok if they are identical (#1636). Joins work correctly when factor levels are not equal (#1712, #1559). Anti- and semi-joins give the correct result when the by variable is a factor (#1571), but warn if factor levels are inconsistent (#2741). A clear error message is given for joins where an explicit
bycontains unavailable columns (#1928, #1932). Warnings about join column inconsistencies now contain the column names (#2728).
There were a number of fixes to enable joining of data frames that don’t have the same encoding of column names (#1513), including working around bug 16885 regarding
match()in R 3.3.0 (#1806, #1810, @krlmlr).
lag() received a considerable overhaul. They are more careful about more complicated expressions (#1588), and fall back more readily to pure R evaluation (#1411). They behave correctly in
summarise() (#1434), and handle default values for string columns.
n_distinct()uses multiple arguments for data frames (#1084), falls back to R evaluation when needed (#1657), reverting decision made in (#567). Passing no arguments gives an error (#1957, #1959, @krlmlr).
Hybrid evaluation leaves formulas untouched (#1447).
CRAN release: 2015-09-01
Until now, dplyr’s support for non-UTF8 encodings has been rather shaky. This release brings a number of improvements to fix these problems: it’s probably not perfect, but should be a lot better than the previous version. This includes fixes to
distinct() (#1179), and joins (#1315).
print.tbl_df() also received a fix for strings with invalid encodings (#851).
[.tbl_dfis more careful about subsetting column names (#1245).
orderedattribute of factors (#1112), and does better at comparing
POSIXcts (#1125). The
tzattribute is ignored when determining if two
POSIXctvectors are comparable. If the
tz of all inputs is the same, it’s used, otherwise it’s set to
print.grouped_df()now tells you how many groups there are.
mutate()on grouped data handles the special case where for the first few groups, the result consists of a
logicalvector with only
NA. This can happen when the condition of an
ifelseis an all
NAlogical vector (#958).
More explicit duplicated column name error message (#996).
Hybrid evaluation does not take place for objects with a class (#1237).
mutatecan set to
NULLthe first column (used to segfault, #1329).
filteron grouped data handles indices correctly (#880).
CRAN release: 2015-06-16
This is a minor release containing fixes for a number of crashes and issues identified by R CMD CHECK. There is one new “feature”: dplyr no longer complains about unrecognised attributes, and instead just copies them over to the output.
lead()for grouped data were confused about indices and therefore produced wrong results (#925, #937).
lag()once again overrides
lag()instead of just the default method
lag.default(). This is necessary due to changes in R CMD check. To use the lag function provided by another package, use
Fixed a number of memory issues identified by valgrind.
Improved performance when working with large number of columns (#879).
List-cols that contain data frames now print a slightly nicer summary (#1147)
Set operations give more useful error message on incompatible data frames (#903).
Workaround for using the constructor of
DataFrameon an unprotected object (#998)
CRAN release: 2015-01-08
bind_cols()efficiently bind a list of data frames by row or column.
combine()applies the same coercion rules to vectors (it works like
unlist()but is consistent with the
vignette("data_frames")describes dplyr functions that make it easier and faster to create and coerce data frames. It subsumes the old
vignette("two-table")describes how two-table verbs work in dplyr.
do()uses lazyeval to correctly evaluate its arguments in the correct environment (#744), and new
do_()is the SE equivalent of
do()(#718). You can modify grouped data in place: this is probably a bad idea but it’s sometimes convenient (#737).
do()on grouped data tables now passes in all columns (not all columns except grouping vars) (#735, thanks to @kismsu).
do()with database tables no longer potentially includes grouping variables twice (#673). Finally,
do()gives more consistent outputs when there are no rows or no groups (#625).
*_join(), you can now name only those variables that are different between the two tables, e.g.
inner_join(x, y, c("a", "b", "c" = "d")) (#682). If non-join columns are the same, dplyr will add
.y suffixes to distinguish the source (#655).
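A sketch of both behaviours with two made-up tables: `a` and `b` are shared keys, `c` in `x` is matched to `d` in `y`, and the non-join column `val` appears in both tables:

```r
library(dplyr)

x <- tibble(a = 1:2, b = c("p", "q"), c = c(10, 20), val = c("x1", "x2"))
y <- tibble(a = 1:2, b = c("p", "q"), d = c(10, 20), val = c("y1", "y2"))

# Unnamed entries are keys with the same name in both tables;
# the named entry joins x$c to y$d.
joined <- inner_join(x, y, by = c("a", "b", "c" = "d"))

# The clashing non-join column is disambiguated with suffixes.
names(joined)
```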
select()now implements a more sophisticated algorithm so if you’re doing multiples includes and excludes with and without names, you’re more likely to get what you expect (#644). You’ll also get a better error message if you supply an input that doesn’t resolve to an integer column position (#643).
Printing has received a number of small tweaks. All
print()methods invisibly return their input so you can interleave
print()statements into a pipeline to see interim results.
print() will print column names of 0-row data frames (#652), and will never print more than 20 rows (i.e.
options(dplyr.print_max) is now 20), not 100 (#710). Row names are now never printed since no dplyr method is guaranteed to preserve them (#669).
type_sum()gains a data frame method.
dplyr now requires RSQLite >= 1.0. This shouldn’t affect your code in any way (except that RSQLite now doesn’t need to be attached) but does simplify the internals (#622).
Joining factors with the same levels in the same order preserves the original levels (#675). Joining factors with non-identical levels generates a warning and coerces to character (#684). Joining a character to a factor (or vice versa) generates a warning and coerces to character. Avoid these warnings by ensuring your data is compatible before joining.
rbind_list()will throw an error if you attempt to combine an integer and factor (#751).
rbind()ing a column full of
NAs is allowed and just collects the appropriate missing value for the column type being collected (#493).
summarise()is more careful about
NA, e.g. the decision on the result type will be delayed until the first non-NA value is returned (#599). It will complain about loss-of-precision coercions, which can happen for expressions that return integers for some groups and doubles for others (#599).
A number of functions gained new or improved hybrid handlers:
%in%(#126). That means when you use these functions in a dplyr verb, we handle them in C++, rather than calling back to R, and hence improving performance.
filterreturns its input when it has no rows or no columns (#782).
filter.data.table()works if the table has a variable called “V1” (#615).
*_join()keeps columns in original order (#684). Joining a factor to a character vector doesn’t segfault (#688).
*_joinfunctions can now deal with multiple encodings (#769), and correctly name results (#855).
*_join.data.table()works when data.table isn’t attached (#786).
group_by()on a data table preserves original order of the rows (#623).
group_by()supports variables with more than 39 characters thanks to a fix in lazyeval (#705). It gives meaningful error message when a variable is not found in the data frame (#716).
min(.,na.rm = TRUE)works with
Dates built on numeric vectors (#755).
cume_dist()handle data frames with 0 rows (#762). They all preserve missing values (#774).
row_number()doesn’t segfault when giving an external variable with the wrong number of variables (#781).
group_indiceshandles the edge case when there are no variables (#867).
NAs introduced by coercion to integer range on 32-bit Windows (#2708).
CRAN release: 2014-10-04
data_frame()by @kevinushey is a nicer way of creating data frames. It never coerces column types (no more
stringsAsFactors = FALSE!), never munges column names, and never adds row names. You can use previously defined columns to compute new columns (#376).
setdiff() now have methods for data frames, data tables and SQL database tables (#93). They pass their arguments down to the base functions, which will ensure they raise errors if you pass in too many arguments.
anti_join()) now allow you to join on different variables in
ytables by supplying a named vector to
by. For example,
by = c("a" = "b")joins
You can now program with dplyr - every function that does non-standard evaluation (NSE) has a standard evaluation (SE) version ending in
_. This is powered by the new lazyeval package which provides all the tools needed to implement NSE consistently and correctly.
vignette("nse")for full details.
regroup()is deprecated. Please use the more flexible
funs_qhas been replaced with
%.%has been deprecated: please use
chain()is defunct. (#518)
filter.numeric()removed. Need to figure out how to reimplement with new lazy eval system.
src_monetdb()is now implemented in MonetDB.R, not dplyr.
Main verbs now have individual documentation pages (#519).
Examples now use
hflightsbecause it the variables have better names and there are a few interlinked tables (#562).
nycflights13are (once again) suggested packages. This means many examples will not work unless you explicitly install them with
install.packages(c("Lahman", "nycflights13"))(#508). dplyr now depends on Lahman 3.0.1. A number of examples have been updated to reflect modified field names (#586).
group_by()has more consistent behaviour when grouping by constants: it creates a new column with that value (#410). It renames grouping variables (#410). The first argument is now
.dataso you can create new groups with name x (#534).
mutate(data, a = NULL) removes the variable
a from the returned dataset (#462).
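For instance, on mtcars (used here only as a convenient stand-in for `data`):

```r
library(dplyr)

# Assigning NULL to an existing column drops it from the result.
without_mpg <- mutate(mtcars, mpg = NULL)
```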
one_of()selector: this allows you to select variables provided by a character vector (#396). It fails immediately if you give an empty pattern to
matches()(#481, @leondutoit). Fixed buglet in
select()so that you can now create variables called
Switched from RC to R6.
renamehandles grouped data (#640).
The db backend system has been completely overhauled in order to make it possible to add backends in other packages, and to support a much wider range of databases. See
vignette("new-sql-backend")for instruction on how to create your own (#568).
order_by()now works in conjunction with window functions in databases that support them.
All verbs now understand how to work with
AsIs(#453) objects. They all check that colnames are unique (#483), and are more robust when columns are not present (#348, #569, #600).
Hybrid evaluation bugs fixed:
Call substitution stopped too early when a sub expression contained a
tbl_dfobjects instead of raw
LazySubsetwas confused about input data size (#452).
Improved handling of encoding for column names (#636).
Improved handling of hybrid evaluation re $ and @ (#645).
Fix major omission in
grouped_dt()methods - I was accidentally doing a deep copy on every result :(
joining two data.tables now correctly dispatches to data table methods, and result is a data table (#470)
summarise.tbl_cube()works with single grouping variable (#480).
CRAN release: 2014-05-21
dplyr now imports
%>% from magrittr (#330). I recommend that you use this instead of
%.% because it is easier to type (since you can hold down the shift key) and is more flexible. With
%>%, you can control which argument on the RHS receives the LHS by using the pronoun
.. This makes
%>% more useful with base R functions because they don’t always take the data frame as the first argument. For example you could pipe
mtcars %>% xtabs( ~ cyl + vs, data = .)
Thanks to @smbache for the excellent magrittr package. dplyr only provides
%>% from magrittr, but it contains many other useful functions. To use them, load
library(magrittr). For more details, see
%.% will be deprecated in a future version of dplyr, but it won’t happen for a while. I’ve also deprecated
chain() to encourage a single style of dplyr usage: please use
do() has been completely overhauled. There are now two ways to use it, either with multiple named arguments or a single unnamed argument.
do() is equivalent to
plyr::dlply, except it always returns a data frame.
If you use named arguments, each argument becomes a list-variable in the output. A list-variable can contain any arbitrary R object so it’s particularly well suited for storing models.
library(dplyr)
models <- mtcars %>% group_by(cyl) %>% do(lm = lm(mpg ~ wt, data = .))
models %>% summarise(rsq = summary(lm)$r.squared)
If you use an unnamed argument, the result should be a data frame. This allows you to apply arbitrary functions to each group.
mtcars %>% group_by(cyl) %>% do(head(., 1))
Note the use of the
. pronoun to refer to the data in the current group.
do() also has an automatic progress bar. It appears if the computation takes longer than 5 seconds and lets you know (approximately) how much longer the job will take to complete.
dplyr 0.2 adds three new verbs:
glimpse()makes it possible to see all the columns in a tbl, displaying as much data for each variable as can be fit on a single line.
If you load plyr after dplyr, you’ll get a message suggesting that you load plyr first (#347).
group_by()now defaults to
add = FALSEso that it sets the grouping variables rather than adding to the existing list. I think this is how most people expected
group_byto work anyway, so it’s unlikely to cause problems (#385).
memoryvignette which discusses how dplyr minimises memory usage for local data frames (#198).
new-sql-backendvignette which discusses how to add a new SQL backend/source to dplyr.
changes()output more clearly distinguishes which columns were added or deleted.
explain()is now generic.
dplyr is more careful when setting the keys of data tables, so it never accidentally modifies an object that it doesn’t own. It also avoids unnecessary key setting which negatively affected performance. (#193, #255).
"comment"attribute is allowed (white listed) as well as names (#346).
hybrid versions of
na.rmargument (#168). This should yield substantial performance improvements for those functions.
Code adapted to Rcpp > 0.11.1
all.equal.data.frame from base is no longer bypassed. We now have
copy_to.src_mysql() now works on Windows (#323).
*_join() doesn't reorder column names (#324).
rbind_all() is stricter and only accepts a list of data frames (#288).
rbind_* propagates time zone information for
rbind_* is less strict about type promotion. The numeric Collecter allows collection of integer and logical vectors. The integer Collecter also collects logical values (#321).
sum correctly handles integer (under/over)flow (#308).
Join functions throw an error instead of crashing when there are no common variables between the data frames, and also give a better error message when only one data frame has a by variable (#371).
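The sketch below illustrates the first case: two data frames with no shared column names, joined without a by argument. The data frames and column names are made up for illustration, and the exact error message depends on the dplyr version, so the code only checks that an ordinary R error is signalled.

```r
library(dplyr)

x <- data.frame(id = 1:3, a = c("p", "q", "r"))
y <- data.frame(key = 1:3, b = c(TRUE, FALSE, TRUE))

# No shared column names and no `by`: the join should signal an R error
# rather than crash the session.
result <- tryCatch(
  inner_join(x, y),
  error = function(e) conditionMessage(e)
)
is.character(result)

# Naming the key columns explicitly makes the join well defined.
joined <- inner_join(x, y, by = c("id" = "key"))
```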
SQL translation always evaluates subsetting operators (
[[) locally (#318).
grouped_df_impl function errors if there are no variables to group by (#398).
n_distinct did not treat NA correctly in the numeric case (#384).
Some compiler warnings triggered by -Wall or -pedantic have been eliminated.
group_by only creates one group for NA (#401).
Hybrid evaluator did not evaluate expressions in the correct environment (#403).
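The two NA-related fixes in this release (#384, #401) can be checked with a small sketch. The data frame below is made up for illustration, and n_groups() is used only as a convenient way to count groups in a recent dplyr release.

```r
library(dplyr)

df <- data.frame(
  g = c("a", NA, NA, "b"),
  v = c(1, NA, NA, 2),
  stringsAsFactors = FALSE
)

# Rows with a missing key should fall into a single NA group,
# giving three groups in total: "a", "b" and NA.
groups <- n_groups(group_by(df, g))

# In a numeric vector, NA counts as one distinct value however often
# it appears, so the distinct values here are 1, 2 and NA.
distinct_values <- n_distinct(df$v)
```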
CRAN release: 2014-03-15
rbind_list() now handles missing values in factors (#279).
SQL joins now work better if names are duplicated in both x and y tables (#310).
Builds against Rcpp 0.11.1
Internal code is stricter when deciding if a data frame is grouped (#308): this avoids a number of situations which previously caused problems.
More data frame joins work with missing values in keys (#306).
CRAN release: 2014-02-24
select() is substantially more powerful. You can use named arguments to rename existing variables, and new functions
num_range() to select variables based on their names. It now also makes a shallow copy, substantially reducing its memory impact (#158, #172, #192, #232).
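A minimal sketch of renaming and name-based selection in one select() call. The data frame and the new name row_id are hypothetical, and the code is written against a recent dplyr release.

```r
library(dplyr)

df <- data.frame(
  id = 1:2,
  x1 = c(0.1, 0.2),
  x2 = c(1.5, 2.5),
  note = c("a", "b")
)

# A named argument renames while selecting; num_range("x", 1:2) picks
# the columns x1 and x2 by building their names from a prefix and a range.
picked <- select(df, row_id = id, num_range("x", 1:2))
names(picked)
```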
filter() now fails when given anything other than a logical vector, and correctly handles missing values (#249).
stats::filter() so you can continue to use filter() function with numeric inputs (#264).
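The data frame behaviour of filter() described above can be sketched as follows. The example data is made up, and the check assumes a recent dplyr release, where a non-logical condition on a data frame is rejected with an error.

```r
library(dplyr)

df <- data.frame(x = c(1, NA, 3))

# NA > 1 evaluates to NA, and filter() keeps only rows where the
# condition is TRUE, so the row with the missing value is dropped.
kept <- filter(df, x > 1)

# A numeric (non-logical) condition should be rejected with an error.
rejected <- tryCatch(
  {
    filter(df, x)
    FALSE
  },
  error = function(e) TRUE
)
```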
rbind_all() silently ignores data frames with 0 rows or 0 columns (#274).
Working towards Solaris compatibility.
Benchmarking vignette temporarily disabled due to microbenchmark problems reported by BDR.
CRAN release: 2014-01-29
benchmark-baseball vignette now contains fairer (including grouping times) comparisons with
filter() handles scalar results (#217) and better handles scoping, e.g.
variable is defined in the function that calls
filter. It also handles
F as aliases to
FALSE if there are no
F variables in the data or in the scope.
select.grouped_df fails when the grouping variables are not included in the selected variables (#170).
all.equal.data.frame() handles a corner case where the data frame has
dplyr source package no longer includes pandas benchmark, reducing download size from 2.8 MB to 0.5 MB.