A cube tbl stores data in a compact array format where dimension names are not needlessly repeated. Cube tbls are particularly appropriate for experimental data where all combinations of factors are tried (e.g. complete factorial designs), or for storing the result of aggregations. Compared to data frames, they occupy much less memory when variables are crossed, not nested.
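The memory claim can be checked with base R alone. The sketch below is only an illustration: it uses a plain array as a stand-in for a cube tbl and a made-up 20 x 20 x 20 crossed design, and compares it with the equivalent long data frame.

```r
# Hypothetical complete factorial design: 20 x 20 x 20 combinations
dims <- list(
  a = paste0("a", 1:20),
  b = paste0("b", 1:20),
  c = paste0("c", 1:20)
)

# Array form: each dimension label is stored once, in the dimnames
arr <- array(
  seq_len(20 * 20 * 20) / 10,
  dim = unname(lengths(dims)),
  dimnames = dims
)

# Data frame form: one row per combination, so labels repeat in every row
df <- as.data.frame(as.table(arr), responseName = "value")

array_bytes <- as.numeric(object.size(arr))
frame_bytes <- as.numeric(object.size(df))

# Ratio of data frame size to array size
frame_bytes / array_bytes
```

The exact ratio depends on the number of dimensions and the storage type of each column; the gap widens as more crossed dimensions are added.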

tbl_cube(dimensions, measures)

Arguments

dimensions

A named list of vectors. A dimension is a variable whose values are known before the experiment is conducted; they are fixed by design (in reshape2 they are known as id variables). tbl_cubes are dense, which means that almost every combination of the dimensions should have associated measurements: missing values require an explicit NA, so if the variables are nested, not crossed, the majority of the data structure will be empty. Dimensions are typically, but not always, categorical variables.

measures

A named list of arrays. A measure is something that is actually measured, and is not known in advance. The dim() of each array should match the lengths of the dimension vectors, in the same order. Measures are typically, but not always, continuous values.
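The shape requirement on measures can be sketched in base R. The helper measures_match_dimensions() below is hypothetical (it is not part of dplyr) and the data are made up; it only illustrates the rule stated above.

```r
# Hypothetical helper (not part of dplyr): TRUE when every measure is an
# array whose extents equal the dimension lengths, in the same order.
measures_match_dimensions <- function(dimensions, measures) {
  expected <- unname(lengths(dimensions))
  all(vapply(
    measures,
    function(m) is.array(m) && identical(unname(dim(m)), expected),
    logical(1)
  ))
}

# Made-up design: 3 treatments crossed with 4 days
dimensions <- list(
  treatment = c("control", "low", "high"),
  day = 1:4
)
measures <- list(
  weight = array(seq(50, 61), dim = c(3, 4)),
  height = array(seq(150, 161), dim = c(3, 4))
)
ok <- measures_match_dimensions(dimensions, measures)

# A transposed array has the right number of cells but the wrong shape
bad <- list(weight = array(seq(50, 61), dim = c(4, 3)))
not_ok <- measures_match_dimensions(dimensions, bad)
```

Lists that pass this check have the shape that tbl_cube(dimensions, measures) expects.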

Details

tbl_cube support is currently experimental and little performance optimisation has been done, but you may find it useful if your data already comes in this form, or if you struggle with the memory overhead of sparse/crossed data frames. There is no support for hierarchical indices (although I think that would be a relatively straightforward extension to storing data frames for indices rather than vectors).

Implementation

Manipulation functions:

  • select() (M)

  • summarise() (M), corresponds to roll-up, but rather more limited since there are no hierarchies.

  • filter() (D), corresponds to slice/dice.

  • mutate() (M) is not implemented, but should be relatively straightforward given the implementation of summarise.

  • arrange() (D?) is not implemented: it is not obvious how much sense it would make.
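For readers unfamiliar with the cube vocabulary, slice/dice and roll-up can be illustrated on a plain base R array. This is only a sketch with made-up values; it is not how dplyr implements filter() or summarise() for tbl_cubes.

```r
# Toy cube: 3 latitudes x 4 longitudes x 2 years (values are made up)
lat <- c(-10, 0, 10)
long <- c(100, 110, 120, 130)
year <- c(1999L, 2000L)
temperature <- array(
  seq_len(3 * 4 * 2),
  dim = c(3, 4, 2),
  dimnames = list(lat = lat, long = long, year = year)
)

# Slice/dice: one logical condition per dimension, so the result
# is always a rectangular sub-array
sliced <- temperature[lat > 0, , year == 2000L, drop = FALSE]
dim(sliced)

# Roll-up: collapse `year` by averaging within each (lat, long) cell
rolled <- apply(temperature, c(1, 2), mean)
dim(rolled)
```

Because each condition indexes exactly one axis, a condition that compares two dimensions (such as lat > long) has no equivalent here, which is why filter() rejects it.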

Joins: not implemented. See vignettes/joins.graffle for ideas. Probably straightforward if you get the indexes right, and that's probably some straightforward array/tensor operation.

See also

as.tbl_cube() for ways of coercing existing data structures into a tbl_cube.

Examples

# The built in nasa dataset records meteorological data (temperature,
# cloud cover, ozone etc) for a 4d spatio-temporal dataset (lat, long,
# month and year)
nasa
#> Source: local array [41,472 x 4]
#> D: lat [dbl, 24]
#> D: long [dbl, 24]
#> D: month [int, 12]
#> D: year [int, 6]
#> M: cloudhigh [dbl]
#> M: cloudlow [dbl]
#> M: cloudmid [dbl]
#> M: ozone [dbl]
#> M: pressure [dbl]
#> M: surftemp [dbl]
#> M: temperature [dbl]
head(as.data.frame(nasa))
#>        lat   long month year cloudhigh cloudlow cloudmid ozone pressure
#> 1 36.20000 -113.8     1 1995      26.0      7.5     34.5   304      835
#> 2 33.70435 -113.8     1 1995      20.0     11.5     32.5   304      940
#> 3 31.20870 -113.8     1 1995      16.0     16.5     26.0   298      960
#> 4 28.71304 -113.8     1 1995      13.0     20.5     14.5   276      990
#> 5 26.21739 -113.8     1 1995       7.5     26.0     10.5   274     1000
#> 6 23.72174 -113.8     1 1995       8.0     30.0      9.5   264     1000
#>   surftemp temperature
#> 1    272.7       272.1
#> 2    279.5       282.2
#> 3    284.7       285.2
#> 4    289.3       290.7
#> 5    292.2       292.7
#> 6    294.1       293.6
titanic <- as.tbl_cube(Titanic)
head(as.data.frame(titanic))
#>   Class    Sex   Age Survived Freq
#> 1   1st   Male Child       No    0
#> 2   2nd   Male Child       No    0
#> 3   3rd   Male Child       No   35
#> 4  Crew   Male Child       No    0
#> 5   1st Female Child       No    0
#> 6   2nd Female Child       No    0
admit <- as.tbl_cube(UCBAdmissions)
head(as.data.frame(admit))
#>      Admit Gender Dept Freq
#> 1 Admitted   Male    A  512
#> 2 Rejected   Male    A  313
#> 3 Admitted Female    A   89
#> 4 Rejected Female    A   19
#> 5 Admitted   Male    B  353
#> 6 Rejected   Male    B  207
as.tbl_cube(esoph, dim_names = 1:3)
#> Source: local array [96 x 3]
#> D: agegp [ord, 6]
#> D: alcgp [ord, 4]
#> D: tobgp [ord, 4]
#> M: ncases [dbl]
#> M: ncontrols [dbl]
# Some manipulation examples with the NASA dataset --------------------------
# select() operates only on measures: it doesn't affect dimensions in any way
select(nasa, cloudhigh:cloudmid)
#> Source: local array [41,472 x 4]
#> D: lat [dbl, 24]
#> D: long [dbl, 24]
#> D: month [int, 12]
#> D: year [int, 6]
#> M: cloudhigh [dbl]
#> M: cloudlow [dbl]
#> M: cloudmid [dbl]
select(nasa, matches("temp"))
#> Source: local array [41,472 x 4]
#> D: lat [dbl, 24]
#> D: long [dbl, 24]
#> D: month [int, 12]
#> D: year [int, 6]
#> M: surftemp [dbl]
#> M: temperature [dbl]
# filter() operates only on dimensions
filter(nasa, lat > 0, year == 2000)
#> Source: local array [4,320 x 4]
#> D: lat [dbl, 15]
#> D: long [dbl, 24]
#> D: month [int, 12]
#> D: year [int, 1]
#> M: cloudhigh [dbl]
#> M: cloudlow [dbl]
#> M: cloudmid [dbl]
#> M: ozone [dbl]
#> M: pressure [dbl]
#> M: surftemp [dbl]
#> M: temperature [dbl]
# Each component can only refer to one dimension, ensuring that you always
# create a rectangular subset
# NOT RUN {
filter(nasa, lat > long)
# }
# Arrange is meaningless for tbl_cubes

by_loc <- group_by(nasa, lat, long)
summarise(by_loc, pressure = max(pressure), temp = mean(temperature))
#> Source: local array [576 x 2]
#> D: lat [dbl, 24]
#> D: long [dbl, 24]
#> M: pressure [dbl]
#> M: temp [dbl]