Understanding [.data.frame
Posted on January 24, 2015Welcome back! Last post,
I looked at head.default
and tail.default
, and gave them
a little bit of a hard time, but ultimately there’s nothing
seriously wrong with their implementations. Today, we get to
have a lot more fun. Fasten your seatbelts, because this will
be a bumpy ride.
The usual preamble: [.data.frame
is very useful, nice and
terse, and ‘makes sense’ once you adopt the R
mindset.
Get rows with df[x, ]
, get column(s?) with df[, x]
, or
get columns with just plain old df[x]
. Seems intuitive,
on the face of it.
Intuitively, [.data.frame
seems like it shouldn’t be too
hard to implement. For the numeric types, we’re trying to
get rows or columns by index; for logical types, we get the
rows or columns for which the vector elements are TRUE
;
for character vetors, we match on names
or row.names
.
In addition, data.frame
s should only contain atomic
vectors – it is possible to stuff other objects in, but
that is usually quite rare and prone to break in other,
surprising, ways, and so is generally discouraged.
Seems like there isn’t that much room for ugliness, right?
We’ll start with one bit that is really subtle. We’ll call
the subsetting indices as df[i, j]
; that is, i
denotes
a vector used for row subsetting, while j
denotes a vector
used for column subsetting. Why is it
that the following two statements can give different results?
df[x, ]
df[x]
In the first case, we are explicitly setting the j
argument
to be missing; in the second case, it is implicitly missing.
It seems surprising, but [.data.frame
does an ad-hoc dispatch
based on the number of implicitly missing arguments. How does it
do that?
Can we use the missing()
function to help us out here?
f <- function(i, j) cbind(missing(i), missing(j))
rbind(
f(1), ## implicitly missing
f(1,) ## explicitly missing
)
[,1] [,2]
[1,] FALSE TRUE
[2,] FALSE TRUE
Well, that’s not helpful – R
still believes that the j
argument is missing in each case (which it is). But how do we differentiate
between this notion of implicit and explicit missingness?
Let’s look at the first few lines of [.data.frame
to find
out what really happens behind the scenes.
function (x, i, j, drop = if (missing(i)) TRUE else length(cols) ==
1)
{
mdrop <- missing(drop)
Narg <- nargs() - (!mdrop)
has.j <- !missing(j)
if (!all(names(sys.call()) %in% c("", "drop")) && !isS4(x))
warning("named arguments other than 'drop' are discouraged")
if (Narg < 3L) {
< ... snip ... >
This is made more complicated by the presence of the drop
argument, so try to ignore that bit. What R
actually does
is ‘dispatches’ based on the number of args. Let’s see:
f <- function(x, y) nargs()
rbind(f(), ## nothing passed
f(1), ## single argument
f(1,), ## argument + explicit missing
f(,)) ## both explicit missing
[,1]
[1,] 0
[2,] 1
[3,] 2
[4,] 2
In other words, nargs()
does not include
implicitly missing arguments when it counts the number
of arguments passed to the function.
Notice that nargs()
in f()
returns zero
(all arguments are implicitly missing), while nargs()
in
f(,)
returns two (all arguments are explicitly missing;
it’s as though we had really called f(<missing>, <missing>)
).
Okay, so we understand how [.data.frame
can discriminate
between df[i]
and df[i, ]
, because nargs()
gives us
a funny way to poke at that. Now, let’s look at the first
branch, where Nargs < 3
– implying a df[i]
call.
From here on, we’re going to start walking line-by-line
through the code in [.data.frame
. Just for posterity,
this is the version of R
I’m running – it’s possible
that [.data.frame
might look slightly different on your
computer:
Session info
------------
setting value
version R version 3.1.2 (2014-10-31)
system x86_64, darwin13.4.0
ui RStudio (0.98.1091)
language (EN)
collate en_CA.UTF-8
tz America/Vancouver
I’ll write my own comments above the code used for
[.data.frame
, explaining what’s going on each step of
the way.
## This check implies the call is of the form
## `df[i]` or `df[i, drop = .`, so we want to do
## some kind of column subsetting.
if (Narg < 3L) {
## Ignore 'drop' -- when doing single-element
## subsetting, we always return a data.frame
if (!mdrop)
warning("'drop' argument will be ignored")
## If there was no `i` argument, implying a call
## like `df[]`, just return `df`.
if (missing(i))
return(x)
## If `i` is a matrix, convert `x` to a matrix
## (!?) and then subset that matrix with `i`. I
## still do not understand matrix subsetting of
## data.frames (or matrix subsetting of
## matrices, for that matter...)
if (is.matrix(i))
return(as.matrix(x)[i])
## Get the names, and ensure it's a character
## vector
nm <- names(x)
if (is.null(nm))
nm <- character()
## If i is not a character (ie, numeric or
## logical subsetting...), AND one (or more) of
## the names is NA...
if (!is.character(i) && anyNA(nm)) {
## Replace the names with indices (1 to ncol(x))
names(nm) <- names(x) <- seq_along(x)
## Call the next method after `[.data.frame`. In
## this case, that implies calling an internal
## generic; this effectively amounts to calling
## `as.list(x)[i]`, but this gets done in internal
## C code (there is no `[.list` S3 method
## registered). This is the joy of 'internal
## generics'! The intention of this code is likely
## to avoid having to actually call `as.list(x)`,
## which could force a deep copy.
##
## NB: As Hadley pointed out, this could simply
## have been `.subset2()`, which would have been
## more readable (and, in fact, is used in other
## parts of this function anyway!)
y <- NextMethod("[")
## Get and validate the names
cols <- names(y)
if (anyNA(cols))
stop("undefined columns selected")
cols <- names(y) <- nm[cols]
}
else {
## Effectively the same as above, expect with
## marginally different error checks. Seems like
## this could have been consolidated...
y <- NextMethod("[")
cols <- names(y)
if (!is.null(cols) && anyNA(cols))
stop("undefined columns selected")
}
## Fix up column names
if (anyDuplicated(cols))
names(y) <- make.unique(cols)
## Set 'automatic', 'compressed' row names, and the
## class for our return.
attr(y, "row.names") <- .row_names_info(x, 0L)
attr(y, "class") <- oldClass(x)
return(y)
}
Although the dispatch was somewhat awkward, this mostly
seems sane. There is a bit of code duplication and weirdness
through calling NextMethod("[")
, but it remains
functional and understandable. Although terribly opaque,
NextMethod("[")
is the only way to get at that subsetting
primitive past [.data.frame
(as [.list
does not exist),
since list
is not really an S3 class, just a SEXP
type.
Let’s move on! The next branch focuses on what happens if
i
, our row subetting bit, is missing:
## If `i` is missing, e.g. in x[, j]...
if (missing(i)) {
## For some reason, we provide a shortcut for when:
##
## 1. drop is TRUE,
## 2. j is missing, and
## 3. x is a one-column data.frame.
##
## This amounts to being a funky way of extracting
## the first column from a single column data.frame,
## but it seems odd to specialize for that case ...
if (drop && !has.j && length(x) == 1L)
return(.subset2(x, 1L))
## Get names as non-null character vector.
nm <- names(x)
if (is.null(nm))
nm <- character()
## Do some more weirdness, similar to before, where
## we inline some awkward error handling over what is
## really just a call to `.subset(x, j)`.
if (has.j && !is.character(j) && anyNA(nm)) {
names(nm) <- names(x) <- seq_along(x)
y <- .subset(x, j)
cols <- names(y)
if (anyNA(cols))
stop("undefined columns selected")
cols <- names(y) <- nm[cols]
}
else {
y <- if (has.j)
.subset(x, j)
else x
cols <- names(y)
if (anyNA(cols))
stop("undefined columns selected")
}
## If, after subsetting, we have a single-column
## data.frame then do another awkward early return.
if (drop && length(y) == 1L)
return(.subset2(y, 1L))
## Fix up names.
if (anyDuplicated(cols))
names(y) <- make.unique(cols)
## Get the number of rows. Wait, why are we using
## .row_names_info instead of just plain `nrow`? If
## you look at how `nrow()` works for a `data.frame`,
## you'll see...
##
## nrow()
## -> dim()
## -> dim.data.frame()
## -> .row_names_info(, 2L)
##
## so it is just a shortcut to that dispatch.
nrow <- .row_names_info(x, 2L)
## For one-row data.frames, if drop is TRUE, try to
## return the row as a vector, letting structure
## handle coercion. Yuck!
if (drop && !mdrop && nrow == 1L)
return(structure(y, class = NULL, row.names = NULL))
## Otherwise, set the class, row.names, and return.
else {
attr(y, "class") <- oldClass(x)
attr(y, "row.names") <- .row_names_info(x, 0L)
return(y)
}
}
Phew! We’ve now seen the dispatches for:
df[i]
; ie, column subsetting, anddf[, j]
, ie, column subsetting withi
missing.
Now let’s get to the really fun stuff. Now, we need to see how row subsetting, and potentially column subsetting, gets performed.
## Let's make a 'copy' of 'x' and call it 'xx', because
## we're going to start mutating that object. Note that
## this should be a shallow copy (ie, we are just
## placing a new reference to the data structure
## pointed to by `x`, called `xx`)
xx <- x
cols <- names(xx)
## Replace x with a list of length x, containing the
## same attributes of `xx`. Really, this copies _all_
## attributes from `xx` to `x`.
x <- vector("list", length(x))
x <- .Internal(copyDFattr(xx, x))
## And then, now that we've copied all of the
## attributes, let's go ahead and clear out the
## data.frame specific ones, because ... I guess we
## need to refresh them later?
oldClass(x) <- attr(x, "row.names") <- NULL
## If `j` was specified (ie, this is `df[i, j]`)...
if (has.j) {
## Get names as character
nm <- names(x)
if (is.null(nm))
nm <- character()
## If any names are NA, reset the names as integers
if (!is.character(j) && anyNA(nm))
names(nm) <- names(x) <- seq_along(x)
## Oh! Now you know! We removed the class from `x` so
## that we could use `[`, and get back at that
## `[.list` primitive hiding in the C sources. So
## this is where we do column subsetting. This is
## still weird as heck though, since we're subsetting
## an _empty list_; ie, list whose elements are just
## NULL. I guess this is a way of getting the names
## of x in an appropriate order?
x <- x[j]
## Some more unreadable nonsense handling the special
## case of drop being true, and `x` being a length
## one vector.
cols <- names(x)
if (drop && length(x) == 1L) {
## Note that, within this branch, we no longer care
## what `x` is. So we just used that as some weird
## proxy subsetting object; now we go back to `xx`
## (which is the orginal `x` -- remmber how we
## copied it? Why on earth is this code so
## convoluted?)
if (is.character(i)) {
rows <- attr(xx, "row.names")
i <- pmatch(i, rows, duplicates.ok = TRUE)
}
xj <- .subset2(.subset(xx, j), 1L)
return(if (length(dim(xj)) != 2L) xj[i]
else xj[i, , drop = FALSE])
}
## More error checking + 'fixing up' of names and
## such.
if (anyNA(cols))
stop("undefined columns selected")
if (!is.null(names(nm)))
cols <- names(x) <- nm[cols]
## Get some index vectors, so that we can handle the
## column re-organizing / subsetting later.
nxx <- structure(seq_along(xx), names = names(xx))
## This vector governs the column indices, which we
## will see are used later when copying from `xx`
## back into `x`.
sxx <- match(nxx[j], seq_along(xx))
}
## When `j` is not supplied, this implies we want all
## columns without re-ordering -- so just take the
## indices from 1 to length(x).
else sxx <- seq_along(x)
## Okay, now let's look at row subsetting.
rows <- NULL
## If `i` is a character vector, figure out the indices
## by matching `i` to the row.names of `xx`. (Not `x`
## -- remember, we decided to turn that into a `list`
## to get at the `[` primitive)
if (is.character(i)) {
## Directly access the attribute, to avoid dispatch /
## other ugliness in `rownames()`, or `row.names()`.
rows <- attr(xx, "row.names")
i <- pmatch(i, rows, duplicates.ok = TRUE)
}
## For each vector in `x`, our empty list...
for (j in seq_along(x)) {
## Figure out the vector in `xx` that we want to put
## at the `j`th column.
xj <- xx[[sxx[j]]]
## If xj is not a 2D object (e.g. a matrix), then
## just mash it into x[[j]] (subsetting with `i` --
## there's the row subsetting kicking in!) Also,
## implicitly respect the `drop` attribute here.
##
## Note that we're now populating `x`, our 'output'
## data.frame, with elements of `xx`. So very
## roundabout!
if (length(dim(xj)) != 2L)
x[[j]] <- xj[i]
## Otherwise, call the same function, but with `drop`
## explicitly as FALSE (ignoring if it was set TRUE
## earlier). This way, we can stuff matrices (or
## other data.frames!) in without dropping
## attributes.
else
x[[j]] <- xj[i, , drop = FALSE]
}
Phew! I feel dizzy. Do you? We’re actually not quite done –
there are two more branches of code dealing with more messy
row.names
stuff, which isn’t worth going into.
Overall, the code itself is
pretty unreadable (one has to slow down and think to have
a hope of understanding of what’s going on).
What about performance? Surprisingly, there is nothing really
egregiously bad – the main culprit is the use of temporaries,
alongside many calls to [
. Instead, there is just a lot
of awkward error handling, alongside some somewhat questionable
methodology for performing the subsetting.
However, given that this is
implemented as a primitive (and R
has gotten better about
performing shallow copies when appropriate) there aren’t
too many needless allocations. And, it’s fairly easy to imagine
this having evolved over R
’s lifetime to something that was
once cute and tidy to something that has had to remain backwards
compatible as R
evolved.
That said, there’s nothing stopping us from implementing
the same functionality in a much more sane manner, so let’s
try doing this ourselves in Rcpp
. Of course, having the same
amount of flexibility will be tough, but let’s see what
we get specifically for an integer-integer subsetting case.
One could imagine implementing similar functionality for
different pairs of indexing, or forcing a final dispatch to
the integer-integer case.
This will be a slightly over simplified implementation that:
- Neglects most error checking,
- Only allows for integer vector subsetting,
- Only allows subsetting for four of the atomic types (numeric, integer, logical, string),
- Ignores
row.names
, and - Never drops.
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
SEXP subset_df(SEXP x,
IntegerVector row_indices,
IntegerVector column_indices) {
// Get some useful variables
// (lengths of index vectors)
int column_indices_n = column_indices.size();
int row_indices_n = row_indices.size();
// Translate from R to C indices.
// This could be optimized...
row_indices = row_indices - 1;
column_indices = column_indices - 1;
// Allocate the output 'data.frame'.
List output = no_init(column_indices_n);
// Set the names, based on the names of 'x'.
// We'll assume it has names.
CharacterVector x_names =
as<CharacterVector>(Rf_getAttrib(x, R_NamesSymbol));
// Use Rcpp's sugar subsetting to subset names.
output.attr("names") = x_names[column_indices];
// Loop and fill!
for (int j = 0; j < column_indices_n; ++j)
{
// Get the j'th element of 'x'. We don't need to
// PROTECT it since it's implicitly protected as a
// child of 'x'.
SEXP element = VECTOR_ELT(x, column_indices[j]);
// Get the 'rows' for that vector, and fill.
SEXP vec = PROTECT(
Rf_allocVector(TYPEOF(element), row_indices_n)
);
for (int i = 0; i < row_indices_n; ++i)
{
// Copying vectors is a pain in the butt, because
// we need to know the actual type underneath the
// SEXP. I'll just handle a couple of the main
// types. One could imagine simplifying this with
// some macro magic...
switch (TYPEOF(vec))
{
case REALSXP:
REAL(vec)[i] =
REAL(element)[row_indices[i]];
break;
case INTSXP:
case LGLSXP:
INTEGER(vec)[i] =
INTEGER(element)[row_indices[i]];
break;
case STRSXP:
SET_STRING_ELT(vec, i,
STRING_ELT(element, row_indices[i]));
break;
}
}
// Don't need to protect 'vec' anymore
UNPROTECT(1);
// And make sure the output list now
// refers to that vector!
SET_VECTOR_ELT(output, j, vec);
}
// Finally, copy the attributes of `x` to `output`...
Rf_copyMostAttrib(x, output);
// ... but set the row names manually. Note that this
// is the secret method for creating 'compact' row
// names, whereby internally R stores the 'empty' row
// names object as 'c(NA_integer_, -<nrow>)'.
Rf_setAttrib(output, R_RowNamesSymbol,
IntegerVector::create(NA_INTEGER, -row_indices_n));
return output;
}
Let’s see if it works. It should behave identically to
df[i, j, drop = FALSE]
, for some subset of arguments.
df <- data.frame(
x = 1:5,
y = letters[1:5],
z = c(TRUE, FALSE, FALSE, TRUE, FALSE),
stringsAsFactors = FALSE
)
subset_df(df, 1:3, 1:2)
x y
1 1 a
2 2 b
3 3 c
subset_df(df, c(1, 2, 5), c(1, 3))
x z
1 1 TRUE
2 2 FALSE
3 5 FALSE
all.equal(
subset_df(df, c(1, 2, 5), c(1, 3)),
df[c(1, 2, 5), c(1, 3), drop = FALSE]
)
[1] "Attributes: < Component \"row.names\": Mean relative difference: 0.6666667 >"
It looks identical, barring the difference with row.names
,
which we have explicitly decided to avoid.
What about performance? Note that I am somewhat intentionally
cheating because I don’t have any performance hit in
generating and populating the row.names
attribute – but
how much does it really matter?
library("microbenchmark")
df <- data.frame(
x = 1:1E2,
y = sample(letters, 1E2, TRUE)
)
microbenchmark(
R = df[5:10, 2, drop = FALSE],
Cpp = subset_df(df, 5:10, 2)
)
Unit: microseconds
expr min lq mean median uq max neval cld
R 80.358 83.324 89.56236 85.2925 90.503 309.587 100 b
Cpp 3.073 3.711 5.31664 5.5675 5.947 29.017 100 a
Yuck! R
is really, really slow when it comes to taking
small subsets of a large vector, or at least much slower
than it should be. (For some more musing on this topic,
see Extracting a single value from a data frame
in Hadley’s Advanced R
book.)
This is almost certainly due to the messy handling required
in generating and validating the row.names
attribute,
as well as checking and handling of duplicate names.
However, we get these huge gains because:
-
We avoid allocating a bunch of (small, but temporary and unneeded) objects at the
R
level, -
We avoid the excessive branching associated with the
drop
argument, -
We avoid unnecessary, repeated dispatches through
[
and directly access the memory we want to use – also permitting the compiler to make some optimizations as necessary,
More than performance, though, what really counts is that,
when uncommented, R
’s implementation of [.data.frame
is pretty unreadable. You can read the code online,
which even comes with some (albeit bare) comments. It seems
unfortunate that R
strips comments out of code included in
base
(and other packages), since they would be enormously
useful for understanding what’s going on. (Not the mention
the rather surprising formatting style…)
In the end, what we see is that [.data.frame
is really just
some calls to [
(as a primitive dispatching to the internal,
non-data.frame implementation), .subset()
, and .subset2()
.
You can see such a simplified implementation in
dplyr
– note the much more relatively clean
implementation that arises when the messiness of drop
and
row.names
is avoided!
For some finishing remarks – one of the guiding tenants of
C++
is:
You don’t pay for what you don’t use.
and this is a very nice thing to keep in mind when implementing
functions and defining interfaces. It’s useful to contrain
interfaces when possible, as it allows you to design your
data structures in such a way that the unnecessary cruft only
exists when it is actually needed.
Unfortunately, this is very often not true in R
, and
especially [.data.frame
, where every call forces you to
check the hoops of row.names
, drop
, and uniqueness of
names
, whether you care or not. Of course, the
priorities of R
and C++
differ greatly, but it is an
important thing to keep in mind when attempting to write
performant R
code.