Welcome back! Last post,
I looked at head.default and tail.default, and gave them
a little bit of a hard time, but ultimately there’s nothing
seriously wrong with their implementations. Today, we get to
have a lot more fun. Fasten your seatbelts, because this will
be a bumpy ride.
The usual preamble: [.data.frame is very useful, nice and
terse, and ‘makes sense’ once you adopt the R mindset.
Get rows with df[x, ], get column(s?) with df[, x], or
get columns with just plain old df[x]. Seems intuitive,
on the face of it.
Intuitively, [.data.frame seems like it shouldn’t be too
hard to implement. For the numeric types, we’re trying to
get rows or columns by index; for logical types, we get the
rows or columns for which the vector elements are TRUE;
for character vetors, we match on names or row.names.
In addition, data.frames should only contain atomic
vectors – it is possible to stuff other objects in, but
that is usually quite rare and prone to break in other,
surprising, ways, and so is generally discouraged.
Seems like there isn’t that much room for ugliness, right?
We’ll start with one bit that is really subtle. We’ll call
the subsetting indices as df[i, j]; that is, i denotes
a vector used for row subsetting, while j denotes a vector
used for column subsetting. Why is it
that the following two statements can give different results?
In the first case, we are explicitly setting the j argument
to be missing; in the second case, it is implicitly missing.
It seems surprising, but [.data.frame does an ad-hoc dispatch
based on the number of implicitly missing arguments. How does it
do that?
Can we use the missing() function to help us out here?
Well, that’s not helpful – R still believes that the j
argument is missing in each case (which it is). But how do we differentiate
between this notion of implicit and explicit missingness?
Let’s look at the first few lines of [.data.frame to find
out what really happens behind the scenes.
This is made more complicated by the presence of the drop
argument, so try to ignore that bit. What R actually does
is ‘dispatches’ based on the number of args. Let’s see:
In other words, nargs() does not include
implicitly missing arguments when it counts the number
of arguments passed to the function.
Notice that nargs() in f() returns zero
(all arguments are implicitly missing), while nargs() in
f(,) returns two (all arguments are explicitly missing;
it’s as though we had really called f(<missing>, <missing>)).
Okay, so we understand how [.data.frame can discriminate
between df[i] and df[i, ], because nargs() gives us
a funny way to poke at that. Now, let’s look at the first
branch, where Nargs < 3 – implying a df[i] call.
From here on, we’re going to start walking line-by-line
through the code in [.data.frame. Just for posterity,
this is the version of R I’m running – it’s possible
that [.data.frame might look slightly different on your
computer:
Session info
------------
setting value
version R version 3.1.2 (2014-10-31)
system x86_64, darwin13.4.0
ui RStudio (0.98.1091)
language (EN)
collate en_CA.UTF-8
tz America/Vancouver
I’ll write my own comments above the code used for
[.data.frame, explaining what’s going on each step of
the way.
Although the dispatch was somewhat awkward, this mostly
seems sane. There is a bit of code duplication and weirdness
through calling NextMethod("["), but it remains
functional and understandable. Although terribly opaque,
NextMethod("[") is the only way to get at that subsetting
primitive past [.data.frame (as [.list does not exist),
since list is not really an S3 class, just a SEXP type.
Let’s move on! The next branch focuses on what happens if
i, our row subetting bit, is missing:
Phew! We’ve now seen the dispatches for:
df[i]; ie, column subsetting, and
df[, j], ie, column subsetting with i missing.
Now let’s get to the really fun stuff. Now, we need to
see how row subsetting, and potentially column subsetting,
gets performed.
Phew! I feel dizzy. Do you? We’re actually not quite done –
there are two more branches of code dealing with more messy
row.names stuff, which isn’t worth going into.
Overall, the code itself is
pretty unreadable (one has to slow down and think to have
a hope of understanding of what’s going on).
What about performance? Surprisingly, there is nothing really
egregiously bad – the main culprit is the use of temporaries,
alongside many calls to [. Instead, there is just a lot
of awkward error handling, alongside some somewhat questionable
methodology for performing the subsetting.
However, given that this is
implemented as a primitive (and R has gotten better about
performing shallow copies when appropriate) there aren’t
too many needless allocations. And, it’s fairly easy to imagine
this having evolved over R’s lifetime to something that was
once cute and tidy to something that has had to remain backwards
compatible as R evolved.
That said, there’s nothing stopping us from implementing
the same functionality in a much more sane manner, so let’s
try doing this ourselves in Rcpp. Of course, having the same
amount of flexibility will be tough, but let’s see what
we get specifically for an integer-integer subsetting case.
One could imagine implementing similar functionality for
different pairs of indexing, or forcing a final dispatch to
the integer-integer case.
This will be a slightly over simplified implementation that:
Neglects most error checking,
Only allows for integer vector subsetting,
Only allows subsetting for four of the atomic types
(numeric, integer, logical, string),
Ignores row.names, and
Never drops.
Let’s see if it works. It should behave identically to
df[i, j, drop = FALSE], for some subset of arguments.
It looks identical, barring the difference with row.names,
which we have explicitly decided to avoid.
What about performance? Note that I am somewhat intentionally
cheating because I don’t have any performance hit in
generating and populating the row.names attribute – but
how much does it really matter?
Yuck! R is really, really slow when it comes to taking
small subsets of a large vector, or at least much slower
than it should be. (For some more musing on this topic,
see Extracting a single value from a data frame
in Hadley’s Advanced R book.)
This is almost certainly due to the messy handling required
in generating and validating the row.names attribute,
as well as checking and handling of duplicate names.
However, we get these huge gains because:
We avoid allocating a bunch of (small, but temporary
and unneeded) objects at the R level,
We avoid the excessive branching associated with the
drop argument,
We avoid unnecessary, repeated dispatches through [
and directly access the memory we want to use –
also permitting the compiler to make some optimizations
as necessary,
More than performance, though, what really counts is that,
when uncommented, R’s implementation of [.data.frame
is pretty unreadable. You can read the code online,
which even comes with some (albeit bare) comments. It seems
unfortunate that R strips comments out of code included in
base (and other packages), since they would be enormously
useful for understanding what’s going on. (Not the mention
the rather surprising formatting style…)
In the end, what we see is that [.data.frame is really just
some calls to [ (as a primitive dispatching to the internal,
non-data.frame implementation), .subset(), and .subset2().
You can see such a simplified implementation in
dplyr
– note the much more relatively clean
implementation that arises when the messiness of drop and
row.names is avoided!
For some finishing remarks – one of the guiding tenants of
C++ is:
You don’t pay for what you don’t use.
and this is a very nice thing to keep in mind when implementing
functions and defining interfaces. It’s useful to contrain
interfaces when possible, as it allows you to design your
data structures in such a way that the unnecessary cruft only
exists when it is actually needed.
Unfortunately, this is very often not true in R, and
especially [.data.frame, where every call forces you to
check the hoops of row.names, drop, and uniqueness of
names, whether you care or not. Of course, the
priorities of R and C++ differ greatly, but it is an
important thing to keep in mind when attempting to write
performant R code.