Understanding [.data.frame


Welcome back! Last post, I looked at head.default and tail.default, and gave them a little bit of a hard time, but ultimately there’s nothing seriously wrong with their implementations. Today, we get to have a lot more fun. Fasten your seatbelts, because this will be a bumpy ride.

The usual preamble: [.data.frame is very useful, nice and terse, and ‘makes sense’ once you adopt the R mindset. Get rows with df[x, ], get column(s?) with df[, x], or get columns with just plain old df[x]. Seems intuitive, on the face of it.

Intuitively, [.data.frame seems like it shouldn’t be too hard to implement. For the numeric types, we’re trying to get rows or columns by index; for logical types, we get the rows or columns for which the vector elements are TRUE; for character vetors, we match on names or row.names. In addition, data.frames should only contain atomic vectors – it is possible to stuff other objects in, but that is usually quite rare and prone to break in other, surprising, ways, and so is generally discouraged. Seems like there isn’t that much room for ugliness, right?

We’ll start with one bit that is really subtle. We’ll call the subsetting indices as df[i, j]; that is, i denotes a vector used for row subsetting, while j denotes a vector used for column subsetting. Why is it that the following two statements can give different results?

df[x, ]
df[x]

In the first case, we are explicitly setting the j argument to be missing; in the second case, it is implicitly missing. It seems surprising, but [.data.frame does an ad-hoc dispatch based on the number of implicitly missing arguments. How does it do that?

Can we use the missing() function to help us out here?

f <- function(i, j) cbind(missing(i), missing(j))
rbind(
  f(1), ## implicitly missing
  f(1,) ## explicitly missing
)
​      [,1] [,2]
[1,] FALSE TRUE
[2,] FALSE TRUE

Well, that’s not helpful – R still believes that the j argument is missing in each case (which it is). But how do we differentiate between this notion of implicit and explicit missingness?

Let’s look at the first few lines of [.data.frame to find out what really happens behind the scenes.

function (x, i, j, drop = if (missing(i)) TRUE else length(cols) == 
    1) 
{
    mdrop <- missing(drop)
    Narg <- nargs() - (!mdrop)
    has.j <- !missing(j)
    if (!all(names(sys.call()) %in% c("", "drop")) && !isS4(x)) 
        warning("named arguments other than 'drop' are discouraged")
    if (Narg < 3L) {
      < ... snip ... >

This is made more complicated by the presence of the drop argument, so try to ignore that bit. What R actually does is ‘dispatches’ based on the number of args. Let’s see:

f <- function(x, y) nargs()
rbind(f(),     ## nothing passed
      f(1),    ## single argument
      f(1,),   ## argument + explicit missing
      f(,))    ## both explicit missing
​     [,1]
[1,]    0
[2,]    1
[3,]    2
[4,]    2

In other words, nargs() does not include implicitly missing arguments when it counts the number of arguments passed to the function. Notice that nargs() in f() returns zero (all arguments are implicitly missing), while nargs() in f(,) returns two (all arguments are explicitly missing; it’s as though we had really called f(<missing>, <missing>)).

Okay, so we understand how [.data.frame can discriminate between df[i] and df[i, ], because nargs() gives us a funny way to poke at that. Now, let’s look at the first branch, where Nargs < 3 – implying a df[i] call.

From here on, we’re going to start walking line-by-line through the code in [.data.frame. Just for posterity, this is the version of R I’m running – it’s possible that [.data.frame might look slightly different on your computer:

Session info
------------
  
setting  value                       
version  R version 3.1.2 (2014-10-31)
system   x86_64, darwin13.4.0        
ui       RStudio (0.98.1091)         
language (EN)                        
collate  en_CA.UTF-8                 
tz       America/Vancouver           

I’ll write my own comments above the code used for [.data.frame, explaining what’s going on each step of the way.

## This check implies the call is of the form
## `df[i]` or `df[i, drop = .`, so we want to do
## some kind of column subsetting.
if (Narg < 3L) {
  
  ## Ignore 'drop' -- when doing single-element
  ## subsetting, we always return a data.frame
  if (!mdrop) 
    warning("'drop' argument will be ignored")
  
  ## If there was no `i` argument, implying a call
  ## like `df[]`, just return `df`.
  if (missing(i)) 
    return(x)
  
  ## If `i` is a matrix, convert `x` to a matrix
  ## (!?) and then subset that matrix with `i`. I
  ## still do not understand matrix subsetting of
  ## data.frames (or matrix subsetting of
  ## matrices, for that matter...)
  if (is.matrix(i)) 
    return(as.matrix(x)[i])
  
  ## Get the names, and ensure it's a character
  ## vector
  nm <- names(x)
  if (is.null(nm)) 
    nm <- character()
  
  ## If i is not a character (ie, numeric or
  ## logical subsetting...), AND one (or more) of
  ## the names is NA...
  if (!is.character(i) && anyNA(nm)) {
    
    ## Replace the names with indices (1 to ncol(x))
    names(nm) <- names(x) <- seq_along(x)
    
    ## Call the next method after `[.data.frame`. In
    ## this case, that implies calling an internal
    ## generic; this effectively amounts to calling
    ## `as.list(x)[i]`, but this gets done in internal
    ## C code (there is no `[.list` S3 method
    ## registered). This is the joy of 'internal
    ## generics'! The intention of this code is likely
    ## to avoid having to actually call `as.list(x)`,
    ## which could force a deep copy.
    ## 
    ## NB: As Hadley pointed out, this could simply
    ## have been `.subset2()`, which would have been
    ## more readable (and, in fact, is used in other
    ## parts of this function anyway!)
    y <- NextMethod("[")
    
    ## Get and validate the names
    cols <- names(y)
    if (anyNA(cols)) 
      stop("undefined columns selected")
    cols <- names(y) <- nm[cols]
  }
  else {
    
    ## Effectively the same as above, expect with
    ## marginally different error checks. Seems like
    ## this could have been consolidated...
    y <- NextMethod("[")
    cols <- names(y)
    if (!is.null(cols) && anyNA(cols)) 
      stop("undefined columns selected")
  }
  
  ## Fix up column names
  if (anyDuplicated(cols)) 
    names(y) <- make.unique(cols)
  
  ## Set 'automatic', 'compressed' row names, and the 
  ## class for our return.
  attr(y, "row.names") <- .row_names_info(x, 0L)
  attr(y, "class") <- oldClass(x)
  return(y)
}

Although the dispatch was somewhat awkward, this mostly seems sane. There is a bit of code duplication and weirdness through calling NextMethod("["), but it remains functional and understandable. Although terribly opaque, NextMethod("[") is the only way to get at that subsetting primitive past [.data.frame (as [.list does not exist), since list is not really an S3 class, just a SEXP type.

Let’s move on! The next branch focuses on what happens if i, our row subetting bit, is missing:

## If `i` is missing, e.g. in x[, j]...
if (missing(i)) {
  
  ## For some reason, we provide a shortcut for when:
  ##
  ##   1. drop is TRUE,
  ##   2. j is missing, and
  ##   3. x is a one-column data.frame.
  ##
  ## This amounts to being a funky way of extracting
  ## the first column from a single column data.frame,
  ## but it seems odd to specialize for that case ...
  if (drop && !has.j && length(x) == 1L) 
      return(.subset2(x, 1L))
  
  ## Get names as non-null character vector.
  nm <- names(x)
  if (is.null(nm)) 
      nm <- character()
  
  ## Do some more weirdness, similar to before, where
  ## we inline some awkward error handling over what is
  ## really just a call to `.subset(x, j)`.
  if (has.j && !is.character(j) && anyNA(nm)) {
      names(nm) <- names(x) <- seq_along(x)
      y <- .subset(x, j)
      cols <- names(y)
      if (anyNA(cols)) 
          stop("undefined columns selected")
      cols <- names(y) <- nm[cols]
  }
  else {
      y <- if (has.j) 
          .subset(x, j)
      else x
      cols <- names(y)
      if (anyNA(cols)) 
          stop("undefined columns selected")
  }
  
  ## If, after subsetting, we have a single-column 
  ## data.frame then do another awkward early return.
  if (drop && length(y) == 1L) 
      return(.subset2(y, 1L))
  
  ## Fix up names.
  if (anyDuplicated(cols)) 
      names(y) <- make.unique(cols)
  
  ## Get the number of rows. Wait, why are we using 
  ## .row_names_info instead of just plain `nrow`? If
  ## you look at how `nrow()` works for a `data.frame`,
  ## you'll see...
  ##
  ##    nrow() 
  ##        -> dim() 
  ##        -> dim.data.frame()
  ##        -> .row_names_info(, 2L)
  ##
  ## so it is just a shortcut to that dispatch.
  nrow <- .row_names_info(x, 2L)
  
  ## For one-row data.frames, if drop is TRUE, try to 
  ## return the row as a vector, letting structure
  ## handle coercion. Yuck!
  if (drop && !mdrop && nrow == 1L) 
      return(structure(y, class = NULL, row.names = NULL))
  
  ## Otherwise, set the class, row.names, and return.
  else {
      attr(y, "class") <- oldClass(x)
      attr(y, "row.names") <- .row_names_info(x, 0L)
      return(y)
  }
}

Phew! We’ve now seen the dispatches for:

  1. df[i]; ie, column subsetting, and
  2. df[, j], ie, column subsetting with i missing.

Now let’s get to the really fun stuff. Now, we need to see how row subsetting, and potentially column subsetting, gets performed.

## Let's make a 'copy' of 'x' and call it 'xx', because
## we're going to start mutating that object. Note that
## this should be a shallow copy (ie, we are just
## placing a new reference to the data structure
## pointed to by `x`, called `xx`)
xx <- x
cols <- names(xx)

## Replace x with a list of length x, containing the 
## same attributes of `xx`. Really, this copies _all_ 
## attributes from `xx` to `x`.
x <- vector("list", length(x))
x <- .Internal(copyDFattr(xx, x))

## And then, now that we've copied all of the
## attributes, let's go ahead and clear out the
## data.frame specific ones, because ... I guess we
## need to refresh them later?
oldClass(x) <- attr(x, "row.names") <- NULL

## If `j` was specified (ie, this is `df[i, j]`)...
if (has.j) {
  
  ## Get names as character
  nm <- names(x)
  if (is.null(nm)) 
    nm <- character()
  
  ## If any names are NA, reset the names as integers
  if (!is.character(j) && anyNA(nm)) 
    names(nm) <- names(x) <- seq_along(x)
  
  ## Oh! Now you know! We removed the class from `x` so
  ## that we could use `[`, and get back at that
  ## `[.list` primitive hiding in the C sources. So
  ## this is where we do column subsetting. This is
  ## still weird as heck though, since we're subsetting
  ## an _empty list_; ie, list whose elements are just
  ## NULL. I guess this is a way of getting the names
  ## of x in an appropriate order?
  x <- x[j]
  
  ## Some more unreadable nonsense handling the special
  ## case of drop being true, and `x` being a length 
  ## one vector.
  cols <- names(x)
  if (drop && length(x) == 1L) {
    
    ## Note that, within this branch, we no longer care
    ## what `x` is. So we just used that as some weird
    ## proxy subsetting object; now we go back to `xx`
    ## (which is the orginal `x` -- remmber how we
    ## copied it? Why on earth is this code so
    ## convoluted?)
    if (is.character(i)) {
      rows <- attr(xx, "row.names")
      i <- pmatch(i, rows, duplicates.ok = TRUE)
    }
    xj <- .subset2(.subset(xx, j), 1L)
    return(if (length(dim(xj)) != 2L) xj[i]
           else xj[i, , drop = FALSE])
  }
  
  ## More error checking + 'fixing up' of names and
  ## such.
  if (anyNA(cols)) 
    stop("undefined columns selected")
  if (!is.null(names(nm))) 
    cols <- names(x) <- nm[cols]
  
  ## Get some index vectors, so that we can handle the 
  ## column re-organizing / subsetting later.
  nxx <- structure(seq_along(xx), names = names(xx))
  
  ## This vector governs the column indices, which we 
  ## will see are used later when copying from `xx` 
  ## back into `x`.
  sxx <- match(nxx[j], seq_along(xx))
}

## When `j` is not supplied, this implies we want all 
## columns without re-ordering -- so just take the
## indices from 1 to length(x).
else sxx <- seq_along(x)

## Okay, now let's look at row subsetting.
rows <- NULL

## If `i` is a character vector, figure out the indices
## by matching `i` to the row.names of `xx`. (Not `x`
## -- remember, we decided to turn that into a `list`
## to get at the `[` primitive)
if (is.character(i)) {
  
  ## Directly access the attribute, to avoid dispatch /
  ## other ugliness in `rownames()`, or `row.names()`.
  rows <- attr(xx, "row.names")
  i <- pmatch(i, rows, duplicates.ok = TRUE)
}

## For each vector in `x`, our empty list...
for (j in seq_along(x)) {
  
  ## Figure out the vector in `xx` that we want to put
  ## at the `j`th column.
  xj <- xx[[sxx[j]]]
  
  ## If xj is not a 2D object (e.g. a matrix), then
  ## just mash it into x[[j]] (subsetting with `i` --
  ## there's the row subsetting kicking in!) Also,
  ## implicitly respect the `drop` attribute here.
  ## 
  ## Note that we're now populating `x`, our 'output' 
  ## data.frame, with elements of `xx`. So very
  ## roundabout!
  if (length(dim(xj)) != 2L)
    x[[j]] <- xj[i]
  
  ## Otherwise, call the same function, but with `drop`
  ## explicitly as FALSE (ignoring if it was set TRUE 
  ## earlier). This way, we can stuff matrices (or
  ## other data.frames!) in without dropping
  ## attributes.
  else
    x[[j]] <- xj[i, , drop = FALSE]
}

Phew! I feel dizzy. Do you? We’re actually not quite done – there are two more branches of code dealing with more messy row.names stuff, which isn’t worth going into. Overall, the code itself is pretty unreadable (one has to slow down and think to have a hope of understanding of what’s going on).

What about performance? Surprisingly, there is nothing really egregiously bad – the main culprit is the use of temporaries, alongside many calls to [. Instead, there is just a lot of awkward error handling, alongside some somewhat questionable methodology for performing the subsetting. However, given that this is implemented as a primitive (and R has gotten better about performing shallow copies when appropriate) there aren’t too many needless allocations. And, it’s fairly easy to imagine this having evolved over R’s lifetime to something that was once cute and tidy to something that has had to remain backwards compatible as R evolved.

That said, there’s nothing stopping us from implementing the same functionality in a much more sane manner, so let’s try doing this ourselves in Rcpp. Of course, having the same amount of flexibility will be tough, but let’s see what we get specifically for an integer-integer subsetting case. One could imagine implementing similar functionality for different pairs of indexing, or forcing a final dispatch to the integer-integer case.

This will be a slightly over simplified implementation that:

  1. Neglects most error checking,
  2. Only allows for integer vector subsetting,
  3. Only allows subsetting for four of the atomic types (numeric, integer, logical, string),
  4. Ignores row.names, and
  5. Never drops.
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
SEXP subset_df(SEXP x,
               IntegerVector row_indices,
               IntegerVector column_indices) {
                 
  // Get some useful variables
  // (lengths of index vectors)
  int column_indices_n = column_indices.size();
  int row_indices_n = row_indices.size();
  
  // Translate from R to C indices.
  // This could be optimized...
  row_indices = row_indices - 1;
  column_indices = column_indices - 1;

  // Allocate the output 'data.frame'.
  List output = no_init(column_indices_n);

  // Set the names, based on the names of 'x'.
  // We'll assume it has names.
  CharacterVector x_names =
    as<CharacterVector>(Rf_getAttrib(x, R_NamesSymbol));

  // Use Rcpp's sugar subsetting to subset names.
  output.attr("names") = x_names[column_indices];

  // Loop and fill!
  for (int j = 0; j < column_indices_n; ++j)
  {
    // Get the j'th element of 'x'. We don't need to
    // PROTECT it since it's implicitly protected as a
    // child of 'x'.
    SEXP element = VECTOR_ELT(x, column_indices[j]);

    // Get the 'rows' for that vector, and fill.
    SEXP vec = PROTECT(
      Rf_allocVector(TYPEOF(element), row_indices_n)
    );
    
    for (int i = 0; i < row_indices_n; ++i)
    {
      // Copying vectors is a pain in the butt, because
      // we need to know the actual type underneath the
      // SEXP. I'll just handle a couple of the main
      // types. One could imagine simplifying this with
      // some macro magic...
      switch (TYPEOF(vec))
      {
      case REALSXP:
        REAL(vec)[i] =
          REAL(element)[row_indices[i]];
        break;
      case INTSXP:
      case LGLSXP:
        INTEGER(vec)[i] =
          INTEGER(element)[row_indices[i]];
        break;
      case STRSXP:
        SET_STRING_ELT(vec, i,
          STRING_ELT(element, row_indices[i]));
        break;
      }
    }
    
    // Don't need to protect 'vec' anymore
    UNPROTECT(1);

    // And make sure the output list now
    // refers to that vector!
    SET_VECTOR_ELT(output, j, vec);
  }
  
  // Finally, copy the attributes of `x` to `output`...
  Rf_copyMostAttrib(x, output);

  // ... but set the row names manually. Note that this
  // is the secret method for creating 'compact' row
  // names, whereby internally R stores the 'empty' row
  // names object as 'c(NA_integer_, -<nrow>)'.
  Rf_setAttrib(output, R_RowNamesSymbol,
    IntegerVector::create(NA_INTEGER, -row_indices_n));

  return output;

}

Let’s see if it works. It should behave identically to df[i, j, drop = FALSE], for some subset of arguments.

df <- data.frame(
  x = 1:5,
  y = letters[1:5],
  z = c(TRUE, FALSE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)
subset_df(df, 1:3, 1:2)
​  x y
1 1 a
2 2 b
3 3 c
subset_df(df, c(1, 2, 5), c(1, 3))
​  x     z
1 1  TRUE
2 2 FALSE
3 5 FALSE
all.equal(
  subset_df(df, c(1, 2, 5), c(1, 3)),
  df[c(1, 2, 5), c(1, 3), drop = FALSE]
)
​[1] "Attributes: < Component \"row.names\": Mean relative difference: 0.6666667 >"

It looks identical, barring the difference with row.names, which we have explicitly decided to avoid.

What about performance? Note that I am somewhat intentionally cheating because I don’t have any performance hit in generating and populating the row.names attribute – but how much does it really matter?

library("microbenchmark")

df <- data.frame(
  x = 1:1E2,
  y = sample(letters, 1E2, TRUE)
)

microbenchmark(
  R = df[5:10, 2, drop = FALSE],
  Cpp = subset_df(df, 5:10, 2)
)
​Unit: microseconds
 expr    min     lq     mean  median     uq     max neval cld
    R 80.358 83.324 89.56236 85.2925 90.503 309.587   100   b
  Cpp  3.073  3.711  5.31664  5.5675  5.947  29.017   100  a 

Yuck! R is really, really slow when it comes to taking small subsets of a large vector, or at least much slower than it should be. (For some more musing on this topic, see Extracting a single value from a data frame in Hadley’s Advanced R book.) This is almost certainly due to the messy handling required in generating and validating the row.names attribute, as well as checking and handling of duplicate names.

However, we get these huge gains because:

  1. We avoid allocating a bunch of (small, but temporary and unneeded) objects at the R level,

  2. We avoid the excessive branching associated with the drop argument,

  3. We avoid unnecessary, repeated dispatches through [ and directly access the memory we want to use – also permitting the compiler to make some optimizations as necessary,

More than performance, though, what really counts is that, when uncommented, R’s implementation of [.data.frame is pretty unreadable. You can read the code online, which even comes with some (albeit bare) comments. It seems unfortunate that R strips comments out of code included in base (and other packages), since they would be enormously useful for understanding what’s going on. (Not the mention the rather surprising formatting style…)

In the end, what we see is that [.data.frame is really just some calls to [ (as a primitive dispatching to the internal, non-data.frame implementation), .subset(), and .subset2(). You can see such a simplified implementation in dplyr – note the much more relatively clean implementation that arises when the messiness of drop and row.names is avoided!

For some finishing remarks – one of the guiding tenants of C++ is:

You don’t pay for what you don’t use.

and this is a very nice thing to keep in mind when implementing functions and defining interfaces. It’s useful to contrain interfaces when possible, as it allows you to design your data structures in such a way that the unnecessary cruft only exists when it is actually needed. Unfortunately, this is very often not true in R, and especially [.data.frame, where every call forces you to check the hoops of row.names, drop, and uniqueness of names, whether you care or not. Of course, the priorities of R and C++ differ greatly, but it is an important thing to keep in mind when attempting to write performant R code.


Comments