简体   繁体   中英

R data.table: accessing column with variable name

I am using the wonderful R data.table package. However, accessing (ie manipulating by reference) a column with a variable name is very clumsy: If we are given a data.table dt which has two columns x and y and we want to add two columns and name it z then the command is

dt = dt[, z := x + y]

Now let us write a function add that takes as arguments a (reference to a) data.table dt and three column names summand1Name , summand2Name and resultName and it is supossed to execute the exact same command as above only with general column names. The solution I am using right now is reflection, ie

add = function(dt, summand1Name, summand2Name, resultName) {
  cmd = paste0('dt = dt[, ', resultName, ' := ', summand1Name, ' + ', summand2Name, ']')
  eval(parse(text=cmd))
  return(dt) # optional since manipulated  by reference
}

However I am absolutely not satisfied with this solution. First of all it's clumsy, it does not make fun to code like this. It is hard to debug and it just pisses me off and burns time. Secondly, it is harder to read and understand. Here is my question:

Can we write this function in a somewhat nicer way?

I am aware of the fact that one can access columns with variable name like so: dt[[resultName]] but when I write

dt[[resultName]] = dt[[summand1Name]] + dt[[summand2Name]]

then data.table starts to complain about having taken copies and not working by reference. I don't want that. Also I like the syntax dt = dt[<all 'database related operations'>] so that everything I am doing is stuck together in one pair of brackets. Isn't it possible to make use of a special symbol like backticks or so in order to indicate that the name currently used is not referencing an actual column of the data table but rather is a placeholder for the name of an actual column?

You can combine the use of () on the LHS of := as well as with = FALSE in referencing a variable on the RHS.

dt <- data.table(a = 1:5, b = 10:14)
my_add <- function(dt, summand1Name, summand2Name, resultName) {
  dt[, (resultName) := dt[, summand1Name, with = FALSE] + 
       dt[, summand1Name, with = FALSE]]
}
my_add(dt, 'a', 'b', 'c')
dt

Edit:

Compared three versions. Mine is the most inefficient... (but will keep it just for reference).

set.seed(1)
dt <- data.table(a = rnorm(10000), b = rnorm(10000))
original_add <- function(dt, summand1Name, summand2Name, resultName) {
  cmd = paste0('dt = dt[, ', resultName, ' := ', summand1Name, ' + ', summand2Name, ']')
  eval(parse(text=cmd))
  return(dt) # optional since manipulated  by reference
}
my_add <- function(dt, summand1Name, summand2Name, resultName) {
  dt[, (resultName) := dt[, summand1Name, with = FALSE] + 
       dt[, summand1Name, with = FALSE]]
}
list_access_add <- function(dt, summand1Name, summand2Name, resultName) {
  dt[, (resultName) := dt[[summand1Name]] + dt[[summand2Name]]]
}
david_add <- function(dt, summand1Name, summand2Name, resultName) {
  dt[, (resultName) := .SD[[summand1Name]] + .SD[[summand2Name]]]
}

microbenchmark::microbenchmark(
  original_add(dt, 'a', 'b', 'c'),
  my_add(dt, 'a', 'b', 'c'),
  list_access_add(dt, 'a', 'b', 'c'),
  david_add(dt, 'a', 'b', 'c'))

## Unit: microseconds
##                                expr      min        lq      mean    median        uq      max
##     original_add(dt, "a", "b", "c")  604.397  659.6395  784.2206  713.0315  776.1295 5070.541
##           my_add(dt, "a", "b", "c") 1063.984 1168.6140 1460.5329 1247.7990 1486.9730 6134.959
##  list_access_add(dt, "a", "b", "c")  272.822  310.9680  422.6424  334.3110  380.6885 3620.463
##        david_add(dt, "a", "b", "c")  389.389  431.9080  542.7955  454.5335  493.4895 3696.992
##  neval
##    100
##    100
##    100
##    100

Edit2:

With one million rows, the result looks like this. As expected the original method perform well as once eval is done this will work fast.

## Unit: milliseconds
##                                expr       min        lq      mean    median        uq      max
##     original_add(dt, "a", "b", "c")  2.493553  3.499039  6.585651  3.607101  4.390051 114.0612
##           my_add(dt, "a", "b", "c") 11.821820 14.512878 28.387841 17.412433 19.642231 117.6359
##  list_access_add(dt, "a", "b", "c")  2.161276  3.133110  6.874885  3.218185  3.407776 107.6853
##        david_add(dt, "a", "b", "c")  2.237089  3.313133  6.047832  3.381757  3.788558 103.7532
##  neval
##    100
##    100
##    100
##    100
new_add <- function(dt, summand1Name, summand2Name, resultName) {
    dt[, (resultName) := rowSums(.SD), .SDcols = c(summand1Name, summand2Name)]
}

This just takes the column names as strings. Adding this to amatsuo_net's speed test, and adding sindri's two versions too, we get the following:

microbenchmark::microbenchmark(
  original_add(dt, 'a', 'b', 'c'),
  my_add(dt, 'a', 'b', 'c'),
  list_access_add(dt, 'a', 'b', 'c'),
  david_add(dt, 'a', 'b', 'c'),
  new_add(dt, 'a', 'b', 'c'),
  get_add(dt, 'a', 'b', 'c'),
  mget_add(dt, 'a', 'b', 'c'))

## Unit: microseconds
##                               expr   min      lq     mean median      uq     max neval
##    original_add(dt, "a", "b", "c") 433.3  491.00  635.315  531.4  600.00  6064.0   100
##          my_add(dt, "a", "b", "c") 978.0 1062.35 1310.808 1208.8 1357.80  4157.3   100
## list_access_add(dt, "a", "b", "c") 303.9  331.95  432.939  363.8  434.05  3361.6   100
##       david_add(dt, "a", "b", "c") 401.3  440.65  659.748  474.5  577.75 11623.0   100
##         new_add(dt, "a", "b", "c") 518.9  588.30  765.394  667.1  741.95  5636.5   100
##         get_add(dt, "a", "b", "c") 415.1  454.50  674.699  491.1  546.70  9804.3   100
##        mget_add(dt, "a", "b", "c") 425.4  474.65  596.165  533.2  590.75  3888.0   100

It's not the fastest among the versions, but if you're looking for code that's painless to write then this is pretty simple. Since it works off of rowSums , it can also more easily be generalised to sum over an arbitrary number of columns at once.

Additionally, since dt isn't mentioned inside the square brackets, you can add this column definition inside a data.table "pipe" instead of as a function, if you want to:

dt[, (resultName) := rowSums(.SD), .SDcols = c(summand1Name, summand2Name)
][, lapply(.SD, range), .SDcols = c(summand1Name, summand2Name, resultName)
][... # etc
]

Using get() :

add <- function(dt, summand1Name, summand2Name, resultName) {
  dt[, (resultName) := get(summand1Name) + get(summand1Name)]
}

Using mget() :

add2 <- function(dt, summand1Name, summand2Name, resultName) {
  dt[, (resultName) := do.call(`+`, mget(c(summand1Name,summand2Name)))]
}
# Let
dt <- data.table(a = 1:5, b = 10:14)
# Then
add(dt, 'x', 'y', 'z')
dt[]
#    x y z
# 1: 1 2 2

Here's another solution using substitute . I generally try to avoid using substitute , but I think it's the only way of using fast data.table and := code instead of native list access.

I kept to the interface of amatsuo_net.

set.seed(1)
dt <- data.table(a = rnorm(10000), b = rnorm(10000))

snaut_add <- function(dt, summand1, summand2, resultName){
  eval(substitute(
    dt[, z := x + y],
    list(
      z=as.symbol(resultName),
      x=as.symbol(summand1),
      y=as.symbol(summand2)
    )
  ))
}

snaut_add(dt, "a", "b", "c")
dt

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM