
R data.table: accessing column with variable name

I am using the wonderful R data.table package. However, accessing (i.e. manipulating by reference) a column whose name is stored in a variable is very clumsy. Suppose we are given a data.table dt with two columns x and y, and we want to add the two columns and name the result z. The command is

dt = dt[, z := x + y]

Now let us write a function add that takes as arguments a (reference to a) data.table dt and three column names summand1Name, summand2Name and resultName, and that is supposed to execute exactly the same command as above, only with general column names. The solution I am using right now is reflection, i.e.

add = function(dt, summand1Name, summand2Name, resultName) {
  cmd = paste0('dt = dt[, ', resultName, ' := ', summand1Name, ' + ', summand2Name, ']')
  eval(parse(text=cmd))
  return(dt) # optional since manipulated  by reference
}

However, I am absolutely not satisfied with this solution. First of all, it is clumsy, and it is no fun to code like this. It is hard to debug, and it just pisses me off and burns time. Secondly, it is harder to read and understand. Here is my question:

Can we write this function in a somewhat nicer way?

I am aware that one can access a column by variable name like so: dt[[resultName]]. But when I write

dt[[resultName]] = dt[[summand1Name]] + dt[[summand2Name]]

then data.table starts to complain about having taken copies and not working by reference. I don't want that. I also like the syntax dt = dt[<all 'database related operations'>], so that everything I am doing stays together in one pair of brackets. Isn't it possible to use a special symbol, such as backticks, to indicate that the name being used does not refer to an actual column of the data table but is rather a placeholder for the name of an actual column?

You can wrap the name in () on the LHS of := so that it is evaluated as a variable, and use with = FALSE to select columns by variable on the RHS.

library(data.table)

dt <- data.table(a = 1:5, b = 10:14)
my_add <- function(dt, summand1Name, summand2Name, resultName) {
  # (resultName) on the LHS is evaluated; with = FALSE selects columns by name
  dt[, (resultName) := dt[, summand1Name, with = FALSE] + 
       dt[, summand2Name, with = FALSE]]
}
my_add(dt, 'a', 'b', 'c')
dt

Edit:

I compared the versions below. Mine is the least efficient (but I will keep it for reference).

set.seed(1)
dt <- data.table(a = rnorm(10000), b = rnorm(10000))
original_add <- function(dt, summand1Name, summand2Name, resultName) {
  cmd = paste0('dt = dt[, ', resultName, ' := ', summand1Name, ' + ', summand2Name, ']')
  eval(parse(text=cmd))
  return(dt) # optional since manipulated  by reference
}
my_add <- function(dt, summand1Name, summand2Name, resultName) {
  dt[, (resultName) := dt[, summand1Name, with = FALSE] + 
       dt[, summand2Name, with = FALSE]]
}
list_access_add <- function(dt, summand1Name, summand2Name, resultName) {
  dt[, (resultName) := dt[[summand1Name]] + dt[[summand2Name]]]
}
david_add <- function(dt, summand1Name, summand2Name, resultName) {
  dt[, (resultName) := .SD[[summand1Name]] + .SD[[summand2Name]]]
}

microbenchmark::microbenchmark(
  original_add(dt, 'a', 'b', 'c'),
  my_add(dt, 'a', 'b', 'c'),
  list_access_add(dt, 'a', 'b', 'c'),
  david_add(dt, 'a', 'b', 'c'))

## Unit: microseconds
##                                expr      min        lq      mean    median        uq      max neval
##     original_add(dt, "a", "b", "c")  604.397  659.6395  784.2206  713.0315  776.1295 5070.541   100
##           my_add(dt, "a", "b", "c") 1063.984 1168.6140 1460.5329 1247.7990 1486.9730 6134.959   100
##  list_access_add(dt, "a", "b", "c")  272.822  310.9680  422.6424  334.3110  380.6885 3620.463   100
##        david_add(dt, "a", "b", "c")  389.389  431.9080  542.7955  454.5335  493.4895 3696.992   100

Edit2:

With one million rows, the result looks like this. As expected, the original method performs well: once the eval is done, the operation itself runs fast.

## Unit: milliseconds
##                                expr       min        lq      mean    median        uq      max neval
##     original_add(dt, "a", "b", "c")  2.493553  3.499039  6.585651  3.607101  4.390051 114.0612   100
##           my_add(dt, "a", "b", "c") 11.821820 14.512878 28.387841 17.412433 19.642231 117.6359   100
##  list_access_add(dt, "a", "b", "c")  2.161276  3.133110  6.874885  3.218185  3.407776 107.6853   100
##        david_add(dt, "a", "b", "c")  2.237089  3.313133  6.047832  3.381757  3.788558 103.7532   100

new_add <- function(dt, summand1Name, summand2Name, resultName) {
    dt[, (resultName) := rowSums(.SD), .SDcols = c(summand1Name, summand2Name)]
}

This just takes the column names as strings. Adding it to amatsuo_net's speed test, together with sindri's two versions (the get() and mget() functions, benchmarked here as get_add and mget_add), gives the following:

microbenchmark::microbenchmark(
  original_add(dt, 'a', 'b', 'c'),
  my_add(dt, 'a', 'b', 'c'),
  list_access_add(dt, 'a', 'b', 'c'),
  david_add(dt, 'a', 'b', 'c'),
  new_add(dt, 'a', 'b', 'c'),
  get_add(dt, 'a', 'b', 'c'),
  mget_add(dt, 'a', 'b', 'c'))

## Unit: microseconds
##                               expr   min      lq     mean median      uq     max neval
##    original_add(dt, "a", "b", "c") 433.3  491.00  635.315  531.4  600.00  6064.0   100
##          my_add(dt, "a", "b", "c") 978.0 1062.35 1310.808 1208.8 1357.80  4157.3   100
## list_access_add(dt, "a", "b", "c") 303.9  331.95  432.939  363.8  434.05  3361.6   100
##       david_add(dt, "a", "b", "c") 401.3  440.65  659.748  474.5  577.75 11623.0   100
##         new_add(dt, "a", "b", "c") 518.9  588.30  765.394  667.1  741.95  5636.5   100
##         get_add(dt, "a", "b", "c") 415.1  454.50  674.699  491.1  546.70  9804.3   100
##        mget_add(dt, "a", "b", "c") 425.4  474.65  596.165  533.2  590.75  3888.0   100

It's not the fastest of the versions, but if you're looking for code that's painless to write, this is pretty simple. Since it is built on rowSums, it also generalises easily to summing an arbitrary number of columns at once.

Additionally, since dt isn't mentioned inside the square brackets, you can put this column definition inside a data.table "pipe" (chained brackets) rather than a function, if you want to:
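As a rough sketch of that generalisation (the helper name sum_cols and the column name total are made up for illustration, they are not from the answers here), the same rowSums call can take a whole vector of column names. Note that rowSums converts .SD to a matrix, so this assumes all the selected columns are numeric:

```r
library(data.table)

# Hypothetical helper: sums every column named in `cols` and writes the
# result, by reference, into the column named by `resultName`.
sum_cols <- function(dt, cols, resultName) {
  dt[, (resultName) := rowSums(.SD), .SDcols = cols]
}

dt <- data.table(a = 1:3, b = 4:6, c = 7:9)
sum_cols(dt, c("a", "b", "c"), "total")
dt[]
```

The new column is of type double, because rowSums always returns doubles, even when the inputs are integers.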

dt[, (resultName) := rowSums(.SD), .SDcols = c(summand1Name, summand2Name)
][, lapply(.SD, range), .SDcols = c(summand1Name, summand2Name, resultName)
][... # etc
]

Using get():

add <- function(dt, summand1Name, summand2Name, resultName) {
  dt[, (resultName) := get(summand1Name) + get(summand2Name)]
}

Using mget():

add2 <- function(dt, summand1Name, summand2Name, resultName) {
  dt[, (resultName) := do.call(`+`, mget(c(summand1Name, summand2Name)))]
}

# Let
dt <- data.table(x = 1, y = 2)
# Then
add(dt, 'x', 'y', 'z')
dt[]
#    x y z
# 1: 1 2 3
Here's another solution using substitute. I generally try to avoid substitute, but I think it's the only way to use fast data.table := code instead of native list access.

I kept to amatsuo_net's interface.

set.seed(1)
dt <- data.table(a = rnorm(10000), b = rnorm(10000))

snaut_add <- function(dt, summand1, summand2, resultName){
  eval(substitute(
    dt[, z := x + y],
    list(
      z=as.symbol(resultName),
      x=as.symbol(summand1),
      y=as.symbol(summand2)
    )
  ))
}

snaut_add(dt, "a", "b", "c")
dt
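To see what substitute does here, it may help to build the call without evaluating it. The following is only an illustrative sketch in base R; the names "a", "b" and "c" are the example columns from above, and no data.table is needed because nothing is evaluated:

```r
# Replace the placeholder symbols z, x and y with the real column names.
# The result is an unevaluated call, the same as typing dt[, c := a + b].
expr <- substitute(
  dt[, z := x + y],
  list(z = as.symbol("c"), x = as.symbol("a"), y = as.symbol("b"))
)
print(expr)

# The substituted call is identical to the hand-written one.
identical(expr, quote(dt[, c := a + b]))
```

So eval() ends up running an ordinary := expression with literal column names, which is why this version is about as fast as typing the command by hand.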
