R user-defined functions that wrap data.table commands: how to refer to a column correctly

Question

I have the data set df1:

df1 <- data.frame(id=c("A","A","A","A","B","B","B","B"),
                        year=c(2014,2014,2015,2015),
                        month=c(1,2),
                        new.employee=c(4,6,2,6,23,2,5,34))

  id year month new.employee
1  A 2014     1            4
2  A 2014     2            6
3  A 2015     1            2
4  A 2015     2            6
5  B 2014     1           23
6  B 2014     2            2
7  B 2015     1            5
8  B 2015     2           34

The expected result, produced by the following code:

library(data.table) # V1.9.6+
temp <- setDT(df1)[month == 2L, .(id, frank(-new.employee)), by = year]
df1[temp, new.employee.rank := i.V2, on = c("year", "id")]
df1
#    id year month new.employee new.employee.rank
# 1:  A 2014     1            4                 1
# 2:  A 2014     2            6                 1
# 3:  A 2015     1            2                 2
# 4:  A 2015     2            6                 2
# 5:  B 2014     1           23                 2
# 6:  B 2014     2            2                 2
# 7:  B 2015     1            5                 1
# 8:  B 2015     2           34                 1

Now I want to wrap this in a user-defined function so that I can vary the input column, which is new.employee in the example above. I tried a few approaches, but none of them worked:

First attempt:

myRank <- function(data, var) {
  temp <- setDT(data)[month == 2L, .(id, frank(-var)), by = year]
  data[temp, new.employee.rank := i.V2, on = c("year", "id")]
  return(data)
}
myRank(df1, new.employee)

Error in is.data.frame(x) : object 'new.employee' not found

Second attempt:
```
 myRank(df1,df1$new.employee) 
```

Nothing is printed.

Third attempt: I changed the function slightly
```
myRank <- function(data, var) {
  temp <- setDT(data)[month == 2L, .(id, rank(data$var)), by = year]
  data[temp, new.employee.rank := i.V2, on = c("year", "id")]
  return(data)
}
```
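myRank(df1, df1$new.employee)
Warning messages:
1: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
2: In `[.data.table`(setDT(data), month == 2L, .(id, rank(data$var)), : Item 2 of j's result for group 1 is zero length. This will be filled with 2 NAs to match the longest column in this result. Later groups may have a similar problem but only the first is reported to save filling the warning buffer.
3: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'

I have looked at similar questions, but my R experience is not enough to understand them.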

Answer 1

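data.table uses non-standard evaluation by default (unless you start messing around with with = FALSE), so you need to refer to the column by name or use get. Another problem with your code (as mentioned in the comments) is that you are calling new.employee, which is not defined outside the scope of df1. To stop R from evaluating the argument before it reaches the data set, you can use the deparse(substitute(var)) combination: it prevents evaluation and converts var to a character string, which can then be passed to get or to the eval(as.name()) combination (the latter does something entirely different within the data.table scope, but gives the same result). Finally, there is the printing issue after using := inside a function. Even when everything works, return(data) will not print anything; you need to force printing, either by appending an extra [] or by calling print explicitly.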
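To see these building blocks in isolation, here is a minimal sketch. The helper `arg_name` and the small tables `dt` and `df` are hypothetical names made up for illustration; they are not part of the question's data. The sketch also shows why the third attempt produced NULL: `$` does not evaluate `var`, it looks for a column literally called "var".

```r
library(data.table)

# Hypothetical helper: captures the argument as written, without evaluating it,
# and turns it into a character string.
arg_name <- function(var) deparse(substitute(var))

# Works even though no global object called new.employee exists.
col <- arg_name(new.employee)
col  # "new.employee"

# Inside j, get() resolves the string to the column of that name.
dt <- data.table(id = c("A", "B"), new.employee = c(6, 2))
ranks <- dt[, frank(-get(col))]
ranks  # 1 2

# `$` takes its right-hand side literally, so this looks for a column "var".
df <- data.frame(new.employee = c(4, 6))
var <- "new.employee"
is.null(df$var)  # TRUE
df[[var]]        # 4 6
```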

Here is a possible solution

myRank <- function(data, var) {
  var <- deparse(substitute(var)) ## <~~~ Note this
  temp <- setDT(data)[month == 2L, .(id, frank(-get(var))), by = year] ## <~~ Note the get
  data[temp, new.employee.rank := i.V2, on = c("year", "id")][] ## <~~ Note the []
}       
myRank(df1, new.employee)
#    id year month new.employee new.employee.rank
# 1:  A 2014     1            4                 1
# 2:  A 2014     2            6                 1
# 3:  A 2015     1            2                 2
# 4:  A 2015     2            6                 2
# 5:  B 2014     1           23                 2
# 6:  B 2014     2            2                 2
# 7:  B 2015     1            5                 1
# 8:  B 2015     2           34                 1

Or

myRank <- function(data, var) {
  var <- as.name(deparse(substitute(var))) ## <~~~ Note additional as.name
  temp <- setDT(data)[month == 2L, .(id, frank(-eval(var))), by = year] ## <~ Note the eval
  data[temp, new.employee.rank := i.V2, on = c("year", "id")][]
} 
myRank(df1, new.employee)
#    id year month new.employee new.employee.rank
# 1:  A 2014     1            4                 1
# 2:  A 2014     2            6                 1
# 3:  A 2015     1            2                 2
# 4:  A 2015     2            6                 2
# 5:  B 2014     1           23                 2
# 6:  B 2014     2            2                 2
# 7:  B 2015     1            5                 1
# 8:  B 2015     2           34                 1

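My guess is that the second option will be faster, because it avoids extracting the whole column from data.

As a side note, you can also build the name of the new column dynamically by replacing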

new.employee.rank := i.V2

with something like

paste0("New.", var, ".rank") := i.V2
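Putting the side note together with the first solution, a complete version might look like the sketch below. The function name `myRank2`, the intermediate column name `rnk` and the variable `new_col` are assumptions introduced here for illustration. The target name is built first and then wrapped in parentheses on the left-hand side of `:=`, which is the documented way to tell data.table that it is a variable holding a column name.

```r
library(data.table)

df1 <- data.table(id = c("A", "A", "A", "A", "B", "B", "B", "B"),
                  year = rep(c(2014, 2014, 2015, 2015), 2),
                  month = rep(c(1, 2), 4),
                  new.employee = c(4, 6, 2, 6, 23, 2, 5, 34))

# Hypothetical variant of myRank: the result column is named after the input.
myRank2 <- function(data, var) {
  var <- deparse(substitute(var))            # column name as a string
  new_col <- paste0("New.", var, ".rank")    # e.g. "New.new.employee.rank"
  # Rank the month-2 values within each year; name the result explicitly.
  temp <- setDT(data)[month == 2L, .(id, rnk = frank(-get(var))), by = year]
  # Join back by reference; (new_col) means "use the value of new_col as the name".
  data[temp, (new_col) := i.rnk, on = c("year", "id")][]
}

res <- myRank2(df1, new.employee)
res
```

Note that, like the original, this adds the column to the input table by reference rather than working on a copy.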
