Make R user-defined-function for data.table commands - How to refer a column properly
I have the data frame df1:
df1 <- data.frame(id=c("A","A","A","A","B","B","B","B"),
year=c(2014,2014,2015,2015),
month=c(1,2),
new.employee=c(4,6,2,6,23,2,5,34))
id year month new.employee
1 A 2014 1 4
2 A 2014 2 6
3 A 2015 1 2
4 A 2015 2 6
5 B 2014 1 23
6 B 2014 2 2
7 B 2015 1 5
8 B 2015 2 34
The expected result, produced by the following code:
library(data.table) # V1.9.6+
temp <- setDT(df1)[month == 2L, .(id, frank(-new.employee)), by = year]
df1[temp, new.employee.rank := i.V2, on = c("year", "id")]
df1
# id year month new.employee new.employee.rank
# 1: A 2014 1 4 1
# 2: A 2014 2 6 1
# 3: A 2015 1 2 2
# 4: A 2015 2 6 2
# 5: B 2014 1 23 2
# 6: B 2014 2 2 2
# 7: B 2015 1 5 1
# 8: B 2015 2 34 1
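Now I want to wrap this in a user-defined function so that I can vary the input column, which is new.employee in the example above. I tried a few approaches, but none of them worked:
First attempt: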
myRank <- function(data,var) {
  temp <- setDT(data)[month == 2L, .(id, frank(-var)), by = year]
  data[temp, new.employee.rank := i.V2, on = c("year", "id")]
  return(data)
}
myRank(df1,new.employee)
Error in is.data.frame(x) : object 'new.employee' not found
Second attempt:
myRank(df1,df1$new.employee)
Nothing appeared.
Third attempt: I changed the function slightly
myRank <- function(data,var) {
  temp <- setDT(data)[month == 2L, .(id, rank(data$var)), by = year]
  data[temp, new.employee.rank := i.V2, on = c("year", "id")]
  return(data)
}
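myRank(df1,df1$new.employee)
Warning messages:
1: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
2: In `[.data.table`(setDT(data), month == 2L, .(id, rank(data$var)), :
  Item 2 of j's result for group 1 is zero length. This will be filled with 2 NAs to match the longest column in this result. Later groups may have a similar problem but only the first is reported to save filling the warning buffer.
3: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'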
I have looked at similar questions, but I do not have enough R experience to understand them.
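data.table uses non-standard evaluation by default (unless you start to mess around with with = FALSE), so you need to refer to your column by name or, alternatively, use get.
Another problem with your code (as mentioned in the comments) is that you are passing new.employee, which exists only as a column of df1 and is not defined outside it, so R cannot find it when the argument is evaluated. If you want to prevent R from evaluating it before it reaches your data set, you can use the deparse(substitute(var)) combination: substitute captures the argument unevaluated and deparse converts it to a character string. That string can then be passed either to get or to the eval(as.name()) combination (the two do entirely different things within the data.table scope, but give the same result).
Finally, there is the printing issue after using := within a function. Even when everything works, return(data) prints nothing when the function is called; you need to force printing, either by appending an extra [] or by calling print explicitly.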
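To make the moving parts concrete, here is a minimal sketch on a tiny table. The helper col_name and the table dt are hypothetical names introduced only for this illustration; the sketch assumes data.table is installed and shows the argument being captured as a string, then resolved to a column with get and with eval(as.name()).

```r
library(data.table)

# Hypothetical helper (illustration only): capture the argument without
# evaluating it and convert it to a character string.
col_name <- function(var) deparse(substitute(var))

nm <- col_name(new.employee)  # new.employee is never evaluated here
print(nm)

# Small made-up table, not the df1 from the question.
dt <- data.table(id = c("A", "B"), new.employee = c(6, 2))

# get() resolves the string to the column inside j.
rank_get <- dt[, frank(-get(nm))]

# eval() of a symbol built from the same string reaches the same column.
sym <- as.name(nm)
rank_eval <- dt[, frank(-eval(sym))]

print(rank_get)
print(rank_eval)
```

Both calls should rank the larger value first, since the column is negated before ranking.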
Here is a possible solution:
myRank <- function(data, var) {
  var <- deparse(substitute(var)) ## <~~~ Note this
  temp <- setDT(data)[month == 2L, .(id, frank(-get(var))), by = year] ## <~~ Note the get
  data[temp, new.employee.rank := i.V2, on = c("year", "id")][] ## <~~ Note the []
}
myRank(df1, new.employee)
# id year month new.employee new.employee.rank
# 1: A 2014 1 4 1
# 2: A 2014 2 6 1
# 3: A 2015 1 2 2
# 4: A 2015 2 6 2
# 5: B 2014 1 23 2
# 6: B 2014 2 2 2
# 7: B 2015 1 5 1
# 8: B 2015 2 34 1
Or
myRank <- function(data, var) {
  var <- as.name(deparse(substitute(var))) ## <~~~ Note additional as.name
  temp <- setDT(data)[month == 2L, .(id, frank(-eval(var))), by = year] ## <~ Note the eval
  data[temp, new.employee.rank := i.V2, on = c("year", "id")][]
}
myRank(df1, new.employee)
# id year month new.employee new.employee.rank
# 1: A 2014 1 4 1
# 2: A 2014 2 6 1
# 3: A 2015 1 2 2
# 4: A 2015 2 6 2
# 5: B 2014 1 23 2
# 6: B 2014 2 2 2
# 7: B 2015 1 5 1
# 8: B 2015 2 34 1
I am guessing the second option will be faster, because it avoids extracting the whole column out of data.
As a side note, you could also build the name of the new variable dynamically by replacing
new.employee.rank := i.V2
with something like
paste0("New.", var, ".rank") := i.V2
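Putting the side note together with the first solution might look like the sketch below. It assumes var is still a character string (the deparse(substitute(var)) variant, not the as.name one), and new_col is a hypothetical local name. The computed name is wrapped in parentheses on the left of :=, which is the form the data.table documentation describes for a left-hand side held in a variable.

```r
library(data.table)

# Sketch only: same logic as the get() solution, with a derived column name.
myRank <- function(data, var) {
  var <- deparse(substitute(var))               # column name as a string
  new_col <- paste0("New.", var, ".rank")       # hypothetical local variable
  temp <- setDT(data)[month == 2L, .(id, frank(-get(var))), by = year]
  data[temp, (new_col) := i.V2, on = c("year", "id")][]
}

# Same data as in the question, written out in full.
df1 <- data.frame(id = rep(c("A", "B"), each = 4),
                  year = rep(c(2014, 2014, 2015, 2015), 2),
                  month = rep(c(1, 2), 4),
                  new.employee = c(4, 6, 2, 6, 23, 2, 5, 34))

res <- myRank(df1, new.employee)
print(names(res))
```

Working with the returned value (res here) rather than relying on df1 having been changed in place keeps the example independent of how setDT treats the caller's object.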