Make an R user-defined function for data.table commands - how to refer to a column properly

I have the following data, df1:

df1 <- data.frame(id=c("A","A","A","A","B","B","B","B"),
                        year=c(2014,2014,2015,2015),
                        month=c(1,2),
                        new.employee=c(4,6,2,6,23,2,5,34))

  id year month new.employee
1  A 2014     1            4
2  A 2014     2            6
3  A 2015     1            2
4  A 2015     2            6
5  B 2014     1           23
6  B 2014     2            2
7  B 2015     1            5
8  B 2015     2           34

and I get the desired outcome with the following code:

library(data.table) # V1.9.6+
temp <- setDT(df1)[month == 2L, .(id, frank(-new.employee)), by = year]
df1[temp, new.employee.rank := i.V2, on = c("year", "id")]
df1
#    id year month new.employee new.employee.rank
# 1:  A 2014     1            4                 1
# 2:  A 2014     2            6                 1
# 3:  A 2015     1            2                 2
# 4:  A 2015     2            6                 2
# 5:  B 2014     1           23                 2
# 6:  B 2014     2            2                 2
# 7:  B 2015     1            5                 1
# 8:  B 2015     2           34                 1

Now I want to explore the data by wrapping this in a user-defined function so that I can vary the input column, which is new.employee in the example above. I tried several approaches, but none of them worked:

  1. The first try:

     myRank <- function(data, var) {
       temp <- setDT(data)[month == 2L, .(id, frank(-var)), by = year]
       data[temp, new.employee.rank := i.V2, on = c("year", "id")]
       return(data)
     }
     myRank(df1, new.employee)

     Error in is.data.frame(x) : object 'new.employee' not found

  2. The second try:

     myRank(df1, df1$new.employee)

     Nothing appeared.

  3. The third try, with the function changed slightly:

     myRank <- function(data, var) {
       temp <- setDT(data)[month == 2L, .(id, rank(data$var)), by = year]
       data[temp, new.employee.rank := i.V2, on = c("year", "id")]
       return(data)
     }
     myRank(df1, df1$new.employee)

     Warning messages:
     1: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
     2: In [.data.table(setDT(data), month == 2L, .(id, rank(data$var)),  :
       Item 2 of j's result for group 1 is zero length. This will be filled with 2 NAs to match the longest column in this result. Later groups may have a similar problem but only the first is reported to save filling the warning buffer.
     3: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'

I looked at similar questions, but my R experience is not good enough to understand them.

data.table uses non-standard evaluation by default (unless you start messing around with with = FALSE), so inside a function you need to refer to your column by name or, alternatively, use get.

Another problem with your code (as mentioned in the comments) is that you are passing new.employee even though it is not defined outside the scope of df1. If you want to prevent R from evaluating it before it reaches your data set, you can use the deparse(substitute(var)) combination, which prevents evaluation and converts var to a character string. That string can in turn be passed to get, or to the eval(as.name()) combination (the two do entirely different things, but within the data.table scope they lead to the same result).

Finally, there is the printing issue after using := within a function. Even if everything works, return(data) will not print anything at the console; you need to force printing, either by appending an additional [] or by explicitly calling print.
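To see what each of these pieces does in isolation, here is a minimal base-R sketch that does not involve data.table at all. The helper show_arg and the list cols are hypothetical names introduced only for this illustration; the list merely stands in for the columns of a table.

```r
# show_arg() is a hypothetical helper used only for this illustration.
show_arg <- function(var) {
  expr <- substitute(var)   # the unevaluated expression the caller typed
  deparse(expr)             # the same expression as a character string
}

# new.employee does not need to exist as an object for this to work.
col_name <- show_arg(new.employee)
col_name
# [1] "new.employee"

# A plain list stands in for the columns of a table here (assumption).
cols <- list(new.employee = c(4, 6, 2))

# get() looks an object up by its name given as a string ...
via_get <- get(col_name, envir = list2env(cols))
# ... while eval() evaluates a symbol (built here with as.name()).
via_eval <- eval(as.name(col_name), cols)

identical(via_get, via_eval)
# [1] TRUE
```

Inside a data.table call the columns play the role of cols, which is why both routes end up at the same vector.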

Here's a possible solution:

myRank <- function(data, var) {
  var <- deparse(substitute(var)) ## <~~~ Note this
  temp <- setDT(data)[month == 2L, .(id, frank(-get(var))), by = year] ## <~~ Note the get
  data[temp, new.employee.rank := i.V2, on = c("year", "id")][] ## <~~ Note the []
}       
myRank(df1, new.employee)
#    id year month new.employee new.employee.rank
# 1:  A 2014     1            4                 1
# 2:  A 2014     2            6                 1
# 3:  A 2015     1            2                 2
# 4:  A 2015     2            6                 2
# 5:  B 2014     1           23                 2
# 6:  B 2014     2            2                 2
# 7:  B 2015     1            5                 1
# 8:  B 2015     2           34                 1

Or:

myRank <- function(data, var) {
  var <- as.name(deparse(substitute(var))) ## <~~~ Note additional as.name
  temp <- setDT(data)[month == 2L, .(id, frank(-eval(var))), by = year] ## <~ Note the eval
  data[temp, new.employee.rank := i.V2, on = c("year", "id")][]
} 
myRank(df1, new.employee)
#    id year month new.employee new.employee.rank
# 1:  A 2014     1            4                 1
# 2:  A 2014     2            6                 1
# 3:  A 2015     1            2                 2
# 4:  A 2015     2            6                 2
# 5:  B 2014     1           23                 2
# 6:  B 2014     2            2                 2
# 7:  B 2015     1            5                 1
# 8:  B 2015     2           34                 1

I would guess the second option will be faster, as it avoids extracting the whole column out of data.


As a side note, you could also build the name of the new variable dynamically by replacing

new.employee.rank := i.V2

with something like

paste0("New.", var, ".rank") := i.V2 
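Put together, a function with a dynamically named output column might look like the sketch below. The name myRank2, the intermediate variable new_col, and the rebuilt copy of df1 are assumptions made for illustration. It also wraps the name in parentheses, (new_col) := ..., which is the form data.table documents for taking the column name from a variable, rather than placing paste0() directly on the left-hand side.

```r
library(data.table)

# Sketch only: same logic as myRank above, but the new column is named
# after the input column. myRank2 and new_col are illustrative names.
myRank2 <- function(data, var) {
  var <- deparse(substitute(var))           # column name as a string
  new_col <- paste0("New.", var, ".rank")   # e.g. "New.new.employee.rank"
  temp <- setDT(data)[month == 2L, .(id, frank(-get(var))), by = year]
  # Parentheses make data.table use the value of new_col as the column name
  data[temp, (new_col) := i.V2, on = c("year", "id")][]
}

# Same data as in the question, written out without recycling
df1 <- data.frame(id = rep(c("A", "B"), each = 4),
                  year = rep(c(2014, 2014, 2015, 2015), times = 2),
                  month = rep(c(1, 2), times = 4),
                  new.employee = c(4, 6, 2, 6, 23, 2, 5, 34))

res <- myRank2(df1, new.employee)
names(res)
```

Calling the function with a different column then produces a differently named rank column, so several rankings can sit side by side in the same table.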
