Writing a user-defined R function around data.table commands: how to refer to a column properly

I have the following data, df1:

df1 <- data.frame(id=c("A","A","A","A","B","B","B","B"),
                        year=c(2014,2014,2015,2015),
                        month=c(1,2),
                        new.employee=c(4,6,2,6,23,2,5,34))

  id year month new.employee
1  A 2014     1            4
2  A 2014     2            6
3  A 2015     1            2
4  A 2015     2            6
5  B 2014     1           23
6  B 2014     2            2
7  B 2015     1            5
8  B 2015     2           34

and I get the desired outcome with the following code:

library(data.table) # V1.9.6+
temp <- setDT(df1)[month == 2L, .(id, frank(-new.employee)), by = year]
df1[temp, new.employee.rank := i.V2, on = c("year", "id")]
df1
#    id year month new.employee new.employee.rank
# 1:  A 2014     1            4                 1
# 2:  A 2014     2            6                 1
# 3:  A 2015     1            2                 2
# 4:  A 2015     2            6                 2
# 5:  B 2014     1           23                 2
# 6:  B 2014     2            2                 2
# 7:  B 2015     1            5                 1
# 8:  B 2015     2           34                 1

Now I want to wrap this in a user-defined function so that I can vary the input column, which is new.employee in the example above. I tried several approaches, but none of them worked:

  1. The first try:

     myRank <- function(data, var) {
       temp <- setDT(data)[month == 2L, .(id, frank(-var)), by = year]
       data[temp, new.employee.rank := i.V2, on = c("year", "id")]
       return(data)
     }
     myRank(df1, new.employee)

     Error in is.data.frame(x) : object 'new.employee' not found

  2. The second try:

     myRank(df1, df1$new.employee)

     Nothing was printed.

  3. The third try, with the function changed slightly:

     myRank <- function(data, var) {
       temp <- setDT(data)[month == 2L, .(id, rank(data$var)), by = year]
       data[temp, new.employee.rank := i.V2, on = c("year", "id")]
       return(data)
     }
     myRank(df1, df1$new.employee)

     Warning messages:
     1: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
     2: In `[.data.table`(setDT(data), month == 2L, .(id, rank(data$var)), :
       Item 2 of j's result for group 1 is zero length. This will be filled with 2 NAs to match the longest column in this result. Later groups may have a similar problem but only the first is reported to save filling the warning buffer.
     3: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'

I looked at similar questions, but I don't have enough R experience to understand the answers.

data.table uses non-standard evaluation by default (unless you start messing around with with = FALSE), so you need to refer to your column by name or, alternatively, use get. Another problem with your code (as mentioned in the comments) is that you are passing new.employee even though it is not defined outside the scope of df1.

If you want to prevent R from evaluating it before it reaches your data set, you can use the deparse(substitute(var)) combination, which prevents evaluation and converts var to a character string. That string can in turn be passed to get or to the eval(as.name()) combination (these do entirely different things, but within the data.table scope they lead to the same result).

Finally, there is the printing issue after using := within a function. Even if everything works, return(data) returns the result without printing it, which is why your second try showed nothing. You need to force printing, either by adding an extra [] or by explicitly calling print.
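To see what these pieces do in isolation, here is a minimal base R sketch (no data.table needed). The helper col_name and the toy data frame df are names made up for this illustration, and with() is only a stand-in for the way data.table evaluates j with the columns visible as variables.

```r
# Toy data; `df` and `col_name` are hypothetical names used only in this sketch
df <- data.frame(new.employee = c(4, 6, 2))

# substitute() captures the unevaluated argument, deparse() turns it into a string
col_name <- function(var) deparse(substitute(var))
nm <- col_name(new.employee)      # the symbol new.employee is never evaluated

# Two ways to turn the string back into the column, evaluated "inside" df
by_get  <- with(df, get(nm))      # look the object up by its name
by_eval <- eval(as.name(nm), df)  # build a symbol, then evaluate it in df

# Why data$var fails: `$` looks for a column literally called "var"
var <- "new.employee"
by_dollar  <- df$var              # NULL, there is no column named "var"
by_bracket <- df[[var]]           # `[[` uses the value of var instead
```

Both by_get and by_eval end up holding the new.employee column, while by_dollar is NULL, which matches the "type 'NULL'" warnings from the third try.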

Here's a possible solution

myRank <- function(data, var) {
  var <- deparse(substitute(var)) ## <~~~ Note this
  temp <- setDT(data)[month == 2L, .(id, frank(-get(var))), by = year] ## <~~ Note the get
  data[temp, new.employee.rank := i.V2, on = c("year", "id")][] ## <~~ Note the []
}       
myRank(df1, new.employee)
#    id year month new.employee new.employee.rank
# 1:  A 2014     1            4                 1
# 2:  A 2014     2            6                 1
# 3:  A 2015     1            2                 2
# 4:  A 2015     2            6                 2
# 5:  B 2014     1           23                 2
# 6:  B 2014     2            2                 2
# 7:  B 2015     1            5                 1
# 8:  B 2015     2           34                 1

Or

myRank <- function(data, var) {
  var <- as.name(deparse(substitute(var))) ## <~~~ Note additional as.name
  temp <- setDT(data)[month == 2L, .(id, frank(-eval(var))), by = year] ## <~ Note the eval
  data[temp, new.employee.rank := i.V2, on = c("year", "id")][]
} 
myRank(df1, new.employee)
#    id year month new.employee new.employee.rank
# 1:  A 2014     1            4                 1
# 2:  A 2014     2            6                 1
# 3:  A 2015     1            2                 2
# 4:  A 2015     2            6                 2
# 5:  B 2014     1           23                 2
# 6:  B 2014     2            2                 2
# 7:  B 2015     1            5                 1
# 8:  B 2015     2           34                 1

I would guess the second option will be faster, as it avoids extracting the whole column out of data, but I have not benchmarked it.


As a side note, you could also make the name of the new variable dynamic by replacing

new.employee.rank := i.V2

with something like

paste0("New.", var, ".rank") := i.V2 
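Combined with the first solution, that could look roughly like the sketch below. The variable new_col is my own addition, the data is built directly as a data.table with explicit rep() calls so the example is self-contained, and I name the rank column V2 explicitly. I also wrap the left-hand side of := in parentheses, which is the documented way to tell data.table that the column name is stored in a variable.

```r
library(data.table)

# Same values as df1 in the question, built as a data.table for this sketch
df1 <- data.table(id = rep(c("A", "B"), each = 4),
                  year = rep(c(2014, 2014, 2015, 2015), 2),
                  month = rep(c(1, 2), 4),
                  new.employee = c(4, 6, 2, 6, 23, 2, 5, 34))

myRank <- function(data, var) {
  var <- deparse(substitute(var))            # column name as a string
  new_col <- paste0("New.", var, ".rank")    # hypothetical naming scheme
  # Rank within each year, using only the month == 2 rows
  temp <- setDT(data)[month == 2L, .(id, V2 = frank(-get(var))), by = year]
  # Join the ranks back; (new_col) makes data.table use the value of new_col
  data[temp, (new_col) := i.V2, on = c("year", "id")][]
}

res <- myRank(df1, new.employee)
```

Because := updates by reference, df1 itself also gains the New.new.employee.rank column; copy() the table first if that is not what you want.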
