
Column full of NA when using lapply in a data.table (R)

I have a problem using lapply in a data.table. Here are two examples:

library(data.table)
library(lubridate)

test <- function(x) 
{
  if(is.na(x)) return(NA)
  if(x=="") return(NA)
  if(substr(x,3,3)=="/") return(as_date(x,"%d/%m/%Y"))
  return(2)
}

x1<-data.table(v1=c("","07/06/2016","",NA), v2=c("2004-06-18","","2004-06-18","2004-06-18"))
x1[,lapply(.SD,test)]

x2<-data.table(v1=c("2004-06-19","2004-06-18","",NA),v2=c("2004-06-18","","2004-06-18","2004-06-18"))
x2[,lapply(.SD,test)]

In the first example, the first column after the lapply is full of NA, but what I wanted to obtain is NA, 2016-06-07, NA, NA.

In the second example, the last two rows of the first column are wrong: every row contains 2, but in my opinion those two rows should contain NA.

I don't understand how R treats NA here. What am I missing to get what I want?

After a lot of attempts, the answer is that data.table treats columns as variables, and .SD is a list whose elements are those columns. So when a function such as test is applied with lapply(.SD, ...), it receives a whole column at a time (a vector of values) and must be able to handle every element of it, not just a single value.

Here is what you should change:

testList <- function(x) 
{
  lapply(x,test)
}

x1[,lapply(.SD,testList)]
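
To illustrate the point about .SD, here is a minimal sketch showing that lapply(.SD, f) calls f once per column with the whole column as its argument. The helper is_blank is a hypothetical scalar function introduced only for this illustration; it is not part of the original question. Note that wrapping a scalar function in an inner lapply, as testList does, yields list columns, while vapply yields ordinary atomic columns.

```r
library(data.table)

x1 <- data.table(v1 = c("", "07/06/2016", "", NA),
                 v2 = c("2004-06-18", "", "2004-06-18", "2004-06-18"))

# lapply(.SD, f) calls f once per column, passing the whole column
col_lengths <- x1[, lapply(.SD, length)]
print(col_lengths)  # one row; each entry is the column length

# Hypothetical scalar helper: expects exactly one string
is_blank <- function(s) is.na(s) || s == ""

# An inner lapply returns a list per column, so the result has list columns
as_list <- x1[, lapply(.SD, function(col) lapply(col, is_blank))]

# vapply returns an atomic logical vector per column instead
as_lgl <- x1[, lapply(.SD, function(col)
  vapply(col, is_blank, logical(1), USE.NAMES = FALSE))]
print(as_lgl)
```

Whether list columns are acceptable depends on what is done with the result afterwards; atomic columns are usually easier to work with.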

If someone knows about another solution, please don't hesitate to share.

First, I can't run your example without getting an error. The second column of each of your data.tables is of class "Date", but the "" entry isn't a date. When it prints, it is formatted to look like NA. Try running is.na(x1$v2[2]) and x1$v2[2] == "".

Also, it looks like you have a problem with vectorization.

Try running test(x1$v1) and pay attention to the warning messages. is.na(x) returns a logical vector, but if() only uses the first element of that vector.

In addition: Warning message:
In if (is.na(x)) return(NA) :
  the condition has length > 1 and only the first element will be used
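
As a small base R sketch of what this warning means: the condition passed to if() has one element per row, and only the first one decided the outcome for the whole column. That would explain why the first example's v1 (which starts with "") came back as NA and the second example's v1 came back as 2. As far as I know, from R 4.2.0 onwards a condition of length greater than one is an error rather than a warning. The variable names below are only for illustration.

```r
x <- c("", "07/06/2016", "", NA)

# is.na() is vectorised: one logical value per element
cond <- is.na(x)
print(cond)
print(length(cond))

# if() needs a single TRUE/FALSE; with the warning above,
# only this first element was ever looked at
first_only <- cond[1]

# A vectorised test combines conditions element by element with |
blank <- is.na(x) | x == ""
print(blank)
```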

You might be able to fix it by applying to each row:

x1[, lapply(.SD, test), by = 1:nrow(x1)]

Otherwise you'll need to modify your test function to accept a vector of strings and return a vector of results. But you should really consider returning a vector of a single type.
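
One possible shape for such a function, as a sketch only: it takes a character vector and always returns a Date vector, so every element has the same type. The name parse_mixed_dates is hypothetical, and the decision to also parse yyyy-mm-dd strings (instead of returning 2 for them, as the original test does) is an assumption made for this example. Anything that matches neither pattern becomes NA.

```r
# Hypothetical helper: character vector in, Date vector out
parse_mixed_dates <- function(x) {
  # Start with an all-NA Date vector of the right length
  out <- rep(as.Date(NA), length(x))

  # Assumed input formats: dd/mm/yyyy and yyyy-mm-dd
  dmy <- !is.na(x) & grepl("^[0-9]{2}/[0-9]{2}/[0-9]{4}$", x)
  ymd <- !is.na(x) & grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", x)

  out[dmy] <- as.Date(x[dmy], format = "%d/%m/%Y")
  out[ymd] <- as.Date(x[ymd], format = "%Y-%m-%d")
  out
}

v1 <- c("", "07/06/2016", "2004-06-18", NA)
parsed <- parse_mixed_dates(v1)
print(parsed)
print(class(parsed))
```

Because the function is vectorised, it could be passed directly to lapply(.SD, parse_mixed_dates) on a data.table of character columns.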

Finally, I don't understand the purpose of lubridate in this example. Why not use as.Date(x, "%d/%m/%Y")? What do you gain from as_date?

Edit

You can rewrite your function to work on vectors:

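test <- function(x) 
{
  # Default result is 2 for every element
  ans <- rep.int(2, length(x))
  # NA and empty strings become NA
  ans[is.na(x) | x == ""] <- NA
  # Strings with a "/" in third position are treated as dd/mm/yyyy
  dates <- grepl('^../', x)
  # ans is numeric, so dates are stored as days since 1970-01-01;
  # base as.Date is used here, with the format named explicitly
  ans[dates] <- as.numeric(as.Date(x[dates], format = "%d/%m/%Y"))

  return(ans)
}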
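
One thing to be aware of with a function like this, shown as a short base R sketch: assigning a Date into a numeric vector drops the Date class, so the result holds the number of days since 1970-01-01 rather than a printed date. The variable names are only for illustration.

```r
d <- as.Date("07/06/2016", format = "%d/%m/%Y")

ans <- c(2, 2, 2)
ans[2] <- d          # the Date class is dropped on assignment

print(class(ans))    # plain numeric vector
print(ans)           # middle element is a day count, not a date

# The date can be recovered by supplying the origin explicitly
back <- as.Date(ans[2], origin = "1970-01-01")
print(back)
```

This is one reason why returning a vector of a single type, for example all Date, tends to be easier to reason about than mixing the placeholder 2 with dates.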
