Using lapply in data.table (R) fills a column with unwanted NA

Question

I have a problem using lapply in a data.table. Here are two examples:

library(data.table)
library(lubridate)

test <- function(x) 
{
  if(is.na(x)) return(NA)
  if(x=="") return(NA)
  if(substr(x,3,3)=="/") return(as_date(x,"%d/%m/%Y"))
  return(2)
}

x1<-data.table(v1=c("","07/06/2016","",NA), v2=c("2004-06-18","","2004-06-18","2004-06-18"))
x1[,lapply(.SD,test)]

x2<-data.table(v1=c("2004-06-19","2004-06-18","",NA),v2=c("2004-06-18","","2004-06-18","2004-06-18"))
x2[,lapply(.SD,test)]

In the first example, the first column after the lapply is full of NA, but what I wanted to obtain is NA, 2016-06-07, NA, NA.

In the second example, the last two rows of the first column are wrong: each row contains 2, but in my opinion they should contain NA.

I don't understand how R treats NA here. What am I missing to get what I want?

Answer 1

After a lot of tries, the answer is that data.table treats columns as variables, and .SD is a list whose elements are those columns. So a function applied with lapply(.SD, ...), such as test here, receives a whole column as its argument and must be able to handle it, rather than a single value.

Here is what you should change:

testList <- function(x) 
{
  lapply(x,test)
}

x1[,lapply(.SD,testList)]
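To see why the wrapper is needed, here is a small sketch using the question's x1 table. The names col_lengths and cell_classes are only illustrative. It shows that each call made by lapply(.SD, ...) gets a full column, and that an inner lapply over the cells produces list columns.

```r
library(data.table)

x1 <- data.table(v1 = c("", "07/06/2016", "", NA),
                 v2 = c("2004-06-18", "", "2004-06-18", "2004-06-18"))

# Each function call receives a whole column (4 elements), not one cell
col_lengths <- x1[, lapply(.SD, length)]

# An inner lapply works cell by cell, so each result column is a list
cell_classes <- x1[, lapply(.SD, function(col) lapply(col, class))]
```

Because the results are list columns, each cell can hold a different type, which is what lets test return NA, a date, or 2 in the same column.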

If someone knows about another solution, please don't hesitate to share.

Answer 2

First, I can't run your example without throwing an error. The second columns of your data.tables are of class "Date", but the "" entry isn't a date. When it prints, it's formatted to look like NA. Try running is.na(x1$v2[2]) and x1$v2[2] == "".

Also, it looks like you have a problem with vectorization.

Try running test(x1$v1). Pay attention to the warning messages. is.na(x) returns a logical vector, but if only uses the first element of that vector.

In addition: Warning message:
In if (is.na(x)) return(NA) :
  the condition has length > 1 and only the first element will be used
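A minimal base R sketch of the same issue, outside data.table. The variable name outcome is just for illustration. As far as I know, R 4.2.0 and later raise an error instead of this warning, so the sketch captures either case.

```r
x <- c("", "07/06/2016", "", NA)

# is.na() is vectorised: it returns one TRUE/FALSE per element
na_flags <- is.na(x)

# `if` expects a single TRUE/FALSE. With a longer condition, older R versions
# warn and use only the first element; newer versions stop with an error.
outcome <- tryCatch(
  {
    if (is.na(x)) "first element is NA" else "first element is not NA"
  },
  warning = function(w) "warning",
  error   = function(e) "error"
)

# Element-wise alternative: logical indexing handles every element at once
x[is.na(x) | x == ""] <- NA
```

After the last line, x holds NA in positions 1, 3 and 4 and keeps "07/06/2016" in position 2.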

You might be able to fix it by applying to each row:

x1[, lapply(.SD, test), by = 1:nrow(x1)]

Otherwise you'll need to modify your test function to accept a vector of strings and return a vector of results. But you should really consider returning a vector of a single type.
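As a rough sketch of the single-type idea, the hypothetical helper below (parse_dmy is my name, not something from the question) takes a character vector and always returns a Date vector. It assumes that anything not in dd/mm/YYYY form should become NA, which drops the placeholder value 2 from the original function.

```r
# Hypothetical helper: parse dd/mm/YYYY strings, NA for everything else
parse_dmy <- function(x) {
  # Start with an all-NA Date vector so the result has one type
  out <- rep(as.Date(NA), length(x))
  # grepl() returns FALSE for NA input, so NA and "" are both skipped
  is_dmy <- !is.na(x) & grepl("^[0-9]{2}/[0-9]{2}/[0-9]{4}$", x)
  out[is_dmy] <- as.Date(x[is_dmy], format = "%d/%m/%Y")
  out
}

v1 <- c("", "07/06/2016", "", NA)
v2 <- c("2004-06-18", "", "2004-06-18", "2004-06-18")

# Apply column by column, the same way lapply(.SD, ...) would
res <- lapply(list(v1 = v1, v2 = v2), parse_dmy)
```

With the question's table this would be called as x1[, lapply(.SD, parse_dmy)], and the first column comes out as NA, 2016-06-07, NA, NA.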

Finally, I don't understand the purpose of lubridate in this example. Why not use as.Date(x, "%d/%m/%Y")? What do you gain from as_date?

Edit

You can rewrite your function to work on vectors:

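test <- function(x) 
{
  # Default result is 2; ans is numeric, so every value below is stored as a number
  ans <- rep.int(2, length(x))
  ans[is.na(x) | x == ""] <- NA
  # Same check as the original: two characters followed by "/" at the start
  dates <- grepl("^../", x)
  # Dates end up as days since 1970-01-01 because ans is numeric
  ans[dates] <- as_date(x[dates], format = "%d/%m/%Y")

  return(ans)
}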
