Recoding data.table values in a loop using a vector of column names

Question

I have a large data table that contains some categorical variables, where missing values have been coded as blank strings. I would like to recode them to NA.

I have a vector storing the names of the categorical variables:

categorical_variables = c("v3", etc.

The vector is definitely set up correctly: I have successfully used it to loop through plots of each column. However, when I try to recode using this...

for (v in categorical_variables) myDataTable[get(v)=="",get(v):=NA]

...I get the following error:

 Error in get(v) : object 'v3' not found

Yet this works OK:

myDataTable[v3=="",v3:=NA]

And this also works OK:

myDataTable[get("v3")=="",get("v3")]

So it is when I try to do the assignment using get() combined with := that the error is thrown. What am I doing wrong?

The data.table is very large (hence my preference for using data.table), so ideally I don't want to convert to data.frame and use a base R approach. I feel like this should be a very straightforward procedure in data.table, but I've really struggled to find anything conclusive in the documentation, on Google, or on here! Is this a bug or am I missing something obvious?

Answer 1

We can use set(). According to ?set, it is very fast because the overhead of [.data.table is avoided.

library(data.table)
for (v in categorical_variables) {
  # find the rows where column v is a blank string, then assign NA there by reference
  set(myDataTable, i = which(myDataTable[[v]] == ""), j = v, value = NA)
}
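For comparison, here is a minimal sketch of the loop from the question kept inside [.data.table. The left-hand side of := accepts a character variable wrapped in parentheses, so (v) can stand in for get(v) there, while get(v) remains usable in i. The sample table and the choice of c("v3", "v7") as the categorical columns are assumptions made for illustration, since the question's vector is truncated.

```r
library(data.table)

# Assumed sample data and column vector (not taken from the question)
myDataTable <- data.table(
  v3 = c("a", "b", "c", ""),
  v5 = 1:4,
  v7 = c("", "", "a", "b")
)
categorical_variables <- c("v3", "v7")

for (v in categorical_variables) {
  # i: select rows where the column named by v is a blank string
  # j: (v) tells := to use the value of v as the column name
  myDataTable[get(v) == "", (v) := NA]
}

print(myDataTable)
```

This still pays the [.data.table overhead on every iteration, so set() is likely the better choice when there are many columns.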

However, the recoding can be avoided altogether at read time, as fread has the na.strings option (just like read.csv/read.table). We can specify the strings that need to be read as NA; for example, to read both "" and $ as NA:

myDataTable <- fread("yourfile.csv", na.strings=c("", "$"))
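To try this without a file on disk, the following sketch passes a few CSV lines through fread's text argument. The lines themselves are hypothetical stand-ins for "yourfile.csv"; only the na.strings usage comes from the answer.

```r
library(data.table)

# Hypothetical CSV content, one element per line
csv_lines <- c(
  "v5,v3,v7",
  "1,a,x",
  "2,$,y",
  "3,,z"
)

# Fields equal to an entry of na.strings are read as NA
dt <- fread(text = csv_lines, na.strings = c("", "$"))

print(dt)
print(is.na(dt$v3))
```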

Data

myDataTable <- data.table(v3=c(letters[1:3], ''), 
        v5 = 1:4, v7 = c('', '', letters[1:2]))
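As a quick check, the set() loop can be run on this sample data and the NA count per column inspected afterwards. This is only a sketch: taking v3 and v7 as the categorical columns is an assumption, because the question does not show the full vector.

```r
library(data.table)

myDataTable <- data.table(v3 = c(letters[1:3], ''),
        v5 = 1:4, v7 = c('', '', letters[1:2]))

# Assumed column vector for illustration
categorical_variables <- c("v3", "v7")

for (v in categorical_variables) {
  set(myDataTable, i = which(myDataTable[[v]] == ""), j = v, value = NA)
}

# Count the NAs in each column after recoding
na_counts <- sapply(myDataTable, function(x) sum(is.na(x)))
print(na_counts)
```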

Answer 1: 2, accepted, 2016-03-12 22:22:25
