Dropping factor levels in a data.table

Question

A public dataset contains a factor level (e.g., "(0) Omitted") that I would like to recode as NA. Ideally, I'd like to be able to scrub an entire subset at once. I'm using the data.table package and am wondering whether there is a better or faster way of accomplishing this than converting the values to character, replacing the unwanted value with NA, and then converting the data back to factors.

library(data.table)
DT <- data.table(V1=factor(sample(LETTERS,size = 2000000,replace=TRUE)),
                V2 = factor(sample(LETTERS,size = 2000000,replace=TRUE)),
                V3 = factor(sample(LETTERS,size = 2000000,replace=TRUE)))

# Convert to character
DT1 <- DT[, lapply(.SD, as.character)]
DT2 <- copy(DT1)
DT3 <- copy(DT) # Needs to be factor

# Scrub all 'B' values
DT1$V1[DT1$V1=="B"] <- NA
# Works!

DT2[V1 == "B", V1 := NA]
# Warning message:
#   In `[.data.table`(DT, V1 == "B", `:=`(V1, NA)) :
#   Coerced 'logical' RHS to 'character' to match the column's type. Either change the target column to 'logical' first (by creating a new 'logical' vector length 26 (nrows of entire table) and assign that; i.e. 'replace' column), or coerce RHS to 'character' (e.g. 1L, NA_[real|integer]_, as.*, etc) to make your intent clear and for speed. Or, set the column type correctly up front when you create the table and stick to it, please.

identical(DT1,DT2)
# [1] TRUE

# First attempt at looping over data.table
cnames <- colnames(DT3)
system.time(for(cname in cnames) {
  DT3[ ,
      (cname) := gsub("B", NA, DT3[[cname]])]
})
# user  system elapsed 
# 4.258   0.128   4.478 

identical(DT1$V1,DT3$V1)
# [1] TRUE

# Back to factors
DT3 <- DT3[, lapply(.SD, as.factor)]

Answer 1

Set the factor level to NA:

levels(DT$V1)[levels(DT$V1) == 'B'] <- NA

Example:

> d <- data.table(l=factor(LETTERS[1:3]))
> d
   l
1: A
2: B
3: C
> levels(d$l)[levels(d$l) == 'B'] <- NA
> d
    l
1:  A
2: NA
3:  C
> levels(d$l)
[1] "A" "C"
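The question asks about scrubbing several columns at once, so here is a minimal sketch of the same `levels<-` idea applied in a loop. The small table and the loop variable `col` are made up for illustration, and the use of `set()` to write each column back is my own assumption, not part of the answer above.

```r
library(data.table)

# Toy table (hypothetical data, for illustration only)
DT <- data.table(V1 = factor(c("A", "B", "C")),
                 V2 = factor(c("B", "B", "D")))

for (col in names(DT)) {
  f <- DT[[col]]
  # Assigning NA to a level drops that level and turns its rows into NA
  levels(f)[levels(f) == "B"] <- NA
  # Write the modified factor back into the table
  set(DT, j = col, value = f)
}

levels(DT$V1)
levels(DT$V2)
```

To restrict this to a subset of columns, replace `names(DT)` with a character vector of the column names you want to clean.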

Answer 2

You can change the levels as follows:

for (j in seq_along(DT)) {
    x  = DT[[j]]
    lx = levels(x)
    lx[lx == "B"] = NA
    setattr(x, 'levels', lx)      ## reset levels by reference
}
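One thing that may be worth checking after the by-reference approach: `setattr()` writes the levels attribute as given, so NA may remain stored as a level instead of the affected rows becoming true missing values. The sketch below, on a made-up three-row table, shows how one might inspect this and, if needed, normalize the column with `factor()`. The normalization step is my own suggestion, not part of the answer.

```r
library(data.table)

# Toy table (hypothetical data, for illustration only)
d <- data.table(l = factor(c("A", "B", "C")))

x  <- d[["l"]]
lx <- levels(x)
lx[lx == "B"] <- NA
setattr(x, "levels", lx)   # same by-reference step as in the loop above

anyNA(levels(d$l))   # is NA still stored as a level?
is.na(d$l)           # are the former "B" rows reported as missing?

# Assumption: if NA is kept as a level, rebuilding the factor with factor()
# (default exclude = NA) turns those rows into true NA values.
d[, l := factor(l)]

levels(d$l)
is.na(d$l)
```

The rebuild costs a pass over the column, so whether it is worthwhile depends on whether downstream code relies on `is.na()`.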

Answer 1
2, accepted, 2014-05-04 00:17:20

Answer 2
2, 2014-05-04 00:22:37
