Removing factor level in data.table

A public dataset contains a factor level (e.g., "(0) Omitted") that I would like to recode as NA. Ideally, I'd like to be able to scrub an entire subset at once. I'm using the data.table package and am wondering whether there is a better or faster way to accomplish this than converting the values to character, dropping the unwanted value, and then converting the data back to factors.

library(data.table)
DT <- data.table(V1=factor(sample(LETTERS,size = 2000000,replace=TRUE)),
                V2 = factor(sample(LETTERS,size = 2000000,replace=TRUE)),
                V3 = factor(sample(LETTERS,size = 2000000,replace=TRUE)))

# Convert to character
DT1 <- DT[, lapply(.SD, as.character)]
DT2 <- copy(DT1)
DT3 <- copy(DT) # Needs to be factor

# Scrub all 'B' values
DT1$V1[DT1$V1=="B"] <- NA
# Works!

DT2[V1 == "B", V1 := NA]
# Warning message:
#   In `[.data.table`(DT, V1 == "B", `:=`(V1, NA)) :
#   Coerced 'logical' RHS to 'character' to match the column's type. Either change the target column to 'logical' first (by creating a new 'logical' vector length 26 (nrows of entire table) and assign that; i.e. 'replace' column), or coerce RHS to 'character' (e.g. 1L, NA_[real|integer]_, as.*, etc) to make your intent clear and for speed. Or, set the column type correctly up front when you create the table and stick to it, please.

identical(DT1,DT2)
# [1] TRUE

# First attempt at looping over data.table
cnames <- colnames(DT3)
system.time(for(cname in cnames) {
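  # gsub() returns a character vector, so each column becomes character here;
  # wrapping the name in parentheses replaces the deprecated with=FALSE
  DT3[, (cname) := gsub("B", NA, DT3[[cname]])]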
})
# user  system elapsed 
# 4.258   0.128   4.478 

identical(DT1$V1,DT3$V1)
# [1] TRUE

# Back to factors
DT3 <- DT3[, lapply(.SD, as.factor)]

Set the factor level to NA:

levels(DT$V1)[levels(DT$V1) == 'B'] <- NA

Example:

> d <- data.table(l=factor(LETTERS[1:3]))
> d
   l
1: A
2: B
3: C
> levels(d$l)[levels(d$l) == 'B'] <- NA
> d
    l
1:  A
2: NA
3:  C
> levels(d$l)
[1] "A" "C"
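
To scrub several columns at once, the same level assignment can be looped over the columns. The following is only a sketch under stated assumptions: the small table, the seed, and the helper variable `n_b` are made up for illustration, and `set()` is used to write each modified factor back into the table.

```r
library(data.table)

set.seed(1)  # illustrative data only, not the public dataset from the question
DT <- data.table(V1 = factor(sample(LETTERS, 1000, replace = TRUE)),
                 V2 = factor(sample(LETTERS, 1000, replace = TRUE)))

n_b <- sum(DT$V1 == "B")  # hypothetical helper: rows coded "B" before scrubbing

for (j in names(DT)) {
  x <- DT[[j]]
  # `levels<-` drops the NA level and remaps the codes, so rows that were
  # "B" become real missing values
  levels(x)[levels(x) == "B"] <- NA
  set(DT, j = j, value = x)  # write the column back into the table
}

sum(is.na(DT$V1)) == n_b   # every former "B" row is now NA
"B" %in% levels(DT$V1)     # the level itself is gone
```

The columns stay factors throughout, so no round trip through character is needed.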

You can change the levels as follows:

for (j in seq_along(DT)) {
    x  = DT[[j]]
    lx = levels(x)
    lx[lx == "B"] = NA
    setattr(x, 'levels', lx)      ## reset levels by reference
}
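
One thing worth checking with the by-reference loop above: `setattr()` stores the levels vector as given, so the NA may end up as a level of the factor rather than as missing values in the rows. R's factor documentation notes that `is.na()` looks at the integer codes, not at the level labels. The sketch below, on a made-up toy table `d`, shows one way to check what you got.

```r
library(data.table)

d <- data.table(l = factor(c("A", "B", "C", "B")))  # toy data for illustration

x  <- d[["l"]]
lx <- levels(x)
lx[lx == "B"] <- NA
setattr(x, "levels", lx)   # changes the levels attribute in place, no copy

anyNA(levels(d$l))   # TRUE if NA was stored as a level
sum(is.na(d$l))      # number of rows that actually count as missing
```

If the rows need to be truly missing (for example for `is.na()` or `na.omit()`), assigning through `levels(x)[...] <- NA` remaps the codes at the cost of a copy.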
