
Replace multiple values in data.table with R

I have tried multiple ways to replace two values in a data table with NA.

The data are here. There are two values, 9223372036854775807 and 2147483647, which I intend to replace with NA.

library(data.table)

data <- fread("https://raw.githubusercontent.com/Deborah-Jia/Complete_Analysis_da2/main/eg1.csv", integer64 = "numeric")

I tried:

data[data = 9223372036854775807|2147483647]

and got this error:

Error in [.data.table (data, , data = 9223372036854775808 | 2147483647, : unused argument (data = 9223372036854775808 | 2147483647)

I checked the structure of [i, j, by...] but couldn't find the cause. So I used a for loop instead:

library(dplyr)  # needed for %>%, select() and matches()

# only these cols have 9223372036854775807 and 2147483647
special_col <- data %>% select(matches("price|size|room")) %>% colnames()

for ( icol in special_col) {
  data[icol == 9223372036854775807|2147483647, icol := NA] 
}

It didn't work as expected; I can still find 2147483647 in the data table.

I know I can use

data[total_room_count_high == 9223372036854775807|2147483647, total_room_count_high := NA] 

and repeat it for each column, but that is rather tedious.

Before these methods, I also tried across, filter_at and mapply combined with a function to process each column. But as soon as I put col inside data[ ], data.table treats col as a column name rather than as a variable holding the column names.

For comparison you should use ==, and each value needs its own comparison. You can then combine the two conditions with | like this:

data <- read.csv2("https://raw.githubusercontent.com/Deborah-Jia/Complete_Analysis_da2/main/eg1.csv")
data[data == 2147483647 | data == 9223372036854775807] <- NA
data
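
To make the difference concrete, here is a minimal sketch on a made-up vector (the values in x are illustrative, not taken from the asker's file). It shows that `x == a | b` is parsed as `(x == a) | b`, and a non-zero number on the right-hand side is coerced to TRUE, so the condition matches every row.

```r
# Toy vector; the values are invented for illustration only
x <- c(1, 2147483647, 5)

# Parsed as (x == 9223372036854775807) | 2147483647.
# The non-zero number is coerced to TRUE, so every element matches.
always_true <- x == 9223372036854775807 | 2147483647

# Each value gets its own comparison, so only the real match is TRUE
explicit <- x == 9223372036854775807 | x == 2147483647

# %in% is a shorter equivalent when testing against several values
with_in <- x %in% c(9223372036854775807, 2147483647)

print(always_true)  # TRUE TRUE TRUE
print(explicit)     # FALSE TRUE FALSE
print(with_in)      # FALSE TRUE FALSE
```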

An approach using set:

values <- c(9223372036854775807, 2147483647)
for(col in names(data)) set(data, i = which(data[[col]] %in% values), j = col, value = NA_integer_)
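
If you would rather keep the `data[i, j]` form inside a loop, one possible sketch is to look the column up with get() in i and wrap the name in parentheses on the left of :=, so data.table reads it as a variable holding a column name. The table dt and the vector target_cols below are made up for illustration; they are not the asker's data.

```r
library(data.table)

# Toy table; column names and values are invented for illustration
dt <- data.table(
  price = c(100, 2147483647, 300),
  size  = c(9223372036854775807, 50, 60),
  city  = c("a", "b", "c")
)

values <- c(9223372036854775807, 2147483647)
target_cols <- c("price", "size")  # hypothetical list of columns to clean

for (col in target_cols) {
  # get(col) fetches the column named by the string in `col`;
  # (col) := ... assigns to that column rather than to one called "col"
  dt[get(col) %in% values, (col) := NA_real_]
}

print(dt)
```

Note that this assumes no column is itself named col; if one exists, get(col) would pick up the column instead of the loop variable.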

Note that 9223372036854775807 is not "really" in your data but is the way that integer64 sometimes represents NAs.

You can just replace the values in the integer columns.
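
As a side note on why the error message in the question prints ...808 instead of ...807: a plain numeric literal in R is stored as a double, which cannot hold 9223372036854775807 exactly, so it is rounded to 2^63. A small sketch (the variable name big is just for illustration):

```r
# A numeric literal this large is stored as a double and rounded to 2^63
big <- 9223372036854775807

print(big == 2^63)                 # TRUE
print(big == 9223372036854775808)  # TRUE: both literals become the same double
```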

data <- fread("https://raw.githubusercontent.com/Deborah-Jia/Complete_Analysis_da2/main/eg1.csv", integer64 = "numeric")
for (j in seq_along(data)) {
  vj <- .subset2(data, j)
  if (is.integer(vj)) {
    i <- which(vj == .Machine$integer.max)
    set(data, i = i, j = j, value = NA_integer_)
  }
}

Note that 2147483647 might be there to represent "positive infinity" using integers, and so might be a better representation than NA. (For example, if you want to filter on all properties above a certain price, these properties will be erroneously filtered out if you replace these values with NA.)
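
A minimal sketch of that filtering effect, using an invented price vector rather than the real data: with the sentinel kept, the "unbounded" property passes a price threshold; once it is replaced by NA, the comparison yields NA and the property no longer counts as a match.

```r
# Invented prices; the last one uses the integer maximum as a sentinel
price <- c(100L, 500L, .Machine$integer.max)

# With the sentinel kept, two properties are above 200
n_with_sentinel <- sum(price > 200L)

# After replacing the sentinel with NA, the comparison is NA for that row
price_na <- replace(price, price == .Machine$integer.max, NA_integer_)
n_with_na <- sum(price_na > 200L, na.rm = TRUE)

print(n_with_sentinel)  # 2
print(n_with_na)        # 1
```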
