
How can I remove observations from a data frame conditionally without losing NA values in R?

In the data frame there is a variable called YOB. As you can see, it has 333 NA values.

> summary(train$YOB)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   1880    1970    1983    1980    1993    2039     333 

I identified some outliers and want to get rid of them: any value less than 1900 or greater than 2003 should be removed. I tried to do this by indexing.

train = train[which(train$YOB >= 1900 & train$YOB <= 2003),]

Unfortunately, observations whose YOB is NA are removed as well.

> summary(train$YOB)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1900    1970    1983    1980    1993    2003 

On a side note, I face the same problem when using the subset command.

> train = subset(train, YOB >= 1900 & YOB <= 2003)
> summary(train$YOB)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1900    1970    1983    1980    1993    2003 

I also tried adding an NA check to the condition in both attempts, but with no success, e.g.

> train = train[which(!is.na(train$YOB) & train$YOB >= 1900 & train$YOB <= 2003),]
> summary(train$YOB)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1900    1970    1983    1980    1993    2003 

I would like to keep the observations that have NA in the YOB variable and remove only those whose numeric value is out of range. The idea is to impute the missing values in a second step.

which returns the numeric indices of the TRUE elements only, so it skips every row where the condition is NA. To avoid that, use the logical index without wrapping it in which. On its own, though, the index is then NA for those rows, and each one is returned as a row that is NA in every column, even if its other values are non-NA.

res1 <- train[train$YOB >= 1900 & train$YOB <= 2003,]
res1[is.na(res1$YOB),]
#   YOB col2
#NA  NA   NA

The correct way is to add another condition with is.na:

res2 <- train[is.na(train$YOB)| (train$YOB >= 1900 & train$YOB <= 2003),]
res2[is.na(res2$YOB),]
#   YOB      col2
#42  NA 0.2258094
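Since the question also mentions subset, the same condition can be sketched there too. This is a minimal example on made-up data: the toy data frame and the name kept are hypothetical and are not the OP's train. subset treats an NA condition as FALSE, so the explicit is.na test is what keeps those rows.

```r
# Hypothetical toy data: one NA, two outliers, two valid years
toy <- data.frame(YOB  = c(NA, 1880, 1970, 2039, 1993),
                  col2 = c(0.1, 0.2, 0.3, 0.4, 0.5))

# subset() drops rows where the condition evaluates to NA,
# so the NA rows have to be requested explicitly with is.na()
kept <- subset(toy, is.na(YOB) | (YOB >= 1900 & YOB <= 2003))

# Rows 1, 3 and 5 remain: the NA row and the two in-range years
kept
```

The outliers 1880 and 2039 are dropped, while the NA row keeps its original col2 value.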

Using a simple example:

set.seed(25)
d1 <- data.frame(v1 = c(NA, 1, 5), v2 = rnorm(3))
d1$v1 >1
#[1]    NA FALSE  TRUE

Here, the NA value remains as such. If we use which

which(d1$v1 >1)
#[1] 3

we get only the index of the TRUE values. According to the OP, both the NA rows and the rows that satisfy the logical condition should be returned. In that case,

d1[is.na(d1$v1)|d1$v1 > 1,]
# v1         v2
#1 NA -0.2118336
#3  5 -1.1533076
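The reason the extra is.na term works comes down to R's three-valued logic: TRUE | NA is TRUE, so the NA produced by the comparison never reaches the index. A small sketch, where the vector x and the names cmp, keep and drop_na are made up for illustration:

```r
x <- c(NA, 1, 5)             # hypothetical values, mirroring d1$v1

cmp  <- x > 1                # NA FALSE TRUE
keep <- is.na(x) | cmp       # TRUE FALSE TRUE: no NA left in the index

# By contrast, combining with !is.na() and & turns the NA position
# into FALSE, which is why that variant still drops the NA rows.
drop_na <- !is.na(x) & cmp   # FALSE FALSE TRUE
```

Because keep contains no NA, indexing with it returns the original rows rather than all-NA placeholder rows.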

Data

set.seed(29)
train <- data.frame(YOB = sample(c(NA, 1850:2015), 100, replace=TRUE), 
           col2 = rnorm(100))
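As a rough sanity check, the filter can be verified on this sample data by confirming that no NA rows are lost and that no all-NA rows are introduced. The sketch rebuilds train the same way as the data block; the names res, n_na_before and n_na_after are introduced here only for illustration.

```r
set.seed(29)
train <- data.frame(YOB = sample(c(NA, 1850:2015), 100, replace = TRUE),
                    col2 = rnorm(100))

n_na_before <- sum(is.na(train$YOB))

# Keep the NA rows plus the rows whose year lies in [1900, 2003]
res <- train[is.na(train$YOB) | (train$YOB >= 1900 & train$YOB <= 2003), ]

n_na_after <- sum(is.na(res$YOB))

# The NA count is unchanged, and col2 is never NA (no all-NA rows crept in)
stopifnot(n_na_after == n_na_before)
stopifnot(!anyNA(res$col2))

# The remaining non-NA years fall inside the requested bounds
range(res$YOB, na.rm = TRUE)
```

The exact rows drawn by sample depend on the R version, so the check compares counts rather than specific values.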


 