How can I remove observations from a data frame conditionally without losing NA values in R?
In the data frame there is a variable called YOB. As you can see, there are 333 NA values.
> summary(train$YOB)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1880 1970 1983 1980 1993 2039 333
I identified some outliers and want to get rid of them. Anything less than 1900 or greater than 2003 should be removed. I tried to do this by indexing.
train = train[which(train$YOB >= 1900 & train$YOB <= 2003),]
Unfortunately, observations whose YOB value was NA were also removed.
> summary(train$YOB)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1900 1970 1983 1980 1993 2003
On a side note, I face the same problem when using the subset command.
> train = subset(train, YOB >= 1900 & YOB <= 2003)
> summary(train$YOB)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1900 1970 1983 1980 1993 2003
I have also tried adding an is.na condition in both attempts, but with no success, e.g.
> train = train[which(!is.na(train$YOB) & train$YOB >= 1900 & train$YOB <= 2003),]
> summary(train$YOB)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1900 1970 1983 1980 1993 2003
I would like to keep the observations that have NA in the YOB variable and remove only the numeric values that fall outside the range. The idea is to impute the missing values in a second step.
The which() function returns the numeric indices of the TRUE elements only, so it skips all the NA rows. To avoid that, use the logical index without wrapping it in which(). The index is then NA for those rows, and each of them comes back as a row made up entirely of NA, even if its other columns hold non-NA values.
res1 <- train[train$YOB >= 1900 & train$YOB <= 2003,]
res1[is.na(res1$YOB),]
# YOB col2
#NA NA NA
The correct way would be to add another condition with is.na:
res2 <- train[is.na(train$YOB)| (train$YOB >= 1900 & train$YOB <= 2003),]
res2[is.na(res2$YOB),]
# YOB col2
#42 NA 0.2258094
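The question also mentions subset, which treats an NA condition as FALSE and so drops those rows. The same is.na condition should carry over to it. Below is a minimal sketch on a made-up data frame (toy and its values are hypothetical, not the OP's train data):

```r
# Hypothetical toy data standing in for train
toy <- data.frame(YOB  = c(NA, 1880, 1950, 2039, 1990, NA),
                  col2 = 1:6)

# subset() drops rows whose condition evaluates to NA,
# so the NA rows have to be requested explicitly with is.na()
kept <- subset(toy, is.na(YOB) | (YOB >= 1900 & YOB <= 2003))

kept$YOB
# [1]   NA 1950 1990   NA
```

Only the two out-of-range years (1880 and 2039) are removed; both NA rows keep their col2 values.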
Using a simple example:
set.seed(25)
d1 <- data.frame(v1 = c(NA, 1, 5), v2 = rnorm(3))
d1$v1 >1
#[1] NA FALSE TRUE
Here, the NA value remains as such. If we use which
which(d1$v1 >1)
#[1] 3
we get only the index of the TRUE values. According to the OP, both the NA rows and the rows that satisfy the logical condition should be returned. In that case,
d1[is.na(d1$v1)|d1$v1 > 1,]
# v1 v2
#1 NA -0.2118336
#3 5 -1.1533076
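The reason this works, while the OP's attempt with !is.na(...) & ... does not, comes down to how R combines NA with logical operators: TRUE | NA is TRUE, whereas FALSE & NA is FALSE. A small sketch with made-up values (the vector yob is hypothetical):

```r
yob <- c(NA, 1880, 1950)   # hypothetical values for illustration

# The range test alone is NA wherever yob is NA
in_range <- yob >= 1900 & yob <= 2003
in_range
# [1]    NA FALSE  TRUE

# Attempt from the question: FALSE & NA is FALSE, so the NA row is dropped
attempt <- !is.na(yob) & in_range
attempt
# [1] FALSE FALSE  TRUE

# OR-ing with is.na(): TRUE | NA is TRUE, so the NA row is kept
keep <- is.na(yob) | in_range
keep
# [1]  TRUE FALSE  TRUE
```

Because keep contains no NA, it gives the same rows whether it is used directly as a logical index or wrapped in which().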
set.seed(29)
train <- data.frame(YOB = sample(c(NA, 1850:2015), 100, replace=TRUE),
col2 = rnorm(100))