
How to remove all columns that contain more than 2000 NA values?

I looked up a similar example, which used:

## Some sample data
set.seed(0)
dat <- matrix(1:100, 10, 10)
dat[sample(1:100, 50)] <- NA
dat <- data.frame(dat)
## Remove columns with more than 50% NA
dat[, -which(colMeans(is.na(dat)) > 0.5)]

But I am not sure how to change the condition so that it uses a count of NA values rather than a percentage.

One base R option could be:

dat[, colMeans(is.na(dat)) <= 0.5]

   X1 X2 X4 X5 X6 X8 X10
1  NA 11 NA NA NA 71  NA
2  NA 12 32 NA 52 72  NA
3   3 NA 33 NA 53 73  93
4   4 14 NA 44 NA NA  94
5   5 15 35 NA 55 75  95
6  NA NA 36 46 NA 76  NA
7  NA NA NA 47 57 NA  97
8   8 18 NA 48 NA 78  98
9   9 NA 39 NA 59 79  99
10 NA NA 40 50 NA 80 100
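Indexing with a logical vector, as above, also sidesteps a pitfall of the `-which(...)` form used in the question: when no column exceeds the threshold, `which()` returns an empty vector, and negating it selects zero columns instead of all of them. A minimal sketch on a toy data frame (`df` is a made-up example, not the asker's data):

```r
# Toy data: no column has more than 2000 NAs, so nothing should be dropped
df <- data.frame(x = c(1, NA, 3), y = c(NA, NA, 6))

too_many <- which(colSums(is.na(df)) > 2000)  # empty: no column qualifies

# Negating an empty index selects no columns at all
bad <- df[, -too_many, drop = FALSE]
ncol(bad)   # 0

# A logical index keeps every column, as intended
good <- df[, colSums(is.na(df)) <= 2000, drop = FALSE]
ncol(good)  # 2
```

`drop = FALSE` is added here so that the result stays a data frame even if only one column survives.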

Or using a specified number:

dat[, colSums(is.na(dat)) <= 5]

Or using half of the rows as the criterion:

dat[, colSums(is.na(dat)) <= nrow(dat)/2]
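For the threshold in the question title, the same count-based filter can be wrapped in a small function. This is only a sketch: `drop_na_heavy_cols` and the data frame `big` are hypothetical names invented for illustration.

```r
# Hypothetical helper: keep columns whose NA count is at most `max_na`
drop_na_heavy_cols <- function(df, max_na) {
  na_counts <- colSums(is.na(df))          # number of NAs per column
  df[, na_counts <= max_na, drop = FALSE]  # drop = FALSE keeps a data frame
}

set.seed(1)
big <- data.frame(
  a = c(rep(NA, 2500), rnorm(500)),   # 2500 NAs: removed
  b = c(rep(NA, 2000), rnorm(1000)),  # exactly 2000 NAs: kept
  c = rnorm(3000)                     # no NAs: kept
)

kept <- drop_na_heavy_cols(big, 2000)
names(kept)  # "b" "c"
```

Note that "more than 2000" means a column with exactly 2000 NA values is kept, which is why the comparison is `<=`.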

And the same idea with dplyr:

library(dplyr)

dat %>%
  select_if(~ mean(is.na(.)) <= 0.5)

Or using a specified number:

dat %>%
 select_if(~ sum(is.na(.)) <= 5)

Similarly, using half of the rows as the criterion:

dat %>%
 select_if(~ sum(is.na(.)) <= length(.)/2)
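In dplyr 1.0.0 and later, the same filter can also be written with `select()` and `where()`. The sketch below assumes such a version is installed and reuses the sample data from the question; `max_na` is a name introduced here for the threshold.

```r
library(dplyr)

# Sample data from the question
set.seed(0)
dat <- matrix(1:100, 10, 10)
dat[sample(1:100, 50)] <- NA
dat <- data.frame(dat)

max_na <- 5  # assumed threshold; use 2000 for the case in the title

# Keep columns whose NA count does not exceed max_na
kept <- dat %>%
  select(where(~ sum(is.na(.x)) <= max_na))
```

The result should contain the same columns as the base R expression `dat[, colSums(is.na(dat)) <= max_na]`.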

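Or you can count them directly:

## A logical index is used instead of -which(), which would drop every column if none exceeded the limit
dat[, colSums(is.na(dat)) <= 2000]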

Using purrr:

purrr::discard(dat, ~sum(is.na(.x)) > 5)
   X1 X2 X3 X5 X6 X7 X8
1  NA 11 NA 41 NA 61 71
2  NA 12 NA NA 52 62 NA
3   3 13 23 NA 53 63 NA
4   4 NA NA NA NA NA NA
5   5 15 NA NA 55 65 NA
6  NA 16 26 46 56 66 76
7  NA 17 27 47 57 67 77
8   8 NA NA 48 58 NA 78
9   9 19 29 49 NA NA NA
10 10 NA 30 50 60 NA 80

Alternatively:

purrr::keep(dat, ~sum(is.na(.x)) <= 5)
   X1 X2 X3 X5 X6 X7 X8
1  NA 11 NA 41 NA 61 71
2  NA 12 NA NA 52 62 NA
3   3 13 23 NA 53 63 NA
4   4 NA NA NA NA NA NA
5   5 15 NA NA 55 65 NA
6  NA 16 26 46 56 66 76
7  NA 17 27 47 57 67 77
8   8 NA NA 48 58 NA 78
9   9 19 29 49 NA NA NA
10 10 NA 30 50 60 NA 80

I multiplied it by 100 to keep it as a percentage. For your case it should look like this:

## Keep only the columns whose share of NA values is not greater than 50%

dat <- dat[colMeans(is.na(dat)) * 100 <= 50]

