
Remove columns from dataframe where ALL values are NA

I have a data frame where some of the columns contain NA values.

How can I remove columns where all rows contain NA values?

Try this:

df <- df[,colSums(is.na(df))<nrow(df)]
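As a minimal sketch, here is the one-liner above applied to a small made-up data frame (the names keep, vec and out are mine, not from the answer). It also shows a standard base R behaviour worth keeping in mind: if only one column survives, df[, keep] collapses to a plain vector unless you pass drop = FALSE.

```r
# Toy data: column b is entirely NA, column a is only partly NA
df <- data.frame(a = c(1, NA, 3), b = NA)

# TRUE for columns that have at least one non-NA value
keep <- colSums(is.na(df)) < nrow(df)

vec <- df[, keep]                 # one column left, so this drops to a vector
out <- df[, keep, drop = FALSE]   # stays a one-column data frame
```

If the result is passed on to code that expects a data frame, the drop = FALSE form is probably the safer choice.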

The two approaches offered thus far fail with large data sets because (amongst other memory issues) they create is.na(df), which is an object the same size as df.

Here are two approaches that are more memory- and time-efficient.

An approach using Filter:

Filter(function(x)!all(is.na(x)), df)

and an approach using data.table (for general time and memory efficiency):

library(data.table)
DT <- as.data.table(df)
DT[,which(unlist(lapply(DT, function(x)!all(is.na(x))))),with=F]

Examples using large data (30 columns, 1e6 rows):

big_data <- replicate(10, data.frame(rep(NA, 1e6), sample(c(1:8,NA),1e6,T), sample(250,1e6,T)),simplify=F)
bd <- do.call(data.frame,big_data)
names(bd) <- paste0('X',seq_len(30))
DT <- as.data.table(bd)

system.time({df1 <- bd[,colSums(is.na(bd)) < nrow(bd)]})
# error -- can't allocate vector of size ...
system.time({df2 <- bd[, !apply(is.na(bd), 2, all)]})
# error -- can't allocate vector of size ...
system.time({df3 <- Filter(function(x)!all(is.na(x)), bd)})
## user  system elapsed 
## 0.26    0.03    0.29 
system.time({DT1 <- DT[,which(unlist(lapply(DT, function(x)!all(is.na(x))))),with=F]})
## user  system elapsed 
## 0.14    0.03    0.18 
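The memory argument above can be sketched on a tiny, made-up data frame (all names here are for illustration only). The whole-frame test materialises a logical matrix with as many cells as the data, whereas the column-wise test only ever holds one column's worth of logicals and returns a single TRUE/FALSE per column.

```r
df <- data.frame(x = c(1, NA), y = c(NA, NA), z = c("a", "b"))

# Whole-frame test: builds a logical matrix with one cell per cell of df
m <- is.na(df)
dim(m)

# Column-wise test: is.na() runs on one column at a time,
# and the result is one logical per column
all_na <- vapply(df, function(x) all(is.na(x)), logical(1))

# List-style indexing keeps the result a data frame
res <- df[!all_na]
```

This is only meant to show why the column-wise versions scale better; it is not a benchmark.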

Update

You can now use select with the where selection helper. select_if is superseded, but still functional as of dplyr 1.0.2. (Thanks to @mcstrother for bringing this to attention.)

library(dplyr)
temp <- data.frame(x = 1:5, y = c(1,2,NA,4, 5), z = rep(NA, 5))
not_all_na <- function(x) any(!is.na(x))
not_any_na <- function(x) all(!is.na(x))

> temp
  x  y  z
1 1  1 NA
2 2  2 NA
3 3 NA NA
4 4  4 NA
5 5  5 NA

> temp %>% select(where(not_all_na))
  x  y
1 1  1
2 2  2
3 3 NA
4 4  4
5 5  5

> temp %>% select(where(not_any_na))
  x
1 1
2 2
3 3
4 4
5 5

Old Answer

dplyr now has a select_if verb that may be helpful here:

> temp
  x  y  z
1 1  1 NA
2 2  2 NA
3 3 NA NA
4 4  4 NA
5 5  5 NA

> temp %>% select_if(not_all_na)
  x  y
1 1  1
2 2  2
3 3 NA
4 4  4
5 5  5

> temp %>% select_if(not_any_na)
  x
1 1
2 2
3 3
4 4
5 5

Another way would be to use the apply() function.

If you have the data.frame

df <- data.frame (var1 = c(1:7,NA),
                  var2 = c(1,2,1,3,4,NA,NA,9),
                  var3 = c(NA)
                  )

then you can use apply() to see which columns fulfill your condition, and so you can simply do the same subsetting as in the answer by Musa, only with an apply approach.

> !apply (is.na(df), 2, all)
 var1  var2  var3 
 TRUE  TRUE FALSE 

> df[, !apply(is.na(df), 2, all)]
  var1 var2
1    1    1
2    2    2
3    3    1
4    4    3
5    5    4
6    6   NA
7    7   NA
8   NA    9

Late to the game, but you can also use the janitor package. This function removes columns that are all NA, and can be changed to remove rows that are all NA as well.

df <- janitor::remove_empty(df, which = "cols")

df[sapply(df, function(x) all(is.na(x)))] <- NULL

Another option with the purrr package:

library(dplyr)

df <- data.frame(a = NA,
                 b = seq(1:5), 
                 c = c(rep(1, 4), NA))

df %>% purrr::discard(~all(is.na(.)))
df %>% purrr::keep(~!all(is.na(.)))

The accepted answer does not work with non-numeric columns. From this answer, the following works with columns containing different data types:

Filter(function(x) !all(is.na(x)), df)
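A small sketch of the Filter call on mixed column types, using a made-up data frame (the column names are hypothetical). Filter applies the predicate to each column in turn, so the column's type does not matter as long as is.na() is defined for it.

```r
df <- data.frame(
  id    = 1:3,                                  # integer
  label = c("a", NA, "c"),                      # character, partly NA
  when  = as.Date(c("2020-01-01", NA, NA)),     # Date, partly NA
  empty = NA_character_                         # character, all NA
)

# Keep every column that has at least one non-NA value
res <- Filter(function(x) !all(is.na(x)), df)
names(res)
```

Only the empty column is dropped; the partly missing columns are kept, and the result is still a data frame.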

You can use remove_empty from the janitor package:

library(janitor)

df %>%
  remove_empty(c("rows", "cols")) #select either row or cols or both

Also, another dplyr approach:

library(dplyr)
df %>% select_if(~all(!is.na(.)))

OR

df %>% select_if(colSums(!is.na(.)) == nrow(df))

This is also useful if you only want to exclude or keep columns with a certain number of missing values, e.g.

df %>% select_if(colSums(!is.na(.))>500)
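The same thresholding idea can be sketched in base R. The data and the threshold min_non_missing below are made up for illustration; the answer above uses 500 on a much larger data set.

```r
df <- data.frame(
  a = c(1, 2, 3, 4),     # 4 non-missing values
  b = c(1, 2, NA, NA),   # 2 non-missing values
  c = NA                 # 0 non-missing values
)

min_non_missing <- 2  # hypothetical threshold

# Number of non-missing values per column
non_missing <- colSums(!is.na(df))

# Keep columns that meet the threshold; drop = FALSE keeps a data frame
kept <- df[, non_missing >= min_non_missing, drop = FALSE]
```

Setting the threshold to 1 reproduces the "not all NA" rule, and setting it to nrow(df) keeps only complete columns.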

I hope this may also help. It could be made into a single command, but I found it easier to read when split into two commands. I wrapped the following instructions in a function, and it worked lightning fast.

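naColsRemoval <- function(DataTable) {
  # Positions of the columns in which every value is NA
  na.cols <- DataTable[, .(which(apply(is.na(.SD), 2, all)))]
  # Drop those columns by reference (parenthesised LHS instead of the deprecated with = F)
  DataTable[, (unlist(na.cols)) := NULL]
}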

.SD lets you limit the check to part of the table, if you wish, but it will take the whole table as

A convenient base R option could be colMeans():

df[, colMeans(is.na(df)) != 1]

Having had trouble applying the previous answers, I found that I needed to modify their approach in order to achieve what the question here asks:

How to get rid of columns where for ALL rows the value is NA?

First, note that my solution will only work if you do not have duplicate columns (that issue is dealt with here, on Stack Overflow).

Second, it uses dplyr.

Instead of

df <- df %>% select_if(~all(!is.na(.)))

I find that what works is

df <- df %>% select_if(~!all(is.na(.)))

The point is that the "not" symbol "!" needs to be on the outside of the universal quantifier. That is, the select_if operator acts on columns, and in this case it selects only those that do not satisfy the criterion

every element is equal to "NA"
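The difference between the two placements of "!" can be sketched in base R on three made-up columns (complete, partial and empty are hypothetical names), without needing dplyr:

```r
cols <- list(
  complete = c(1, 2, 3),     # no NA at all
  partial  = c(1, NA, 3),    # some NA
  empty    = c(NA, NA, NA)   # all NA
)

# "!" inside: TRUE only when the column has no NA at all
no_missing <- vapply(cols, function(x) all(!is.na(x)), logical(1))

# "!" outside: TRUE unless every value in the column is NA
not_all_na <- vapply(cols, function(x) !all(is.na(x)), logical(1))
```

The two predicates disagree only on the partial column: the first would drop it, the second keeps it, which is what the question asks for.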

An old question, but I think we can update @mnel's nice answer with a simpler data.table solution:

DT[, .SD, .SDcols = \(x) !all(is.na(x))]

(I'm using the new \(x) lambda function syntax available in R >= 4.1, but really the key thing is to pass the logical subsetting through .SDcols.)

Speed is equivalent.

microbenchmark::microbenchmark(
  which_unlist  = DT[,which(unlist(lapply(DT, \(x) !all(is.na(x))))),with=F],
  sdcols = DT[, .SD, .SDcols = \(x) !all(is.na(x))],
  times = 2
)
#> Unit: milliseconds
#>          expr      min       lq     mean   median       uq      max neval cld
#>  which_unlist 51.32227 51.32227 56.78501 56.78501 62.24776 62.24776     2   a
#>        sdcols 43.14361 43.14361 49.33491 49.33491 55.52621 55.52621     2   a

janitor::remove_constant() does the job well.

