How to select columns of a data frame in R based on the number of valid (non-NA) values?

Question

I'm using R, and I have a data frame with multiple columns. I want to run code that automatically counts the valid (non-NA) values in each column. It should then select the columns in which at least 50% of the rows are filled with valid values, and save them in a new data frame.

Can anybody help me with this? Thank you very much.

Is there a way to make the code work for an arbitrary number of columns?

Answer 1

Using the purrr package, you can compute the proportion of missing values in each column:

pct_missing <- purrr::map_dbl(df,~mean(is.na(.x)))

After that, you can pick out, by name, the columns in which less than 50% of the values are missing.

selected_column <- colnames(df)[pct_missing < 0.5]

To create a new data frame, you may use:

library(dplyr)
# all_of() supersedes one_of() and errors if a listed column is absent
df_new <- df %>% select(all_of(selected_column))
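As a possible shortcut, the three steps above can be collapsed into a single call with `where()`, which dplyr (version 1.0.0 or later) re-exports from tidyselect. The sketch below is only an illustration: the small data frame `df` is made up for the example and is not from the question.

```r
library(dplyr)

# Hypothetical toy data: a is 20% NA, b is 60% NA, c is entirely NA
df <- data.frame(
    a = c(1, 2, NA, 4, 5),
    b = c(NA, NA, NA, 1, 2),
    c = rep(NA, 5)
)

# where() applies the predicate to every column and keeps those returning TRUE,
# so this works for any number of columns
df_new <- df %>% select(where(~ mean(is.na(.x)) < 0.5))

names(df_new)
# "a"
```

Because the predicate is evaluated column by column, nothing has to be changed when columns are added or removed.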

Answer 2

You can also write a function in base R that automatically retrieves the columns matching the criterion:

Function:

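ColSel <- function(df){
    # TRUE for columns whose share of NA values is below 50%
    vals <- colMeans(is.na(df)) < .5
    # drop = FALSE keeps a data frame even if only one column qualifies
    return(df[, vals, drop = FALSE])
}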

Some toy data:

## example
df1 <- data.frame(
    a = c(runif(19),NA),
    b = c(rep(NA,11),runif(9)),
    d = rep(NA,20),
    e = runif(20)
    )

Test:

df2 <- ColSel(df1)
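The question also asks for the number of valid values per column. A minimal base R sketch of that count, rebuilt on the same toy data so it runs on its own, might look like the following. The name `valid_counts` and the `set.seed()` call are additions for illustration, and the `>=` comparison assumes a column with exactly 50% valid values should be kept, which differs slightly from the strict `< .5` test used above.

```r
set.seed(1)  # only makes the random toy values reproducible
df1 <- data.frame(
    a = c(runif(19), NA),
    b = c(rep(NA, 11), runif(9)),
    d = rep(NA, 20),
    e = runif(20)
)

# Number of valid (non-NA) values in each column
valid_counts <- colSums(!is.na(df1))
valid_counts
# a: 19, b: 9, d: 0, e: 20

# Keep the columns in which at least half of the rows are valid
df2 <- df1[, valid_counts >= nrow(df1) / 2, drop = FALSE]
names(df2)
# "a" "e"
```

With 20 rows the threshold is 10 valid values, so only `a` and `e` are retained.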
