How can I select certain columns in a dataframe based on their number of valid values (except NA) in R?
I'm using R, and I have a dataframe with multiple columns. I want to run some code that automatically checks the number of valid (non-NA) values in each column.
It should then select the columns that have 50% of their rows filled with valid values and save them in a new dataframe.
Can anybody help me do this? Thank you very much.
Is there a way to make the code work for an arbitrary number of columns?
Using the purrr package, you can compute the proportion of missing values in each column:
pct_missing <- purrr::map_dbl(df,~mean(is.na(.x)))
After that, you can select, by name, the columns that have less than 50% missing values:
selected_column <- colnames(df)[pct_missing < 0.5]
To create the new data frame, you may use:
library(dplyr)
# all_of() supersedes one_of() for selecting from a character vector of names
df_new <- df %>% select(all_of(selected_column))
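With dplyr 1.0.0 or later, the steps above can probably be collapsed into a single call using where(). This is only a sketch: the toy data frame and its column names are made up for illustration, and it keeps columns that are at least 50% valid (>=), whereas the code above uses a strict "less than 50% missing" test. Adjust the comparison to whichever boundary you need.

```r
library(dplyr)

# Hypothetical toy data, for illustration only
df <- data.frame(
  a = c(1, 2, NA, 4),       # 75% valid
  b = c(NA, NA, NA, 4),     # 25% valid
  c = c("x", NA, "y", NA)   # exactly 50% valid
)

# Keep every column whose share of non-NA values is at least 0.5
df_new <- df %>% select(where(~ mean(!is.na(.x)) >= 0.5))

names(df_new)
```

Because where() evaluates the predicate on every column, this works for any number of columns without listing their names.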
You can also write a function in base R that automatically retrieves the columns matching the criterion:
Function:
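ColSel <- function(df){
  # TRUE for columns in which fewer than 50% of the values are NA
  vals <- colMeans(is.na(df)) < .5
  # drop = FALSE keeps a data frame even if only one column qualifies
  return(df[, vals, drop = FALSE])
}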
Some toy data:
## example
df1 <- data.frame(
a = c(runif(19),NA),
b = c(rep(NA,11),runif(9)),
d = rep(NA,20),
e = runif(20)
)
Test:
df2 <- ColSel(df1)
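If the 50% cutoff needs to vary, the same idea can be sketched with the threshold as an argument. The function name select_valid_cols and the parameter min_valid are hypothetical, not part of the answer above, and the sketch counts the share of valid values (>= min_valid) rather than the share of missing ones. The toy data is the same as in the example.

```r
# Hypothetical generalisation of ColSel(); min_valid is an assumed parameter name
select_valid_cols <- function(df, min_valid = 0.5) {
  # Share of non-NA values in each column
  valid_share <- colMeans(!is.na(df))
  # drop = FALSE keeps a data frame even if a single column is selected
  df[, valid_share >= min_valid, drop = FALSE]
}

set.seed(1)  # only so the random values are reproducible
df1 <- data.frame(
  a = c(runif(19), NA),          # 19 of 20 valid
  b = c(rep(NA, 11), runif(9)),  # 9 of 20 valid
  d = rep(NA, 20),               # no valid values
  e = runif(20)                  # all valid
)

colMeans(!is.na(df1))               # valid share per column
names(select_valid_cols(df1))       # columns a and e pass the 50% cutoff
names(select_valid_cols(df1, 0.4))  # lowering the cutoff also keeps b
```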