Function to count NA values at each level of a factor

I have this data frame:

set.seed(50)
data <- data.frame(age=c(rep("juv", 10), rep("ad", 10)),
                   sex=c(rep("m", 10), rep("f", 10)),
                   size=c(rep("large", 10), rep("small", 10)),
                   length=rnorm(20),
                   width=rnorm(20),
                   height=rnorm(20))

data$length[sample(1:20, size=8, replace=FALSE)] <- NA
data$width[sample(1:20, size=8, replace=FALSE)] <- NA
data$height[sample(1:20, size=8, replace=FALSE)] <- NA

   age sex  size      length       width      height
1  juv   m large          NA -0.34992735  0.10955641
2  juv   m large -0.84160374          NA -0.41341885
3  juv   m large  0.03299794 -1.58987765          NA
4  juv   m large          NA          NA          NA
5  juv   m large -1.72760411          NA  0.09534935
6  juv   m large -0.27786453  2.66763339  0.49988990
7  juv   m large          NA          NA          NA
8  juv   m large -0.59091244 -0.36212039 -1.65840096
9  juv   m large          NA  0.56874633          NA
10 juv   m large          NA  0.02867454 -0.49068623
11  ad   f small  0.29520677  0.19902339          NA
12  ad   f small  0.55475223 -0.85142228  0.33763747
13  ad   f small          NA          NA -1.96590570
14  ad   f small  0.19573384  0.59724896 -2.32077461
15  ad   f small -0.45554055 -1.09604786          NA
16  ad   f small -0.36285547  0.01909655  1.16695158
17  ad   f small -0.15681338          NA          NA
18  ad   f small          NA          NA          NA
19  ad   f small          NA  0.40618657 -1.33263085
20  ad   f small -0.32342568          NA -0.13883976

I'm trying to write a function that counts the number of NA values in each of length, width and height at each level of the three factors in the data frame. I've tried this:

 exploreMissingValues <- function(dataframe, factors, variables){
  library(plyr)
  Variables <- list(variables)

  llply(Variables, function(x) ddply(dataframe, .(factors), 
                                     summarise, 
                                     number.of.NA=length(x[is.na(x)])))  
}

exploreMissingValues(data, 
                     c("age", "sex", "size"), 
                     c("length", "width", "height"))

...but this gives an error. How can I get this function to return the number of NA values at each level of each factor in the data frame?

Are you looking for something like this?

library(doBy)
summaryBy(length+width+height~age+sex+size,
          data=data,
          FUN=function(x) sum(is.na(x)),
          keep.names=TRUE)
  age sex  size length width height
1  ad   f small      3     4      4
2 juv   m large      5     4      4
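
For comparison, here is a base R sketch that produces the same kind of table without doBy. The data frame `toy` and the helper `count_na` are made up for illustration and are not from the question. One thing to watch: the formula interface of aggregate drops incomplete rows by default (na.action = na.omit), so na.action = na.pass is needed for the NA values to reach the counting function at all.

```r
# Hypothetical toy data, not the data frame from the question
toy <- data.frame(age    = c("juv", "juv", "juv", "ad", "ad", "ad"),
                  sex    = c("m", "m", "m", "f", "f", "f"),
                  length = c(NA, 1.2, NA, 0.3, NA, 0.5),
                  width  = c(0.4, NA, 0.1, 0.2, 0.9, 0.7))

# Hypothetical helper: number of NA values in a vector
count_na <- function(x) sum(is.na(x))

# na.action = na.pass keeps rows that contain NA; the default (na.omit)
# would remove them before count_na ever sees them
na_counts <- aggregate(cbind(length, width) ~ age + sex, data = toy,
                       FUN = count_na, na.action = na.pass)
na_counts
```

Leaving na.action at its default here would silently report zero NA values for every group, which is easy to mistake for a correct result.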

Use aggregate:

nacheck <- function(var, factor)
    aggregate(var, list(factor), function(x) sum(is.na(x)))

nacheck(data$length, data$age)
nacheck(data$length, data$sex)
nacheck(data$length, data$size)

You could also apply this to your data frame, one factor at a time, to get NA counts for all of the dimension measurements at each level of that factor.

apply(data[,c("length","width","height")], 2, nacheck, factor=data$age)
apply(data[,c("length","width","height")], 2, nacheck, factor=data$sex)
apply(data[,c("length","width","height")], 2, nacheck, factor=data$size)
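
Since apply first coerces the data frame to a matrix anyway, another option is to build the NA-indicator matrix directly and sum its rows within each group using base R's rowsum. A minimal sketch on made-up data (the data frame `toy` and its values are assumptions for illustration only):

```r
# Hypothetical toy data, not the data frame from the question
toy <- data.frame(age    = c("juv", "juv", "juv", "ad", "ad", "ad"),
                  length = c(NA, 1.2, NA, 0.3, NA, 0.5),
                  width  = c(0.4, NA, 0.1, 0.2, 0.9, 0.7))

# is.na() on a data frame returns a logical matrix; unary + turns
# TRUE/FALSE into 1/0 so the rows can be summed within each group
na_flags <- +is.na(toy[, c("length", "width")])

# One row per level of age, one column per measurement
na_by_age <- rowsum(na_flags, group = toy$age)
na_by_age
```

The result is a matrix rather than a list of data frames, with the factor levels as row names, which can be convenient when the counts for one factor are wanted in a single table.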

To do all of this in one function, nest nacheck inside a wrapper and then use lapply over the factors:

exploreNA <- function(df, factors){
    nacheck <- function(var, factor)
        aggregate(var, list(factor), function(x) sum(is.na(x)))
    lapply(factors, function(x) apply(df, 2, nacheck, factor=x))
}

exploreNA(data[,c("length","width","height")], list(data$age, data$sex, data$size))
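
The question passed the factors and the variables as character vectors of column names. The sketch below adapts the idea above to that calling style; the function name `exploreNAByName` and the data frame `toy` are hypothetical, introduced only for illustration.

```r
# Hypothetical variant of exploreNA that takes column names instead of vectors
exploreNAByName <- function(df, factors, variables) {
  out <- lapply(factors, function(f) {
    # df[f] is a one-column data frame, so the grouping column keeps its name
    aggregate(df[variables], by = df[f], FUN = function(x) sum(is.na(x)))
  })
  names(out) <- factors  # label each result with the factor it was grouped by
  out
}

# Hypothetical toy data, not the data frame from the question
toy <- data.frame(age    = c("juv", "juv", "juv", "ad", "ad", "ad"),
                  sex    = c("m", "m", "m", "f", "f", "f"),
                  length = c(NA, 1.2, NA, 0.3, NA, 0.5),
                  width  = c(0.4, NA, 0.1, 0.2, 0.9, 0.7))

res <- exploreNAByName(toy, c("age", "sex"), c("length", "width"))
res$age
```

Each list element is a data frame with one row per level of the corresponding factor, so `res$sex` holds the counts grouped by sex in the same layout.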

A data.table approach:

library(data.table)
DT <- data.table(data)
DT[, lapply(.SD, function(x) sum(is.na(x))) , by = list(age,sex,size)]
##    age sex  size length width height
## 1: juv   m large      5     4      4
## 2:  ad   f small      3     4      4

and the plyr equivalent using colwise and ddply:

library(plyr)
ddply(data, .(age,sex,size), colwise(.fun = function(x) sum(is.na(x))))
##   age sex  size length width height
## 1  ad   f small      3     4      4
## 2 juv   m large      5     4      4

You could always use a vector of column names for the by components:

by.cols <- c('age', 'sex' ,'size')
# then the following will work....
DT[, lapply(.SD, function(x) sum(is.na(x))), by = by.cols]
ddply(data, by.cols, colwise(.fun = function(x) sum(is.na(x))))
