
Calculate Mean of Multiple Columns with Condition in R

I want to calculate the mean of several variables, but with a condition: if 2 of those columns are NA, the mean should be NA; if fewer than 2 are NA, the mean should be computed.

df <- data.frame(ID = 1:10, X1 = c(rep(1, 5), rep(2, 5)), X2 = 1:10,
                 X3 = c(1, NA, 2, NA, NA, 1, NA, 2, NA, NA), X4 = rep(NA, 10), X5 = c(rep(1, 5), rep(NA, 5)),
                 Y1 = c(rep(1, 5), rep(2, 5)), Y2 = 1:10,
                 Y3 = c(1, NA, 2, NA, NA, 1, NA, 2, NA, NA), Y4 = rep(NA, 10), Y5 = c(rep(1, 5), rep(NA, 5)))

MeanX = round(apply(df[, 2:6], 1, mean, na.rm = TRUE), 2)
MeanY = round(apply(df[, 7:11], 1, mean, na.rm = TRUE), 2)

This is the output; it is incorrect:

   ID X1 X2 X3 X4 X5 Y1 Y2 Y3 Y4 Y5 MeanX MeanY
1   1  1  1  1 NA  1  1  1  1 NA  1  1.00  1.00
2   2  1  2 NA NA  1  1  2 NA NA  1  1.33  1.33*
3   3  1  3  2 NA  1  1  3  2 NA  1  1.75  1.75
4   4  1  4 NA NA  1  1  4 NA NA  1  2.00  2.00*
5   5  1  5 NA NA  1  1  5 NA NA  1  2.33  2.33*
6   6  2  6  1 NA NA  2  6  1 NA NA  3.00  3.00*
7   7  2  7 NA NA NA  2  7 NA NA NA  4.50  4.50 *
8   8  2  8  2 NA NA  2  8  2 NA NA  4.00  4.00 *
9   9  2  9 NA NA NA  2  9 NA NA NA  5.50  5.50 *
10 10  2 10 NA NA NA  2 10 NA NA NA  6.00  6.00 * This is supposed to be NA, because 3 of the columns are NA

Because I have a large dataset, the threshold changes from group to group: sometimes I have to set it to 6 out of 20 columns, sometimes 1 out of 10, in order to calculate the mean. How can I set the condition for this case?

Here is a VERY quick (have to run) and dirty solution with data.table. But I believe it can be cleaned and built upon to make something that is neat and works well.

# Load data.table
require(data.table)
setDT(df)

# Convert all measurement columns to numeric (X4/Y4 are logical since
# they are all NA), otherwise mean is not meaningful (see what I did there?)
x.cols <- paste("X", 1:5, sep = "")
y.cols <- paste("Y", 1:5, sep = "")
df[, (x.cols) := lapply(.SD, as.numeric), .SDcols = x.cols]
df[, (y.cols) := lapply(.SD, as.numeric), .SDcols = y.cols]

# meanX: compute the row mean first, then set it to NA
# wherever more than 2 of the X columns are NA
df[, meanX := mean(c(X1, X2, X3, X4, X5), na.rm = TRUE), by = ID]
df[df[, sum(is.na(c(X1, X2, X3, X4, X5))) > 2, by = ID]$V1, meanX := NA]

# meanY: same, for the Y columns
df[, meanY := mean(c(Y1, Y2, Y3, Y4, Y5), na.rm = TRUE), by = ID]
df[df[, sum(is.na(c(Y1, Y2, Y3, Y4, Y5))) > 2, by = ID]$V1, meanY := NA]

# Result
df

    ID X1 X2 X3 X4 X5 Y1 Y2 Y3 Y4 Y5    meanX    meanY
 1:  1  1  1  1 NA  1  1  1  1 NA  1 1.000000 1.000000
 2:  2  1  2 NA NA  1  1  2 NA NA  1 1.333333 1.333333
 3:  3  1  3  2 NA  1  1  3  2 NA  1 1.750000 1.750000
 4:  4  1  4 NA NA  1  1  4 NA NA  1 2.000000 2.000000
 5:  5  1  5 NA NA  1  1  5 NA NA  1 2.333333 2.333333
 6:  6  2  6  1 NA NA  2  6  1 NA NA 3.000000 3.000000
 7:  7  2  7 NA NA NA  2  7 NA NA NA       NA       NA
 8:  8  2  8  2 NA NA  2  8  2 NA NA 4.000000 4.000000
 9:  9  2  9 NA NA NA  2  9 NA NA NA       NA       NA
10: 10  2 10 NA NA NA  2 10 NA NA NA       NA       NA
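
The same rule can also be written more compactly with rowMeans() and rowSums(is.na(...)) over the column subsets. This is only a sketch of one possible cleanup; it reuses the x.cols/y.cols vectors from above and simply recomputes the same meanX/meanY columns with the same "more than 2 NAs" threshold:

# Vectorised variant of the same logic: row mean ignoring NAs,
# then blank it out where more than 2 of the columns are NA
df[, meanX := rowMeans(.SD, na.rm = TRUE), .SDcols = x.cols]
df[rowSums(is.na(df[, ..x.cols])) > 2, meanX := NA]

df[, meanY := rowMeans(.SD, na.rm = TRUE), .SDcols = y.cols]
df[rowSums(is.na(df[, ..y.cols])) > 2, meanY := NA]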

Here is a base R solution.

I think this is conceptually easier if you first go to long format, e.g.:

long <- reshape(df, idvar='ID', varying=colnames(df)[-1], timevar='t', sep='', direction='long')

which moves the variable subscripts into a variable t. It looks like this:

> str(long)
'data.frame':   50 obs. of  4 variables:
 $ ID: int  1 2 3 4 5 6 7 8 9 10 ...
 $ t : num  1 1 1 1 1 1 1 1 1 1 ...
 $ X : num  1 1 1 1 1 2 2 2 2 2 ...
 $ Y : num  1 1 1 1 1 2 2 2 2 2 ...
 - attr(*, "reshapeLong")=List of 4
  ..$ varying:List of 2
  .. ..$ X: chr  "X1" "X2" "X3" "X4" ...
  .. ..$ Y: chr  "Y1" "Y2" "Y3" "Y4" ...
  .. ..- attr(*, "v.names")= chr  "X" "Y"
  .. ..- attr(*, "times")= num  1 2 3 4 5
  ..$ v.names: chr  "X" "Y"
  ..$ idvar  : chr "ID"
  ..$ timevar: chr "t"

Then you can write an aggregate function fairly naturally based on your description. This one matches @snoram:

f <- function(x) if( sum(is.na(x)) > 2 ) NA else mean(x, na.rm=TRUE)

Note that the default behavior of aggregate is to skip NAs, but you can change that option:

aggregate(cbind(meanx = X, meany = Y) ~ ID, long, f, na.action = na.pass)

which gives:

   ID       meanx       meany
1   1 1.000000000 1.000000000
2   2 1.333333333 1.333333333
3   3 1.750000000 1.750000000
4   4 2.000000000 2.000000000
5   5 2.333333333 2.333333333
6   6 3.000000000 3.000000000
7   7          NA          NA
8   8 4.000000000 4.000000000
9   9          NA          NA
10 10          NA          NA

You can then cbind this back on to your original data frame if you like.
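
For example (a sketch; res is just a name for the stored aggregate result, and merge() on ID is the safer option if the row order might differ):

res <- aggregate(cbind(meanx = X, meany = Y) ~ ID, long, f, na.action = na.pass)

# Rely on matching row order ...
df2 <- cbind(df, res[, c("meanx", "meany")])

# ... or, more defensively, join on ID:
df2 <- merge(df, res, by = "ID")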

The advantage of this approach is that it should easily deal with X6, X7, and so on if you have those also.
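
For instance, with hypothetical X6/Y6 columns (not part of the question's data) added to a copy of the original df, the reshape call is unchanged because varying is taken from the column names:

df_ext <- df
df_ext$X6 <- seq_len(nrow(df_ext))   # hypothetical extra columns, for illustration only
df_ext$Y6 <- seq_len(nrow(df_ext))

long_ext <- reshape(df_ext, idvar = 'ID', varying = colnames(df_ext)[-1],
                    timevar = 't', sep = '', direction = 'long')
# long_ext now has one row per ID and time point (10 x 6 = 60 rows)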

EDIT:

Rereading your question, you might be better off tracking the mean and the number of NAs separately, then post-processing. Here is a quick and dirty example of doing so:

> f <- function(x) c(sum(is.na(x)), mean(x, na.rm = TRUE))
> agg <- aggregate(cbind(meanx = X, meany = Y) ~ ID, long, f, simplify = FALSE, na.action = na.pass)
> agg
   ID                    meanx                    meany
1   1                     1, 1                     1, 1
2   2 2.000000000, 1.333333333 2.000000000, 1.333333333
3   3               1.00, 1.75               1.00, 1.75
4   4                     2, 2                     2, 2
5   5 2.000000000, 2.333333333 2.000000000, 2.333333333
6   6                     2, 3                     2, 3
7   7                 3.0, 4.5                 3.0, 4.5
8   8                     2, 4                     2, 4
9   9                 3.0, 5.5                 3.0, 5.5
10 10                     3, 6                     3, 6
> g <- function(x, i) if (x[1] <= i) x[2] else NA
> mapply(lapply, agg[2:3], list(g), c(2, 1))
   meanx       meany
01 1           1    
02 1.333333333 NA   
03 1.75        1.75 
04 2           NA   
05 2.333333333 NA   
06 3           NA   
07 NA          NA   
08 4           NA   
09 NA          NA   
10 NA          NA   

That way, you can specify different numbers of NAs allowed for different columns. Apologies for nested applies.
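
A slightly tidier way to do the same post-processing (a sketch; max_na is an illustrative named vector of per-column NA allowances, using the same thresholds of 2 and 1 as above, and it returns a plain data frame instead of the nested list output):

# Allowances per aggregated column; adjust per group as needed
max_na <- c(meanx = 2, meany = 1)

# Keep the mean (second element) only if the NA count (first element)
# is within the allowance; otherwise return NA
cleaned <- sapply(names(max_na), function(col)
  sapply(agg[[col]], function(x) if (x[1] <= max_na[[col]]) x[2] else NA))

data.frame(ID = agg$ID, cleaned)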
