简体   繁体   中英

[R]: applying a function to columns based on conditional row position

I am attempting to find the number of observations by column in a data frame that meet a certain condition after the max for that column has been encountered.

Here is a highly simplified example:

fake.dat<-data.frame(samp1=c(5,6,7,5,4,5,10,5,6,7), samp2=c(2,3,4,6,7,9,2,3,7,8), samp3=c(2,3,4,11,7,9,2,3,7,8),samp4=c(5,6,7,5,4,12,10,5,6,7))

       samp1 samp2 samp3 samp4
1      5     2     2     5
2      6     3     3     6
3      7     4     4     7
4      5     6    11     5
5      4     7     7     4
6      5     9     9    12
7     10     2     2    10
8      5     3     3     5
9      6     7     7     6
10     7     8     8     7

So, let's say I'm trying to find the number of observations per column that are greater than 5 after excluding all the observations in a column up to and including the row where the maximum for the column occurs.

Expected outcome:

samp1 samp2 samp3 samp4 
   2     2     4    3 

I am able to get the answer I want by using nested for loops to exclude the observations I don't want.

newfake.dat<-data.frame()

for(j in 1:length(fake.dat)){
for(i in 1:nrow(fake.dat)){
    ifelse(i>max.row[j],newfake.dat[i,j]<-fake.dat[i,j],"NA")
print(newfake.dat)
}}

This creates a new data frame on which I can run an easy apply function.

colcount<-apply(newfake.dat,2,function(x) (sum(x>5,na.rm=TRUE)))

   V1 V2 V3 V4
1  NA NA NA NA
2  NA NA NA NA
3  NA NA NA NA
4  NA NA NA NA
5  NA NA  7 NA
6  NA NA  9 NA
7  NA  2  2 10
8   5  3  3  5
9   6  7  7  6
10  7  8  8  7

V1 V2 V3 V4 
 2  2  4  3 

Which is all well and good for this tiny example dataset, but is prohibitively slow on anything approaching the size of my real datasets. Which are large (2000 x 2000 or larger) and numerous. I tried it with a truncated version of one of my files (fewer columns, but same number of rows) and it ran for at least 5 hours (I left it going when I left work for the day). Also, I don't really need the new dataframe for anything other than to be able to run the apply function.

Is there any way to do this more efficiently? I tried limiting the rows that the apply function works on by using seq and the row number of the max.

maxrow<-apply(fake.dat,2,function(x) which.max(x))
print(maxrow)

seq.att<-apply(fake.dat,2,function(x) {
    sum(x[which(seq(1,nrow(fake.dat))==(maxrow)):nrow(fake.dat)]>5,na.rm=TRUE)})

Which kicks up four instances of this warning message:

1: In seq(1, nrow(fake.dat)) == (maxrow) :
  longer object length is not a multiple of shorter object length

If I ignore the warning message and get the output anyway it doesn't give me the answer I expected:

samp1 samp2 samp3 samp4 
    2     3     3     3 

I also tried using a while function which kept cycling so I stopped it (I misplaced the code I tried for this).

So far the most promising result has come from the nested for loops , but I know it's terribly inefficient and I'm hoping that there's a better way. I'm still new to R, and I'm sure I'm tripping up on some syntax somewhere. Thanks in advance for any help you can provide!

Here is a way in dplyr to replicate the same process that you showed with base R

library(dplyr)
fake.dat %>% 
        summarise_each(funs(sum(.[(which.max(.)+1):n()]>5,
                na.rm=TRUE)))
#   samp1 samp2 samp3 samp4
#1     2     2     4     3

If you need it as two steps:

datNA <- fake.dat %>% 
               mutate_each(funs(replace(., seq_len(which.max(.)), NA)))

datNA %>% 
      summarise_each(funs(sum(.>5, na.rm=TRUE)))

Here's one approach using data.table :

library(data.table)
##
data <- data.frame(
  samp1=c(5,6,7,5,4,5,10,5,6,7), 
  samp2=c(2,3,4,6,7,9,2,3,7,8), 
  samp3=c(2,3,4,11,7,9,2,3,7,8),
  samp4=c(5,6,7,5,4,12,10,5,6,7))
##
Dt <- data.table(data)
##
R> Dt[,lapply(.SD,function(x){
    y <- x[(which.max(x)+1):.N]
    length(y[y>5])
  })
   samp1 samp2 samp3 samp4
1:     2     2     4     3

A single-liner in base R:

vapply(fake.dat,function(x) sum(x[(which.max(x)+1):length(x)]>5),1L)
#samp1 samp2 samp3 samp4 
#    2     2     4     3

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM