[R]: Applying a function to columns based on conditional row position

I am trying to count, for each column of a data frame, the observations that meet a certain condition and occur after that column's maximum.

Here is a highly simplified example:

fake.dat<-data.frame(samp1=c(5,6,7,5,4,5,10,5,6,7), samp2=c(2,3,4,6,7,9,2,3,7,8), samp3=c(2,3,4,11,7,9,2,3,7,8),samp4=c(5,6,7,5,4,12,10,5,6,7))

       samp1 samp2 samp3 samp4
1      5     2     2     5
2      6     3     3     6
3      7     4     4     7
4      5     6    11     5
5      4     7     7     4
6      5     9     9    12
7     10     2     2    10
8      5     3     3     5
9      6     7     7     6
10     7     8     8     7

So, let's say I'm trying to find the number of observations per column that are greater than 5, after excluding all the observations in a column up to and including the row where the column's maximum occurs.

Expected outcome:

samp1 samp2 samp3 samp4 
   2     2     4    3 

I am able to get the answer I want by using nested for loops to exclude the observations I don't want.

max.row<-sapply(fake.dat,which.max)  # row of each column's maximum
newfake.dat<-data.frame()

for(j in 1:length(fake.dat)){
for(i in 1:nrow(fake.dat)){
    # copy only the rows after the column maximum; earlier rows stay NA
    if(i>max.row[j]) newfake.dat[i,j]<-fake.dat[i,j]
}}
print(newfake.dat)

This creates a new data frame on which I can run a simple apply call.

colcount<-apply(newfake.dat,2,function(x) (sum(x>5,na.rm=TRUE)))

   V1 V2 V3 V4
1  NA NA NA NA
2  NA NA NA NA
3  NA NA NA NA
4  NA NA NA NA
5  NA NA  7 NA
6  NA NA  9 NA
7  NA  2  2 10
8   5  3  3  5
9   6  7  7  6
10  7  8  8  7

V1 V2 V3 V4 
 2  2  4  3 

That is all well and good for this tiny example dataset, but it is prohibitively slow on anything approaching the size of my real datasets, which are large (2000 x 2000 or larger) and numerous. I tried it with a truncated version of one of my files (fewer columns, but the same number of rows) and it ran for at least 5 hours (it was still running when I left work for the day). Also, I don't need the new data frame for anything other than running the apply function.

Is there any way to do this more efficiently? I tried limiting the rows that the apply function works on by using seq and the row number of the max.

maxrow<-apply(fake.dat,2,function(x) which.max(x))
print(maxrow)

seq.att<-apply(fake.dat,2,function(x) {
    sum(x[which(seq(1,nrow(fake.dat))==(maxrow)):nrow(fake.dat)]>5,na.rm=TRUE)})

This produces four instances of the following warning message:

1: In seq(1, nrow(fake.dat)) == (maxrow) :
  longer object length is not a multiple of shorter object length

If I ignore the warning message and look at the output anyway, it doesn't give me the answer I expected:

samp1 samp2 samp3 samp4 
    2     3     3     3 
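
The warning points at the comparison `seq(1, nrow(fake.dat)) == (maxrow)`: inside the anonymous function, `maxrow` is still the whole length-4 vector, so it is recycled against the length-10 sequence instead of supplying the current column's own max row. A minimal sketch of one way to pair each column with its own index, using `mapply` (the name `seq.fix` is made up for illustration, and the columns are assumed to contain at least one non-NA value):

```r
fake.dat <- data.frame(
  samp1 = c(5,6,7,5,4,5,10,5,6,7),
  samp2 = c(2,3,4,6,7,9,2,3,7,8),
  samp3 = c(2,3,4,11,7,9,2,3,7,8),
  samp4 = c(5,6,7,5,4,12,10,5,6,7))

# Row index of each column's maximum (named integer vector, one entry per column)
maxrow <- vapply(fake.dat, which.max, integer(1))

# Walk the columns and their max rows in parallel: drop rows 1..m, then count
seq.fix <- mapply(function(x, m) sum(x[-seq_len(m)] > 5, na.rm = TRUE),
                  fake.dat, maxrow)
print(seq.fix)
```

Because each call sees a single value of `m`, no recycling takes place and no intermediate data frame is built.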

I also tried using a while loop, which kept cycling, so I stopped it (I misplaced the code I tried for this).

So far the most promising result has come from the nested for loops, but I know it's terribly inefficient and I'm hoping that there's a better way. I'm still new to R, and I'm sure I'm tripping up on some syntax somewhere. Thanks in advance for any help you can provide!

Here is a way in dplyr to replicate the same process that you showed with base R:

library(dplyr)
fake.dat %>% 
        summarise_each(funs(sum(.[(which.max(.)+1):n()]>5,
                na.rm=TRUE)))
#   samp1 samp2 samp3 samp4
#1     2     2     4     3

If you need it as two steps:

datNA <- fake.dat %>% 
               mutate_each(funs(replace(., seq_len(which.max(.)), NA)))

datNA %>% 
      summarise_each(funs(sum(.>5, na.rm=TRUE)))
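
Note that `summarise_each()` and `funs()` are deprecated in current dplyr. Assuming dplyr 1.0.0 or later is installed, the one-step version could be sketched with `across()` roughly as follows (the name `res` is only for illustration):

```r
library(dplyr)

fake.dat <- data.frame(
  samp1 = c(5,6,7,5,4,5,10,5,6,7),
  samp2 = c(2,3,4,6,7,9,2,3,7,8),
  samp3 = c(2,3,4,11,7,9,2,3,7,8),
  samp4 = c(5,6,7,5,4,12,10,5,6,7))

# For every column: keep the rows after the maximum, count values above 5
res <- fake.dat %>%
  summarise(across(everything(),
                   ~ sum(.x[(which.max(.x) + 1):n()] > 5, na.rm = TRUE)))
print(res)
```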

Here's one approach using data.table:

library(data.table)
##
data <- data.frame(
  samp1=c(5,6,7,5,4,5,10,5,6,7), 
  samp2=c(2,3,4,6,7,9,2,3,7,8), 
  samp3=c(2,3,4,11,7,9,2,3,7,8),
  samp4=c(5,6,7,5,4,12,10,5,6,7))
##
Dt <- data.table(data)
##
Dt[,lapply(.SD,function(x){
    # keep the rows after the column maximum, then count values above 5
    y <- x[(which.max(x)+1):.N]
    length(y[y>5])
  })]
   samp1 samp2 samp3 samp4
1:     2     2     4     3

A one-liner in base R:

vapply(fake.dat,function(x) sum(x[(which.max(x)+1):length(x)]>5),1L)
#samp1 samp2 samp3 samp4 
#    2     2     4     3
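
One caveat with indexing by `(which.max(x)+1):length(x)`: when a column's maximum sits in its last row, the range counts downward (for example `4:3`), so the result is `NA`, or a wrong count if `na.rm=TRUE` is used. A hedged sketch of a variant that drops the leading rows with negative indexing instead (the helper name `count_after_max` and its `threshold` argument are made up for illustration, and each column is assumed to contain at least one non-NA value):

```r
fake.dat <- data.frame(
  samp1 = c(5,6,7,5,4,5,10,5,6,7),
  samp2 = c(2,3,4,6,7,9,2,3,7,8),
  samp3 = c(2,3,4,11,7,9,2,3,7,8),
  samp4 = c(5,6,7,5,4,12,10,5,6,7))

# Hypothetical helper: count values above `threshold` after the first maximum
count_after_max <- function(x, threshold = 5) {
  rest <- x[-seq_len(which.max(x))]  # empty vector when the max is the last row
  sum(rest > threshold, na.rm = TRUE)
}

counts <- vapply(fake.dat, count_after_max, integer(1))
print(counts)

# Edge case: the maximum is the last element, so nothing follows it
edge <- count_after_max(c(1, 2, 9))
print(edge)
```

On the example data this gives the same counts as the one-liner, and the edge case yields 0 because no rows remain after the maximum.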
