How to replace multiple nested for loops with apply family functions in R?

I have four main variables in my dataset (dat).

  1. SubjectID
  2. Group (can be Easy1, Easy2, Hard1, Hard2)
  3. Object (x, y, z, w)
  4. Reaction time

For each combination of variables 1, 2 and 3 I want to change the reaction time, so that all values above the 3rd Quartile + 1.5 IQR are set to the value of 3rd Quartile + 1.5 IQR.

TUK <- function (a,b,c) {
....
}

Basically, the for loop logic would be:

for (i in unique(dat$SubjectID))
  for (j in unique(dat$Group))
    for (k in unique(dat$Object))
      TUK(i, j, k)

How can I do this with the apply function family?

Thank you!

Adding a reproducible example:

SubjectID <- c(3772113,3772468)
Group <- c("Easy","Hard")
Object <- c("A","B")
dat <- data.frame(expand.grid(SubjectID,Group,Object))
dat$RT <- rnorm(8,1500,700)
colnames(dat) <- c("SubjectID","Group","Object","RT")

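TUK <- function (SUBJ,GROUP,OBJECT){
  # RT values for one SubjectID/Group/Object combination (a plain numeric vector)
  p <- dat[dat$SubjectID==SUBJ & dat$Group== GROUP & dat$Object==OBJECT, "RT"]

  # p is a vector, so index it directly rather than with p$RT
  p[p < 1000 | p > 2000] <- NA

  # write the result back into the global dat
  dat[dat$SubjectID==SUBJ & dat$Group== GROUP & dat$Object==OBJECT, "RT"] <<- p
}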

A big part of your problem is that your TUK function is terrible. Here are some reasons why:

  • Problem: it depends on having a data frame named dat in the global environment. Change the name of your data and it breaks.

    • Solution: you should pass in all arguments needed. In this case, dat should be an argument.
  • Problem: global assignment (<<-) should be avoided. There are certain advanced cases where it is necessary (e.g., sometimes in Shiny apps), but in general it makes a function behave in very un-R-like ways.

    • Solution: simply return() a value and assign it like any other normal R function.
  • Problem: it's over-complicated. By passing in SUBJ, GROUP, and OBJECT but only using them to subset, you're trying to do inside your function the "grouping" bit that dplyr, data.table, or base::ave excels at. It's as if you're trying to build your function so that it could only possibly be used embedded in this particular for loop.

    • Solution: functions should be simple building blocks. Make this a function of just a single vector. It will be much cleaner and easier to debug. Once it works on a single vector, use dplyr, data.table, or ave (or even a for loop) to do the split-apply-combine step. This also makes your function more generally useful instead of being cemented to this one particular case.

With the above in mind, here's an attempted re-write:

TUK2 <- function (RT){
  RT[RT < 1000 | RT > 2000] <- NA
  return(RT)
}

See how much simpler! Now if we want to apply this function to each of the GROUP:SUBJ:OBJECT groupings in your data and store the result in a column (new_RT here; use RT = TUK2(RT) instead to overwrite the original), we do this with dplyr:

library(dplyr)
group_by(dat, Group, SubjectID, Object) %>%
    mutate(new_RT = TUK2(RT))

dplyr groups the data, splits it, applies the simple function to each piece, and combines it all back together for us.
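For anyone who would rather stay in base R, the same split-apply-combine step can be sketched with ave(). This is only a minimal sketch: the data and TUK2 are repeated from above so the block runs on its own, the set.seed() call is an addition for reproducibility, and the new_RT column name simply mirrors the dplyr example.

```r
# Rebuild the example data from the question (seed added for reproducibility)
set.seed(1)
dat <- expand.grid(SubjectID = c(3772113, 3772468),
                   Group = c("Easy", "Hard"),
                   Object = c("A", "B"))
dat$RT <- rnorm(nrow(dat), 1500, 700)

# Single-vector building block from the answer
TUK2 <- function(RT) {
  RT[RT < 1000 | RT > 2000] <- NA
  RT
}

# ave() splits RT by the grouping variables, applies FUN to each piece,
# and returns the results in the original row order
dat$new_RT <- ave(dat$RT, dat$SubjectID, dat$Group, dat$Object, FUN = TUK2)
```

Because ave() returns a plain vector of the same length as its input, it can be assigned straight into a column without any global assignment.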


Now, in your question, you said:

For each combination of variables 1, 2 and 3 I want to change the reaction time, so that all values above the 3rd Quartile + 1.5 IQR are set to the value of 3rd Quartile + 1.5 IQR.

This doesn't sound much like what your function does. Based only on this description, I would code this as:

group_by(dat, Group, SubjectID, Object) %>%
    mutate(new_RT = pmin(RT, quantile(RT, probs = 0.75) + 1.5 * IQR(RT)))

pmin stands for parallel minimum; it's a vectorized way to take the element-wise smaller of two vectors. Try, e.g., pmin(1:10, 7) to see what it does.
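To make the capping step concrete, here is a small sketch on a single vector. The helper name cap_upper_fence and the toy vector x are made up for illustration; the fence formula is the one from the question (3rd quartile + 1.5 IQR), computed with R's default quantile type.

```r
# pmin() compares element by element; the scalar 7 is recycled
pmin(1:10, 7)
# → 1 2 3 4 5 6 7 7 7 7

# Hypothetical helper: cap values above the upper fence (Q3 + 1.5 * IQR)
cap_upper_fence <- function(x) {
  fence <- quantile(x, probs = 0.75, names = FALSE) + 1.5 * IQR(x)
  pmin(x, fence)
}

x <- c(1, 2, 3, 4, 100)
cap_upper_fence(x)
# Q3 = 4 and IQR = 2, so the fence is 7 and the result is 1 2 3 4 7
```

Written this way, the helper is another single-vector building block and could be dropped into mutate() in place of the inline expression.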

In both examples, the result won't be saved, of course, unless you re-assign it with dat <- group_by(dat, ...) etc. This is the functional programming way of doing things: no global assignment.


One additional note: with the re-written function you could still use loops instead of dplyr. I don't know why you would (surely the dplyr syntax is nicer), but I just want to illustrate that the small building-block function is generally useful; it's not "baking in" dplyr in the way that your original function was "baking in" a particular for loop.

for (sub in unique(dat$SubjectID)) {
  for (obj in unique(dat$Object)) {
    for (grp in unique(dat$Group)) {
      dat[dat$SubjectID == sub & 
            dat$Object == obj & 
            dat$Group == grp, "RT"] <-
        TUK2(
          dat[dat$SubjectID == sub & 
                dat$Object == obj & 
                dat$Group == grp, "RT"]
        )
    }
  }
}
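Since the question asked specifically about the apply family, the nested loop can also be sketched with split(), lapply(), and unsplit() from base R. This is one possible sketch rather than the only way to do it: the data and TUK2 are repeated so the block is self-contained, and the seed and the RT_clean column name are assumptions added for illustration.

```r
# Example data from the question (seed added for reproducibility)
set.seed(1)
dat <- expand.grid(SubjectID = c(3772113, 3772468),
                   Group = c("Easy", "Hard"),
                   Object = c("A", "B"))
dat$RT <- rnorm(nrow(dat), 1500, 700)

# Single-vector building block from the answer
TUK2 <- function(RT) {
  RT[RT < 1000 | RT > 2000] <- NA
  RT
}

# One grouping factor per loop variable
groups <- list(dat$SubjectID, dat$Group, dat$Object)

# split: one RT vector per SubjectID/Group/Object combination
pieces <- split(dat$RT, groups)

# apply: run the building block on each piece
pieces <- lapply(pieces, TUK2)

# combine: put the pieces back in the original row order
dat$RT_clean <- unsplit(pieces, groups)
```

Each of the three calls corresponds to one stage of split-apply-combine, which is the work the three nested loops were doing by hand.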
