I have four main variables in my dataset (dat).
For each combination of variables 1, 2 and 3 I want to change the reaction time, so that all values above the 3rd Quartile + 1.5IQR are set to the value of 3rd Quartile + 1.5 IQR.
TUK <- function (a,b,c) {
....
}
Basically, the for loop logic would be:
for (i in unique(dat$SubjectID)) {
  for (j in unique(dat$Group)) {
    for (k in unique(dat$Object)) {
      TUK(i, j, k)
    }
  }
}
How can I do this with the apply family of functions?
Thank you!
Adding reproducible example:
SubjectID <- c(3772113,3772468)
Group <- c("Easy","Hard")
Object <- c("A","B")
dat <- data.frame(expand.grid(SubjectID,Group,Object))
dat$RT <- rnorm(8,1500,700)
colnames(dat) <- c("SubjectID","Group","Object","RT")
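TUK <- function (SUBJ,GROUP,OBJECT){
  # RT values for this subject/group/object combination (a plain numeric vector)
  p <- dat[dat$SubjectID==SUBJ & dat$Group== GROUP & dat$Object==OBJECT, "RT"]
  # p is a vector, so index it directly rather than with p$RT
  p[p < 1000 | p > 2000] <- NA
  dat[dat$SubjectID==SUBJ & dat$Group== GROUP & dat$Object==OBJECT, "RT"] <<- p
}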
A big part of your problem is that your TUK function is terrible. Here are some reasons why:
Problem: It depends on having a data frame named dat in the global environment. Change the name of your data and it breaks. Solution: dat should be an argument.
Problem: Global assignment (<<-) should be avoided. There are certain advanced cases where it is necessary (e.g., sometimes in Shiny apps), but in general it makes a function behave in very un-R-like ways. Solution: return() a value and assign it like any other normal R function.
Problem: It's over-complicated. You're passing in SUBJ, GROUP, and OBJECT but only using them to subset: you're trying to do inside your function the "grouping" bit that dplyr, data.table, or base::ave excels at. It's as if you're trying to build your function so that it could only possibly be used embedded in this particular for loop. Solution: use dplyr, data.table, or ave (or even a for loop) to do the split-apply-combine part. This also makes your function more generally useful instead of being cemented to this one particular case.
With the above in mind, here's an attempted re-write:
TUK2 <- function (RT){
RT[RT < 1000 | RT > 2000] <- NA
return(RT)
}
See how much simpler! Now if we want to apply this function to each of the GROUP:SUBJ:OBJECT groupings in your data and store the result next to RT, we do this with dplyr (the result goes into a new column, new_RT; name it RT instead to overwrite the original column):
library(dplyr)
group_by(dat, Group, SubjectID, Object) %>%
mutate(new_RT = TUK2(RT))
dplyr does the grouping of data, the splitting of data, applies the simple function to each piece, and combines it all back together for us.
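The same split-apply-combine steps can also be spelled out in base R, which is closer to the apply-family approach the question asks about. This is only a sketch under assumptions: toy is a made-up data frame that reuses the column names of dat, and TUK2 is repeated so the snippet runs on its own.

```r
# Hypothetical toy data with the same columns as dat in the question
toy <- data.frame(
  SubjectID = c(1, 1, 1, 2, 2, 2),
  Group     = c("Easy", "Easy", "Easy", "Hard", "Hard", "Hard"),
  Object    = c("A", "A", "A", "B", "B", "B"),
  RT        = c(900, 1500, 2500, 1200, 1999, 2001)
)

# Same simple building block as above: works on a plain vector
TUK2 <- function(RT) {
  RT[RT < 1000 | RT > 2000] <- NA
  RT
}

# Split: one RT vector per SubjectID/Group/Object combination
groups <- interaction(toy$SubjectID, toy$Group, toy$Object, drop = TRUE)
pieces <- split(toy$RT, groups)

# Apply: run the function on each piece
cleaned <- lapply(pieces, TUK2)

# Combine: put the pieces back in the original row order
toy$new_RT <- unsplit(cleaned, groups)
```

The grouped dplyr call does roughly this in one step; the base R version just makes the three stages visible.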
Now, in your question, you said
For each combination of variables 1, 2 and 3 I want to change the reaction time, so that all values above the 3rd Quartile + 1.5IQR are set to the value of 3rd Quartile + 1.5 IQR.
This doesn't sound much like what your function does. Based only on this description, I would code this as
group_by(dat, Group, SubjectID, Object) %>%
mutate(new_RT = pmin(RT, quantile(RT, probs = 0.75) + 1.5 * IQR(RT)))
pmin is for parallel minimum; it's a vectorized way to take the element-wise smaller of two vectors. Try, e.g., pmin(1:10, 7) to see what it does.
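To make the capping rule concrete, here is a small worked sketch that also uses base::ave for the grouping instead of dplyr. The names cap_upper, rt, and toy are made up for illustration, and the numbers are chosen so the arithmetic is easy to follow with R's default quantile method.

```r
# Hypothetical helper: cap values above Q3 + 1.5 * IQR at that limit
cap_upper <- function(x) {
  upper <- quantile(x, probs = 0.75, names = FALSE) + 1.5 * IQR(x)
  pmin(x, upper)
}

# For 1, 2, 3, 4, 100: Q1 = 2, Q3 = 4, IQR = 2, so the limit is 4 + 1.5 * 2 = 7
rt <- c(1, 2, 3, 4, 100)
capped <- cap_upper(rt)

# Made-up grouped data: only the first group contains a value above its limit
toy <- data.frame(
  SubjectID = rep(1, 10),
  Group     = rep("Easy", 10),
  Object    = rep(c("A", "B"), each = 5),
  RT        = c(1, 2, 3, 4, 100, 10, 20, 30, 40, 50)
)

# ave() applies cap_upper within each group and keeps the original row order
grp <- interaction(toy$SubjectID, toy$Group, toy$Object, drop = TRUE)
toy$RT_capped <- ave(toy$RT, grp, FUN = cap_upper)
```

In the second group the limit is 40 + 1.5 * 20 = 70, so none of its values change.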
In both examples, the dplyr data frame won't be saved, of course, unless you re-assign it with dat <- group_by(dat, ...) etc. This is the functional programming way of doing things: no global assignment.
One additional note: with the re-written function you could still use loops instead of dplyr. I don't know why you would (surely the dplyr syntax is nicer), but I just want to illustrate that the small building-block function is generally useful; it's not "baking in" dplyr in the way that your original function was "baking in" a particular for loop.
for (sub in unique(dat$SubjectID)) {
  for (obj in unique(dat$Object)) {
    for (grp in unique(dat$Group)) {
      dat[dat$SubjectID == sub &
            dat$Object == obj &
            dat$Group == grp, "RT"] <-
        TUK2(
          dat[dat$SubjectID == sub &
                dat$Object == obj &
                dat$Group == grp, "RT"]
        )
    }
  }
}