
Running a custom loop inside ddply

My data set has about 54,000 rows. I want to set a value (First_Pass) to either T or F depending upon both a value in another column and also whether or not that other column's value has been seen before. I have a for loop that does exactly what I need it to do. However, that loop is only for a subset of the data. I need that same for loop to be run individually for different subsets based upon factor levels.

This seems like the perfect case for the plyr functions as I want to split the data into subsets, apply a function (my for loop) and then rejoin the data. However, I cannot get it to work. First, I give a sample of the df, called char.data.

     session_id list Sent_Order Sentence_ID Cond1 Cond2 Q_ID   Was_y CI CI_Delta character tsle tsoc Direct
5139          2    b          9          25    rc    su   25 correct  1        0         T  995   56      R
5140          2    b          9          25    rc    su   25 correct  2        1         h   56   56      R
5141          2    b          9          25    rc    su   25 correct  3        1         e   56   56      R
5142          2    b          9          25    rc    su   25 correct  4        1             56   37      R

There is some clutter in there. The key columns are session_id, Sentence_ID, CI, and CI_Delta.

I then initialise a column called First_Pass to "F":

char.data$First_Pass <- "F"

I want to now calculate when First_Pass is actually "T" for each combination of session_id and Sentence_ID. I created a toy set, which is just one subset to work out the overall logic. Here's the code of a for loop that gives me just what I want for the toy data.

char.data.toy$First_Pass <- "F"
l <-c(200)
for (i in 1:nrow(char.data.toy)) {
  if(char.data.toy[i,]$CI_Delta >= 0 & char.data.toy[i,]$CI %nin% l){
    char.data.toy[i,]$First_Pass <- "T"
    l <- c(l,char.data.toy[i,]$CI)}
}

I now want to take this loop and run it for every session_id and Sentence_ID subset. I've created a function called set_fp and then called it inside ddply. Here is that code:

#define function
set_fp <- function (df){

  l <- 200
  for (i in 1:nrow(df)) {
    if(df[i,]$CI_Delta >= 0 & df[i,]$CI %nin% l){
      df[i,]$First_Pass <- "T"
      l <- c(l,df[i,]$CI)}
    else df[i,]$First_Pass <- "F"
    return(df)
  }

}

char.data.fp <- ddply(char.data,c("session_id","Sentence_ID"),function(df)set_fp(df))

Unfortunately, this is not quite right. For a long time, I was getting all "F" values for First_Pass. Now I'm getting 24 "T" values when there should be many more, so I suspect it's only keeping the last subset or something similar. Help?

This is a little hard to test with only the four rows that you've provided. I created random data to check it, and it seems to work for me. Try it on your data too.

This uses the data.table library and doesn't try to run loops inside ddply. I'm assuming the means aren't important.

library(data.table)
dt <- data.table(char.data)  # char.data is the data frame from the question
l <- c(200)

# subsetting to keep only the important fields
dt <- dt[,list(session_id, Sentence_ID, CI, CI_Delta)]

# Initialising First_Pass    
dt[,First_Pass := 'F']

# The next two lines are basically rewording your logic -

# Within each group of session_id, Sentence_ID, identify the duplicate CI entries
# among the rows with CI_Delta >= 0. These are the values that would already have
# been inserted in l. The first occurrence of each CI value gets duplicatedCI = FALSE,
# because it wouldn't have been in l yet when that row was being checked.
dt[CI_Delta >= 0,duplicatedCI := duplicated(CI), by = c("session_id", "Sentence_ID")]

# So if the CI value hasn't occurred before within the session_id,Sentence_ID group, and it doesn't appear in l, then mark it as "T"
dt[!(CI %in% l) & !(duplicatedCI), First_Pass := "T"]

# Just for curiosity's sake, calculating l too
l <- c(l,dt[duplicatedCI == FALSE,CI])
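
Separately from the data.table approach, the set_fp function in the question appears to call return(df) inside the for loop, so it would exit after the first row of each subset and leave every later row at its initial "F". Below is a minimal sketch of a version that returns only after the loop has finished. The toy data frame is made up for illustration, `!(x %in% y)` stands in for Hmisc's `%nin%`, and base R's split/lapply is used in place of ddply so the snippet runs without extra packages.

```r
# Hypothetical toy data: two session_id/Sentence_ID groups, invented for illustration.
toy <- data.frame(
  session_id  = c(2, 2, 2, 2, 3, 3, 3),
  Sentence_ID = c(25, 25, 25, 25, 25, 25, 25),
  CI          = c(1, 2, 1, 3, 1, 1, 2),
  CI_Delta    = c(0, 1, -1, 2, 0, 0, 1)
)

set_fp <- function(df) {
  seen <- 200                      # same starting value as l in the question
  df$First_Pass <- "F"
  for (i in seq_len(nrow(df))) {
    # assumes CI_Delta and CI contain no NA values
    if (df$CI_Delta[i] >= 0 && !(df$CI[i] %in% seen)) {
      df$First_Pass[i] <- "T"
      seen <- c(seen, df$CI[i])
    }
  }
  df                               # returned once, after the loop has finished
}

# Base-R split/apply/combine. With plyr loaded, the equivalent call would be
# ddply(toy, c("session_id", "Sentence_ID"), set_fp)
groups <- split(toy, list(toy$session_id, toy$Sentence_ID), drop = TRUE)
result <- do.call(rbind, lapply(groups, set_fp))
```

Because seen is reset at the top of set_fp, each subset starts with a fresh list, which matches running the toy loop from the question once per session_id and Sentence_ID combination.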
