
Efficient dataframe looping in R

I would like to loop through the following data.frame and group sequential entries together, as determined by the value in X2. In the following data.frame we can see four groups: 1-3, 5-6, 9-13, and 16. There could be any combination of group sizes and number of groups.

                                            X1 X2               X3                       X4
1   1_21/08/2014 22:56CONTENT_ACCESS.preparing  1 21/08/2014 22:56 CONTENT_ACCESS.preparing
2   2_21/08/2014 22:57CONTENT_ACCESS.preparing  2 21/08/2014 22:57 CONTENT_ACCESS.preparing
3   3_21/08/2014 22:58CONTENT_ACCESS.preparing  3 21/08/2014 22:58 CONTENT_ACCESS.preparing
4   5_21/08/2014 23:07CONTENT_ACCESS.preparing  5 21/08/2014 23:07 CONTENT_ACCESS.preparing
5   6_21/08/2014 23:08CONTENT_ACCESS.preparing  6 21/08/2014 23:08 CONTENT_ACCESS.preparing
6   9_21/08/2014 23:29CONTENT_ACCESS.preparing  9 21/08/2014 23:29 CONTENT_ACCESS.preparing
7  10_21/08/2014 23:30CONTENT_ACCESS.preparing 10 21/08/2014 23:30 CONTENT_ACCESS.preparing
8  11_21/08/2014 23:31CONTENT_ACCESS.preparing 11 21/08/2014 23:31 CONTENT_ACCESS.preparing
9  12_21/08/2014 23:33CONTENT_ACCESS.preparing 12 21/08/2014 23:33 CONTENT_ACCESS.preparing
10 13_21/08/2014 23:34CONTENT_ACCESS.preparing 13 21/08/2014 23:34 CONTENT_ACCESS.preparing
11 16_21/08/2014 23:40CONTENT_ACCESS.preparing 16 21/08/2014 23:40 CONTENT_ACCESS.preparing

I would like to capture the timestamps in X3 so that each group is described by its time range (i.e. the first and last timestamp of the group), producing the output below, where start_ts is the first timestamp and stop_ts is the last in each group:

student_id session_id start_ts           stop_ts             week micro_process
1          4         16 21/08/2014 22:56 21/08/2014 22:58    4          TASK
2          4         16 21/08/2014 23:07 21/08/2014 23:08    4          TASK
3          4         16 21/08/2014 23:29 21/08/2014 23:34    4          TASK
4          4         16 21/08/2014 23:40 21/08/2014 23:40    4          TASK

I haven't yet attempted the loop but would like to see how to do it without traditional looping. My code currently only captures the range of the whole data frame:

  student_id session_id         start_ts          stop_ts week micro_process
1          4         16 21/08/2014 22:58 21/08/2014 23:30    4          TASK

The other variables (student ID etc.) are dummy values in my example and are not strictly relevant, but I would like to leave them in for completeness.

Code (which can be run directly):

library(stringr)
options(stringsAsFactors = FALSE) 

eventised_session <- data.frame(student_id=integer(),
                                session_id=integer(), 
                                start_ts=character(),
                                stop_ts=character(),
                                week=integer(),
                                micro_process=character())

string_match.df <- structure(list(X1 = c("1_21/08/2014 22:56CONTENT_ACCESS.preparing", 
                                         "2_21/08/2014 22:57CONTENT_ACCESS.preparing", "3_21/08/2014 22:58CONTENT_ACCESS.preparing", 
                                         "5_21/08/2014 23:07CONTENT_ACCESS.preparing", "6_21/08/2014 23:08CONTENT_ACCESS.preparing", 
                                         "9_21/08/2014 23:29CONTENT_ACCESS.preparing", "10_21/08/2014 23:30CONTENT_ACCESS.preparing", 
                                         "11_21/08/2014 23:31CONTENT_ACCESS.preparing", "12_21/08/2014 23:33CONTENT_ACCESS.preparing", 
                                         "13_21/08/2014 23:34CONTENT_ACCESS.preparing", "16_21/08/2014 23:40CONTENT_ACCESS.preparing"
), X2 = c("1", "2", "3", "5", "6", "9", "10", "11", "12", "13", 
          "16"), X3 = c("21/08/2014 22:56", "21/08/2014 22:57", "21/08/2014 22:58", 
                        "21/08/2014 23:07", "21/08/2014 23:08", "21/08/2014 23:29", "21/08/2014 23:30", 
                        "21/08/2014 23:31", "21/08/2014 23:33", "21/08/2014 23:34", "21/08/2014 23:40"
          ), X4 = c("CONTENT_ACCESS.preparing", "CONTENT_ACCESS.preparing", 
                    "CONTENT_ACCESS.preparing", "CONTENT_ACCESS.preparing", "CONTENT_ACCESS.preparing", 
                    "CONTENT_ACCESS.preparing", "CONTENT_ACCESS.preparing", "CONTENT_ACCESS.preparing", 
                    "CONTENT_ACCESS.preparing", "CONTENT_ACCESS.preparing", "CONTENT_ACCESS.preparing"
          )), .Names = c("X1", "X2", "X3", "X4"), row.names = c(NA, -11L
          ), class = "data.frame")

r_student_id <- 4
r_session_id <- 16
r_week <- 4
r_mic_proc <- "TASK"

string_match.df

#Get the first and last timestamp in matched sequence
r_start_ts <- string_match.df[1, ncol(string_match.df)-1]
r_stop_ts <- string_match.df[nrow(string_match.df), ncol(string_match.df)-1]

eventised_session[nrow(eventised_session)+1,] <- c(r_student_id, r_session_id, r_start_ts, r_stop_ts, r_week, r_mic_proc)

eventised_session

I would appreciate your expertise on this one. I have only ever used traditional loops.

We convert X2 to numeric and subtract off a sequence, so that adjacent numbers are converted to the same group number. Since the column names you reference differ from the names of your example data, I'm guessing at the end result (based on the other answer):

string_match.df$X2 = as.numeric(string_match.df$X2)
# consecutive X2 values minus their row position collapse to the same number,
# so each run of consecutive entries gets its own grp value
string_match.df$grp = string_match.df$X2 - 1:nrow(string_match.df)
string_match.df

library(dplyr)
string_match.df %>%
  group_by(grp) %>% 
  summarize(start = first(X3), stop = last(X3))
#     grp start            stop            
#   <dbl> <chr>            <chr>           
# 1     0 21/08/2014 22:56 21/08/2014 22:58
# 2     1 21/08/2014 23:07 21/08/2014 23:08
# 3     3 21/08/2014 23:29 21/08/2014 23:34
# 4     5 21/08/2014 23:40 21/08/2014 23:40

As a side note, be careful with the term "matrix". You used the matrix tag and used the word matrix several times in your question, but you don't have a matrix, nor should you be using one. You have a data.frame. In a matrix, all data must be the same type. In a data frame, the columns can have different types. Here you have a numeric column, two string columns, and one datetime column, so a matrix would be a poor choice. A data frame, where each of those columns can be of the appropriate class, is much better.
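
To see the difference, here is a tiny sketch (the values are made up for illustration):

m <- cbind(id = 1:3, ts = c("22:56", "22:57", "22:58"))
is.matrix(m)      # TRUE
typeof(m[, 1])    # "character": the numeric ids were silently coerced to character

d <- data.frame(id = 1:3, ts = c("22:56", "22:57", "22:58"))
sapply(d, class)  # id is "integer", ts is "character": each column keeps its own type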

I'm using a shorter name for the data, and converting df$X2 to numeric:

df <- string_match.df  # as defined in OP
df$X2 <- as.numeric(df$X2)

You can split your data frame using a combination of cumsum and diff:

cumsum(diff(c(0, df$X2)) > 1)
#  [1] 0 0 0 1 1 2 2 2 2 2 3
# this presumes that df$X2[1] is 1; for a general case use:
#  cumsum(diff(c(df$X2[1] - 1, df$X2)) > 1)

And now just use split and lapply:

do.call(rbind, lapply(split(df, cumsum(diff(c(0, df$X2)) > 1)), function(x) {
  foo <- x$X3
  data.frame(start_ts = foo[1], stop_ts = tail(foo, 1))
}))
# output:
          start_ts          stop_ts
0 21/08/2014 22:56 21/08/2014 22:58
1 21/08/2014 23:07 21/08/2014 23:08
2 21/08/2014 23:29 21/08/2014 23:34
3 21/08/2014 23:40 21/08/2014 23:40

The rest is a question of formatting the output as you wish.
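
For instance, a sketch that reuses the constant values from your own code (r_student_id, r_session_id, r_week, r_mic_proc) to match the shape of your desired output:

groups <- split(df, cumsum(diff(c(0, df$X2)) > 1))
out <- do.call(rbind, lapply(groups, function(x) {
  data.frame(student_id = r_student_id, session_id = r_session_id,
             start_ts = x$X3[1], stop_ts = tail(x$X3, 1),
             week = r_week, micro_process = r_mic_proc)
}))
rownames(out) <- NULL  # drop the "0", "1", ... group names
out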

Your new question can be done pretty easily in the tidyverse. The main thing you have to do is divide your observations into groups based on the timestamp variable. I assumed that the rule would be to start a new group if more than 2 minutes passed since the last observation. You can change that easily if you need to.

Once the observations are grouped, you can simply use summarize to return the results on calculations by group (in this case, the first and last timepoints):

library(dplyr)
library(lubridate)

string_match.df %>%
    select('id' = X2,                               # Select and rename variables
           'timestamp' = X3) %>%
    mutate(timestamp = dmy_hm(timestamp),           # Parse timestamp as date-time
           time_diff = timestamp - lag(timestamp),  # Time elapsed since last obs.
           new_obs = (time_diff > 2) |              # New group if >2 min from last one
                     is.na(time_diff),              #   or if it's the 1st obs.
           group_id = cumsum(new_obs)) %>%          # Running count gives the group ID
    group_by(group_id) %>%                          # Group by 'group_id'
    summarize(start_ts = min(timestamp),            # Then return the first and last
              stop_ts = max(timestamp))             #   timestamps for each group

# A tibble: 4 x 3
  group_id start_ts            stop_ts            
     <int> <dttm>              <dttm>             
1        1 2014-08-21 22:56:00 2014-08-21 22:58:00
2        2 2014-08-21 23:07:00 2014-08-21 23:08:00
3        3 2014-08-21 23:29:00 2014-08-21 23:34:00
4        4 2014-08-21 23:40:00 2014-08-21 23:40:00
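
If the two-minute rule ever needs to change, only the new_obs line does. Here is a sketch with a hypothetical five-minute cutoff, converting the gap to minutes explicitly so the comparison does not depend on the units R picks for the difftime:

string_match.df %>%
    mutate(timestamp = dmy_hm(X3),
           gap_mins  = as.numeric(timestamp - lag(timestamp), units = "mins"),
           new_obs   = gap_mins > 5 | is.na(gap_mins),  # assumed 5-minute cutoff
           group_id  = cumsum(new_obs)) %>%
    group_by(group_id) %>%
    summarize(start_ts = min(timestamp), stop_ts = max(timestamp))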

Since there was no discussion in your question about how student_id, session_id, week, and micro_process are determined, I left them out from my example. You can easily add them onto the table after, or add new rules to the summarize call if they are determined by parsing data for the group.
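
For example, if the summarized tibble is saved as result (a name chosen here just for illustration), the constant values from your question can be bolted on with mutate:

result %>%
    mutate(student_id = 4, session_id = 16, week = 4, micro_process = "TASK") %>%
    select(student_id, session_id, start_ts, stop_ts, week, micro_process)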
