I would like to loop through the following data.frame and group by sequential entries, as determined by the value in X2. So in the following data.frame, we can see four groups: 1-3, 5-6, 9-13, and 16. We could have any combination of group sizes and number of groups.
X1 X2 X3 X4
1 1_21/08/2014 22:56CONTENT_ACCESS.preparing 1 21/08/2014 22:56 CONTENT_ACCESS.preparing
2 2_21/08/2014 22:57CONTENT_ACCESS.preparing 2 21/08/2014 22:57 CONTENT_ACCESS.preparing
3 3_21/08/2014 22:58CONTENT_ACCESS.preparing 3 21/08/2014 22:58 CONTENT_ACCESS.preparing
4 5_21/08/2014 23:07CONTENT_ACCESS.preparing 5 21/08/2014 23:07 CONTENT_ACCESS.preparing
5 6_21/08/2014 23:08CONTENT_ACCESS.preparing 6 21/08/2014 23:08 CONTENT_ACCESS.preparing
6 9_21/08/2014 23:29CONTENT_ACCESS.preparing 9 21/08/2014 23:29 CONTENT_ACCESS.preparing
7 10_21/08/2014 23:30CONTENT_ACCESS.preparing 10 21/08/2014 23:30 CONTENT_ACCESS.preparing
8 11_21/08/2014 23:31CONTENT_ACCESS.preparing 11 21/08/2014 23:31 CONTENT_ACCESS.preparing
9 12_21/08/2014 23:33CONTENT_ACCESS.preparing 12 21/08/2014 23:33 CONTENT_ACCESS.preparing
10 13_21/08/2014 23:34CONTENT_ACCESS.preparing 13 21/08/2014 23:34 CONTENT_ACCESS.preparing
11 16_21/08/2014 23:40CONTENT_ACCESS.preparing 16 21/08/2014 23:40 CONTENT_ACCESS.preparing
I would like to capture the timestamps in X3 so they describe the time range of each group (i.e. its first and last timestamp) and produce the output below, where start_ts is the first timestamp and stop_ts is the last in each group:
student_id session_id start_ts stop_ts week micro_process
1 4 16 21/08/2014 22:56 21/08/2014 22:58 4 TASK
2 4 16 21/08/2014 23:07 21/08/2014 23:08 4 TASK
3 4 16 21/08/2014 23:29 21/08/2014 23:34 4 TASK
4 4 16 21/08/2014 23:40 21/08/2014 23:40 4 TASK
I haven't attempted a loop yet, and would prefer to see how to do this without traditional looping. My code currently only captures the range of the whole data set:
student_id session_id start_ts stop_ts week micro_process
1 4 16 21/08/2014 22:56 21/08/2014 23:40 4 TASK
The other variables (student ID etc.) have been dummified in my example and are not strictly relevant but I would like to leave them in for completeness.
Code (which can be run directly):
library(stringr)
options(stringsAsFactors = FALSE)
eventised_session <- data.frame(student_id = integer(),
                                session_id = integer(),
                                start_ts = character(),
                                stop_ts = character(),
                                week = integer(),
                                micro_process = character())
string_match.df <- structure(list(X1 = c("1_21/08/2014 22:56CONTENT_ACCESS.preparing",
"2_21/08/2014 22:57CONTENT_ACCESS.preparing", "3_21/08/2014 22:58CONTENT_ACCESS.preparing",
"5_21/08/2014 23:07CONTENT_ACCESS.preparing", "6_21/08/2014 23:08CONTENT_ACCESS.preparing",
"9_21/08/2014 23:29CONTENT_ACCESS.preparing", "10_21/08/2014 23:30CONTENT_ACCESS.preparing",
"11_21/08/2014 23:31CONTENT_ACCESS.preparing", "12_21/08/2014 23:33CONTENT_ACCESS.preparing",
"13_21/08/2014 23:34CONTENT_ACCESS.preparing", "16_21/08/2014 23:40CONTENT_ACCESS.preparing"
), X2 = c("1", "2", "3", "5", "6", "9", "10", "11", "12", "13",
"16"), X3 = c("21/08/2014 22:56", "21/08/2014 22:57", "21/08/2014 22:58",
"21/08/2014 23:07", "21/08/2014 23:08", "21/08/2014 23:29", "21/08/2014 23:30",
"21/08/2014 23:31", "21/08/2014 23:33", "21/08/2014 23:34", "21/08/2014 23:40"
), X4 = c("CONTENT_ACCESS.preparing", "CONTENT_ACCESS.preparing",
"CONTENT_ACCESS.preparing", "CONTENT_ACCESS.preparing", "CONTENT_ACCESS.preparing",
"CONTENT_ACCESS.preparing", "CONTENT_ACCESS.preparing", "CONTENT_ACCESS.preparing",
"CONTENT_ACCESS.preparing", "CONTENT_ACCESS.preparing", "CONTENT_ACCESS.preparing"
)), .Names = c("X1", "X2", "X3", "X4"), row.names = c(NA, -11L
), class = "data.frame")
r_student_id <- 4
r_session_id <- 16
r_week <- 4
r_mic_proc <- "TASK"
string_match.df
# Get the first and last timestamp in the matched sequence
# (column ncol - 1 is X3, the timestamp column)
r_start_ts <- string_match.df[1, ncol(string_match.df) - 1]
r_stop_ts <- string_match.df[nrow(string_match.df), ncol(string_match.df) - 1]
eventised_session[nrow(eventised_session)+1,] <- c(r_student_id, r_session_id, r_start_ts, r_stop_ts, r_week, r_mic_proc)
eventised_session
I would appreciate your expertise on this one. I have only ever used traditional loops.
We convert X2 to numeric and subtract off a sequence, so that runs of consecutive numbers all map to the same group value. Since the desired output references column names that differ from the names of the example data, I'm guessing at the end result (based on the other answer):
string_match.df$X2 = as.numeric(string_match.df$X2)
string_match.df$grp = string_match.df$X2 - 1:nrow(string_match.df)
string_match.df
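To make the subtraction trick concrete, here is a quick check against the example X2 values (a standalone snippet, not part of the original answer): each run of consecutive numbers collapses to a single value, which is what makes it usable as a grouping key.

```r
# The X2 values from the question
x2 <- c(1, 2, 3, 5, 6, 9, 10, 11, 12, 13, 16)

# Subtracting the row index turns each run of consecutive
# values into one constant group label:
x2 - seq_along(x2)
# [1] 0 0 0 1 1 3 3 3 3 3 5
```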
library(dplyr)
string_match.df %>%
  group_by(grp) %>%
  summarize(start = first(X3), stop = last(X3))
# grp start stop
# <dbl> <chr> <chr>
# 1 0 21/08/2014 22:56 21/08/2014 22:58
# 2 1 21/08/2014 23:07 21/08/2014 23:08
# 3 3 21/08/2014 23:29 21/08/2014 23:34
# 4 5 21/08/2014 23:40 21/08/2014 23:40
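Note that the start/stop values above are still character strings. A minimal sketch (using made-up two-group values, and base R's as.POSIXct rather than anything from the answer) of how to parse that "dd/mm/yyyy hh:mm" format so durations can be computed:

```r
# Hypothetical group boundaries, in the same format as the answer's output
start <- c("21/08/2014 22:56", "21/08/2014 23:07")
stop  <- c("21/08/2014 22:58", "21/08/2014 23:08")

# An explicit format string turns the text into real date-times
start_ts <- as.POSIXct(start, format = "%d/%m/%Y %H:%M", tz = "UTC")
stop_ts  <- as.POSIXct(stop,  format = "%d/%m/%Y %H:%M", tz = "UTC")

# Ordinary date arithmetic now works
mins <- as.numeric(difftime(stop_ts, start_ts, units = "mins"))
mins  # 2 1
```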
As a side note, be careful with the term "matrix". You used the matrix tag and the word matrix several times in your question, but you don't have a matrix, nor should you be using one. You have a data.frame. In a matrix, all data must be the same type; in a data frame, the columns can have different types. Here you have a numeric column, two string columns, and one datetime column, so a matrix would be a poor choice. A data frame, where each of those columns can be of the appropriate class, is much better.
I'm using a shorter name for the data, and converting df$X2 to numeric:
df <- string_match.df # as defined in OP
df$X2 <- as.numeric(df$X2)
You can split your data frame using a combination of cumsum and diff:
cumsum(diff(c(0, df$X2)) > 1)
# [1] 0 0 0 1 1 2 2 2 2 2 3
# This presumes that df$X2[1] is 1, but you can easily generalise:
# cumsum(diff(c(df$X2[1] - 1, df$X2)) > 1)
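A quick check of that general case with a small made-up vector that does not start at 1:

```r
# Hypothetical X2 values starting at 5
x2 <- c(5, 6, 9, 10)

# Prepending x2[1] - 1 makes the first diff equal 1,
# so the first row never opens a spurious new group
g <- cumsum(diff(c(x2[1] - 1, x2)) > 1)
g  # 0 0 1 1
```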
And now just use split and lapply:
groups <- split(df, cumsum(diff(c(0, df$X2)) > 1))
do.call(rbind, lapply(groups, function(x) {
  foo <- x$X3
  data.frame(start_ts = foo[1], stop_ts = tail(foo, 1))
}))
# output:
start_ts stop_ts
0 21/08/2014 22:56 21/08/2014 22:58
1 21/08/2014 23:07 21/08/2014 23:08
2 21/08/2014 23:29 21/08/2014 23:34
3 21/08/2014 23:40 21/08/2014 23:40
The rest is a question of formatting the output as you wish.
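For instance, one way to rebuild the eventised_session layout from the question is to wrap the split/lapply result in a data.frame() call with the constant columns. This sketch uses a shortened stand-in for df (two groups only) and the dummy values from the question (student_id = 4, session_id = 16, week = 4, micro_process = "TASK"):

```r
# Stand-in for df: two groups (X2 runs 1-3 and 5-6)
df <- data.frame(X2 = c(1, 2, 3, 5, 6),
                 X3 = c("22:56", "22:57", "22:58", "23:07", "23:08"),
                 stringsAsFactors = FALSE)

# Per-group first/last timestamps, as in the answer above
ranges <- do.call(rbind, lapply(split(df, cumsum(diff(c(0, df$X2)) > 1)),
                                function(x) data.frame(start_ts = x$X3[1],
                                                       stop_ts = tail(x$X3, 1))))

# Recycle the scalar dummy columns across the group rows
eventised <- data.frame(student_id = 4, session_id = 16, ranges,
                        week = 4, micro_process = "TASK",
                        row.names = NULL)
```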
Your new question can be done pretty easily in the tidyverse. The main thing you have to do is divide your observations into groups based on the timestamp variable. I assumed that the rule would be to start a new group if more than 2 minutes passed since the last observation; you can change that easily if you need to.
Once the observations are grouped, you can simply use summarize to return the results of calculations by group (in this case, the first and last timepoints):
library(dplyr)
library(lubridate)
string_match.df %>%
  select('id' = X2,                              # Select and rename variables
         'timestamp' = X3) %>%
  mutate(timestamp = dmy_hm(timestamp),          # Parse timestamp as a date
         time_diff = timestamp - lag(timestamp), # Time since the last obs.
         new_obs = (time_diff > 2) |             # New obs. if >2 min from last one
                   is.na(time_diff),             # or if it's the 1st obs.
         group_id = cumsum(new_obs)) %>%         # Count new groups for group ID
  group_by(group_id) %>%                         # Group by 'group_id'
  summarize(start_ts = min(timestamp),           # Then return the first and last
            stop_ts = max(timestamp))            # timestamps for each group
# A tibble: 4 x 3
group_id start_ts stop_ts
<int> <dttm> <dttm>
1 1 2014-08-21 22:56:00 2014-08-21 22:58:00
2 2 2014-08-21 23:07:00 2014-08-21 23:08:00
3 3 2014-08-21 23:29:00 2014-08-21 23:34:00
4 4 2014-08-21 23:40:00 2014-08-21 23:40:00
Since there was no discussion in your question about how student_id, session_id, week, and micro_process are determined, I left them out of my example. You can easily add them onto the table afterwards, or add new rules to the summarize call if they are determined by parsing data for the group.
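If the extra columns are constants, one option is to emit them straight from the summarize call alongside the group calculations. A hedged sketch on a three-row toy data frame (timestamps made up; base R's as.POSIXct used for parsing to keep dependencies small):

```r
library(dplyr)

# Toy data: two rows one minute apart, then a 10-minute gap
df <- data.frame(
  timestamp = as.POSIXct(c("21/08/2014 22:56", "21/08/2014 22:57",
                           "21/08/2014 23:07"),
                         format = "%d/%m/%Y %H:%M", tz = "UTC"))

res <- df %>%
  mutate(gap = as.numeric(difftime(timestamp, lag(timestamp),
                                   units = "mins")),
         new_obs = gap > 2 | is.na(gap),      # >2 min gap opens a group
         group_id = cumsum(new_obs)) %>%
  group_by(group_id) %>%
  summarize(student_id = 4,                   # Dummy constants from the
            session_id = 16,                  # question, emitted per group
            start_ts = min(timestamp),
            stop_ts = max(timestamp),
            week = 4,
            micro_process = "TASK")
```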