Select rows within a particular time range

I have a data frame like this:

TimeStamp                    Category

2013-11-02 07:57:18 AM         0
2013-11-02 08:07:19 AM         0
2013-11-02 08:07:21 AM         0
2013-11-02 08:07:25 AM         1
2013-11-02 08:07:29 AM         0
2013-11-02 08:08:18 AM         0
2013-11-02 08:09:20 AM         0
2013-11-02 09:04:18 AM         0
2013-11-02 09:05:22 AM         0
2013-11-02 09:07:18 AM         0

What I want to do is select the rows within ±10 minutes of each row where Category is 1.

For this case, because Category = 1 occurs at 2013-11-02 08:07:25 AM, I want to select all rows from 07:57:25 AM to 08:17:25 AM.

What is the best way to handle this task?

In addition, there may be multiple 1s in each time frame. (The real data frame is more complicated: it contains timestamps for multiple users, i.e. there is another column named "UserID".)

In base R, without lubridate or anything else (assuming you first convert TimeStamp to a POSIXct object):

df$TimeStamp <- as.POSIXct(df$TimeStamp, format = "%Y-%m-%d %I:%M:%S %p")
df[with(df, abs(difftime(TimeStamp[Category==1],TimeStamp,units="mins")) <= 10 ),]

#            TimeStamp Category
#2 2013-11-02 08:07:19        0
#3 2013-11-02 08:07:21        0
#4 2013-11-02 08:07:25        1
#5 2013-11-02 08:07:29        0
#6 2013-11-02 08:08:18        0
#7 2013-11-02 08:09:20        0

If you've got multiple 1s, you'd have to loop over them like this:

check <- with(df, 
  lapply(TimeStamp[Category==1], function(x) abs(difftime(x,TimeStamp,units="mins")) <= 10 ) 
)
df[do.call(pmax, check)==1,]
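The question also mentions a UserID column, which the code above does not take into account. A rough sketch of one way to apply the same check separately for each user follows. The helper name near_one, the UserID values and the timestamps are all made up for illustration, and the windows are assumed to apply only within a single user's rows.

```r
# Example data with a hypothetical UserID column (values invented for illustration)
df <- data.frame(
  UserID = c("a", "a", "a", "b", "b", "b"),
  TimeStamp = as.POSIXct(
    c("2013-11-02 07:50:00", "2013-11-02 08:07:25", "2013-11-02 08:15:00",
      "2013-11-02 08:05:00", "2013-11-02 09:00:00", "2013-11-02 09:05:00"),
    tz = "UTC"
  ),
  Category = c(0, 1, 0, 0, 1, 0)
)

# For one user's rows, flag every row within `mins` minutes of any Category == 1 row
near_one <- function(ts, category, mins = 10) {
  hits <- ts[category == 1]
  if (length(hits) == 0) return(rep(FALSE, length(ts)))
  checks <- lapply(hits, function(x) abs(difftime(x, ts, units = "mins")) <= mins)
  Reduce(`|`, checks)  # TRUE if the row falls inside at least one window
}

# Apply the check per user; ave() hands each user's row indices to FUN
# and returns the flags (coerced to 0/1) in the original row order
keep <- as.logical(ave(seq_len(nrow(df)), df$UserID, FUN = function(i) {
  near_one(df$TimeStamp[i], df$Category[i])
}))
result <- df[keep, ]
```

Note that user b's 08:05:00 row is not selected even though it lies within 10 minutes of user a's 08:07:25 row, because each user is checked only against their own Category == 1 rows.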

Here's how I would approach this using data.table::foverlaps.

First, convert TimeStamp to a proper POSIXct:

library(data.table)
setDT(df)[, TimeStamp := as.POSIXct(TimeStamp, format = "%Y-%m-%d %I:%M:%S %p")]

Then we will create a temporary data set of the rows where Category == 1 to join against. We will also create an "end" column and key by both the "start" and "end" columns:

df2 <- setkey(df[Category == 1L][, TimeStamp2 := TimeStamp], TimeStamp, TimeStamp2)

Then we will do the same for df, but with each interval extending 10 minutes (600 seconds) on either side of TimeStamp:

setkey(df[, `:=`(start = TimeStamp - 600, end = TimeStamp + 600)], start, end)

Then all that is left to do is run foverlaps and subset by the matched indices:

indx <- foverlaps(df, df2, which = TRUE, nomatch = 0L)$xid
df[indx, .(TimeStamp,  Category)]
#              TimeStamp Category
# 1: 2013-11-02 08:07:19        0
# 2: 2013-11-02 08:07:21        0
# 3: 2013-11-02 08:07:25        1
# 4: 2013-11-02 08:07:29        0
# 5: 2013-11-02 08:08:18        0
# 6: 2013-11-02 08:09:20        0

This seems to work:

Data:
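No data listing follows this heading here, so as a sketch, the ten-row example from the question can be rebuilt like this. The timestamps are kept as character strings so that the conversion step described next still applies.

```r
# Rebuild the example from the question; TimeStamp is still a character column here
df <- data.frame(
  TimeStamp = c(
    "2013-11-02 07:57:18 AM", "2013-11-02 08:07:19 AM", "2013-11-02 08:07:21 AM",
    "2013-11-02 08:07:25 AM", "2013-11-02 08:07:29 AM", "2013-11-02 08:08:18 AM",
    "2013-11-02 08:09:20 AM", "2013-11-02 09:04:18 AM", "2013-11-02 09:05:22 AM",
    "2013-11-02 09:07:18 AM"
  ),
  Category = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 0),
  stringsAsFactors = FALSE
)
```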

As per @DavidArenburg's comment (and as mentioned in his answer), the right way to convert the timestamp column into a POSIXct object (if it is not one already) is:

df$TimeStamp <- as.POSIXct(df$TimeStamp, format = "%Y-%m-%d %I:%M:%S %p")

Solution:

library(lubridate) #for minutes
library(dplyr)     #for between
pickrows <- function(df) {
  #pick category == 1 rows
  df2 <- df[df$Category==1,]
  #for each timestamp create two variables start and end
  #for +10 and -10 minutes
  #then pick rows between them
  lapply(df2$TimeStamp, function(time) {
      start <- time - minutes(10)
      end   <- time + minutes(10)
      df[between(df$TimeStamp, start, end),]
  })
} 

#run function
pickrows(df)

Output:

> pickrows(df)
[[1]]
            TimeStamp Category
2 2013-11-02 08:07:19        0
3 2013-11-02 08:07:21        0
4 2013-11-02 08:07:25        1
5 2013-11-02 08:07:29        0
6 2013-11-02 08:08:18        0
7 2013-11-02 08:09:20        0

Keep in mind that my function returns a list with one element per Category == 1 row (on this occasion it has only one element), so with multiple Category == 1 rows a do.call(rbind, pickrows(df)) will be needed to combine everything into one data.frame.
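One thing to watch when combining: if two windows overlap, the same row appears in more than one list element, and rbind keeps every copy. A small sketch in base R, using an invented four-row data frame and a hand-built list standing in for the output of pickrows():

```r
# Two overlapping windows, as pickrows() might return for two Category == 1 rows
# (timestamps and categories invented for illustration)
ts <- as.POSIXct("2013-11-02 08:00:00", tz = "UTC") + c(0, 60, 120, 180)
df <- data.frame(TimeStamp = ts, Category = c(0, 1, 1, 0))
pieces <- list(df[1:3, ], df[2:4, ])

# Stack the list elements into one data.frame
combined <- do.call(rbind, pieces)

# Rows 2 and 3 sit in both windows, so drop the repeated copies
combined <- combined[!duplicated(combined), ]
```

This deduplicates on the full row contents, so it assumes that two identical rows really are the same observation.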

Using lubridate:

library(lubridate)
df$TimeStamp <- ymd_hms(df$TimeStamp)
span10 <- (df$TimeStamp[df$Category == 1] - minutes(10)) %--% (df$TimeStamp[df$Category == 1] + minutes(10))
df[df$TimeStamp %within% span10,]
            TimeStamp Category
2 2013-11-02 08:07:19        0
3 2013-11-02 08:07:21        0
4 2013-11-02 08:07:25        1
5 2013-11-02 08:07:29        0
6 2013-11-02 08:08:18        0
7 2013-11-02 08:09:20        0

I personally like the simplicity of the base R answer from @thelatemail. But just for fun, I'll provide another answer using rolling joins in data.table, as opposed to the overlapping range joins solution provided by @DavidArenburg.

require(data.table)
dt_1 = dt[Category == 1L]
setkey(dt, TimeStamp)

ix1 = dt[.(dt_1$TimeStamp - 600L), roll=-Inf, which=TRUE] # NOCB
ix2 = dt[.(dt_1$TimeStamp + 600L), roll= Inf, which=TRUE] # LOCF

indices = data.table:::vecseq(ix1, ix2-ix1+1L, NULL) # not exported function
dt[indices]
#              TimeStamp Category
# 1: 2013-11-02 08:07:19        0
# 2: 2013-11-02 08:07:21        0
# 3: 2013-11-02 08:07:25        1
# 4: 2013-11-02 08:07:29        0
# 5: 2013-11-02 08:08:18        0
# 6: 2013-11-02 08:09:20        0

This should work just fine even if you've got more than one row where Category is 1, AFAICT. It'd be great to wrap this up as a feature for this type of operation in data.table...

PS: refer to the other posts for converting TimeStamp into POSIXct format.
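To see what the two rolling joins compute, here is a rough base R analogue using findInterval. This is a different mechanism from data.table's roll argument and is shown only for illustration. For each Category == 1 row it finds the first row at or after (time - 10 min) and the last row at or before (time + 10 min), then expands each pair into a run of row indices. The data are invented and assumed to be sorted by time.

```r
# Invented, time-sorted example: offsets in seconds from 08:00:00
ts <- as.POSIXct("2013-11-02 08:00:00", tz = "UTC") + c(0, 300, 900, 1000, 2000)
category <- c(0, 0, 1, 0, 0)
hits <- as.numeric(ts[category == 1])
secs <- as.numeric(ts)

# Count of rows strictly before (hit - 600), plus one:
# the first row at or after the window start
first <- findInterval(hits - 600, secs, left.open = TRUE) + 1L

# Count of rows at or before (hit + 600):
# the last row at or before the window end
last <- findInterval(hits + 600, secs)

# Expand each (first, last) pair into a run of indices and drop repeats
idx <- unique(unlist(Map(seq, first, last)))
ts[idx]
```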

Here is my solution with dplyr and lubridate. Here are the steps:

Find where Category == 1, then add and subtract 10 minutes using lubridate's minutes() with a simple c(-1, 1) * minutes(10), then use filter to subset on the two endpoints stored in the rang vector. Note that this assumes there is exactly one row where Category == 1.

library(lubridate)
library(dplyr)
wi1 <- which(dat$Category == 1 )
rang <- dat$TimeStamp[wi1] +  c(-1,1) * minutes(10)
dat %>% filter(TimeStamp >= rang[1] & TimeStamp <= rang[2])
            TimeStamp Category
1 2013-11-02 08:07:19        0
2 2013-11-02 08:07:21        0
3 2013-11-02 08:07:25        1
4 2013-11-02 08:07:29        0
5 2013-11-02 08:08:18        0
6 2013-11-02 08:09:20        0
