Count instance of datetime overlap across all rows in R dataframe

Hoping someone can assist me here. I've tried searching, but nothing seems to match what I'm trying to do.

I'm trying to calculate, for each row in my data frame, how many times that row's datetime falls within the time range of any row (including its own).

I have a data frame which contains three datetime columns, all POSIXt, in the format dd/mm/yyyy HH:MM.

I'd like the result to go in a new column called "duplicates".

| Row | Start_time       | Start_time_beg   | Start_time_end   |
|-----|------------------|------------------|------------------|
| 1   | 01/01/2017 03:00 | 01/01/2017 01:30 | 01/01/2017 04:30 |
| 2   | 01/01/2017 04:00 | 01/01/2017 02:30 | 01/01/2017 05:30 |
| 3   | 01/01/2017 04:10 | 01/01/2017 02:40 | 01/01/2017 05:40 |
| 4   | 01/01/2017 05:00 | 01/01/2017 03:30 | 01/01/2017 06:30 |
| 5   | 01/01/2017 08:00 | 01/01/2017 06:30 | 01/01/2017 09:30 |

So in the above example data I'd like to count, for each row, every instance where its Start_time falls in the range Start_time_beg:Start_time_end of rows 1:n.

The results for this data would be:

| Row | Duplicates | Notes                                   |
|-----|------------|-----------------------------------------|
| 1   | 3          | overlaps with rows 1, 2, 3              |
| 2   | 4          | overlaps with rows 1, 2, 3, 4           |
| 3   | 4          | overlaps with rows 1, 2, 3, 4           |
| 4   | 3          | overlaps with rows 2, 3, 4              |
| 5   | 1          | only overlaps with itself, row 5        |

My thought was to create a seq array for each Start_time_beg:Start_time_end, then create a data frame with a count of each Start_time from that. I could then join this back onto the original df.

So far I have:

    x <- d1$Start_Time
    y <- d1$Start_Time_Beg
    z <- d1$Start_Time_End

    t <- seq(y[1], z[1], "mins")
    t2 <- seq(y[2], z[2], "mins")

    tn <- c(t, t2)

    p <- count(tn, 'tn')

This gives me the desired df (p) from the time range array. The problem is that I have tried to create a loop to generate t for all rows (the rows run into the thousands, so they can't be typed manually), but I'm having no luck:

    for (i in 1:length(d1$Start_Time))
    {seq(d1$Start_Time_Beg[c(1+i)], d1$Start_Time_End[c(1+i)], "mins")}

This just gives me an int of length = nrows, not the array of datetimes I was after.

I'm not even sure if this is the right way to go about it. I've had a bash at using dplyr, but no luck.

Any help is much appreciated. Apologies, my tables don't seem to have aligned properly.

Thanks in advance for any help.

With data.table this is essentially a one-liner:

    library(data.table)   # CRAN version 1.10.4 used
    setDT(DT)
    DT[, Duplicates := DT[DT, on = .(Start_time_beg <= Start_time, Start_time_end >= Start_time),
                          .N, by = .EACHI]$N][]

         Row          Start_time      Start_time_beg      Start_time_end Duplicates
       <int>              <POSc>              <POSc>              <POSc>      <int>
    1:     1 2017-01-01 03:00:00 2017-01-01 01:30:00 2017-01-01 04:30:00          3
    2:     2 2017-01-01 04:00:00 2017-01-01 02:30:00 2017-01-01 05:30:00          4
    3:     3 2017-01-01 04:10:00 2017-01-01 02:40:00 2017-01-01 05:40:00          4
    4:     4 2017-01-01 05:00:00 2017-01-01 03:30:00 2017-01-01 06:30:00          3
    5:     5 2017-01-01 08:00:00 2017-01-01 06:30:00 2017-01-01 09:30:00          1

Explanation

After coercion to class data.table, DT is joined with itself using a non-equi join. For each row of i, the matching rows (those whose range Start_time_beg to Start_time_end contains that row's Start_time) are counted immediately (.N) by the join parameters (grouping by each i, by = .EACHI). Finally, the counts, which come back in the row order of i, are assigned to a new column of DT.
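As a cross-check that does not depend on data.table, the counts the question asks for can be sketched in base R. This is only a minimal illustration: the data frame name `d1`, the helper `parse_dt`, and the UTC time zone are assumptions made for the example, and the row-by-row comparison is quadratic in the number of rows, so it is meant for verification rather than speed.

```r
# Hypothetical helper: parse "dd/mm/yyyy HH:MM" strings (UTC is an assumption)
parse_dt <- function(x) as.POSIXct(x, format = "%d/%m/%Y %H:%M", tz = "UTC")

# Sample data copied from the question
d1 <- data.frame(
  Row = 1:5,
  Start_time     = parse_dt(c("01/01/2017 03:00", "01/01/2017 04:00", "01/01/2017 04:10",
                              "01/01/2017 05:00", "01/01/2017 08:00")),
  Start_time_beg = parse_dt(c("01/01/2017 01:30", "01/01/2017 02:30", "01/01/2017 02:40",
                              "01/01/2017 03:30", "01/01/2017 06:30")),
  Start_time_end = parse_dt(c("01/01/2017 04:30", "01/01/2017 05:30", "01/01/2017 05:40",
                              "01/01/2017 06:30", "01/01/2017 09:30"))
)

# For row i, count the rows whose [beg, end] range contains Start_time[i]
d1$Duplicates <- vapply(
  seq_len(nrow(d1)),
  function(i) {
    t <- d1$Start_time[i]
    sum(d1$Start_time_beg <= t & t <= d1$Start_time_end)
  },
  integer(1)
)

d1$Duplicates
# 3 4 4 3 1
```

The result matches the expected Duplicates column given in the question.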

Data

    library(data.table)
    options(datatable.print.class = TRUE)

    DT <- fread(
      "|Row  | Start_time     | Start_time_beg | Start_time_end|
      |1    | 01/01/2017 03:00 | 01/01/2017 01:30 | 01/01/2017 04:30|
      |2    | 01/01/2017 04:00 | 01/01/2017 02:30 | 01/01/2017 05:30|
      |3    | 01/01/2017 04:10 | 01/01/2017 02:40 | 01/01/2017 05:40|
      |4    | 01/01/2017 05:00 | 01/01/2017 03:30 | 01/01/2017 06:30|
      |5    | 01/01/2017 08:00 | 01/01/2017 06:30 | 01/01/2017 09:30|",
      sep = "|", drop = c(1, 6))
    cols <- stringr::str_subset(names(DT), "time")
    DT[, (cols) := lapply(.SD, lubridate::dmy_hm), .SDcols = cols]
    DT

         Row          Start_time      Start_time_beg      Start_time_end
       <int>              <POSc>              <POSc>              <POSc>
    1:     1 2017-01-01 03:00:00 2017-01-01 01:30:00 2017-01-01 04:30:00
    2:     2 2017-01-01 04:00:00 2017-01-01 02:30:00 2017-01-01 05:30:00
    3:     3 2017-01-01 04:10:00 2017-01-01 02:40:00 2017-01-01 05:40:00
    4:     4 2017-01-01 05:00:00 2017-01-01 03:30:00 2017-01-01 06:30:00
    5:     5 2017-01-01 08:00:00 2017-01-01 06:30:00 2017-01-01 09:30:00
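Returning to the loop attempted in the question: a `for` loop in R discards the value of each `seq()` call unless it is stored somewhere, which is why no array of datetimes came out of it. A rough sketch of the same idea with `lapply()`, which keeps one minute-by-minute sequence per row, might look like the following. The names `d1`, `parse_dt`, `minute_seqs` and `all_minutes` are illustrative assumptions, as is the UTC time zone, and the approach only works when every Start_time lies exactly on the minute grid of the ranges, as it does in the sample data.

```r
# Hypothetical helper: parse "dd/mm/yyyy HH:MM" strings (UTC is an assumption)
parse_dt <- function(x) as.POSIXct(x, format = "%d/%m/%Y %H:%M", tz = "UTC")

# Sample data copied from the question
d1 <- data.frame(
  Row = 1:5,
  Start_time     = parse_dt(c("01/01/2017 03:00", "01/01/2017 04:00", "01/01/2017 04:10",
                              "01/01/2017 05:00", "01/01/2017 08:00")),
  Start_time_beg = parse_dt(c("01/01/2017 01:30", "01/01/2017 02:30", "01/01/2017 02:40",
                              "01/01/2017 03:30", "01/01/2017 06:30")),
  Start_time_end = parse_dt(c("01/01/2017 04:30", "01/01/2017 05:30", "01/01/2017 05:40",
                              "01/01/2017 06:30", "01/01/2017 09:30"))
)

# One minute-by-minute sequence per row, collected in a list
minute_seqs <- lapply(seq_len(nrow(d1)), function(i) {
  seq(d1$Start_time_beg[i], d1$Start_time_end[i], by = "min")
})

# Pool all sequences and compare as seconds since the epoch
all_minutes <- as.numeric(do.call(c, minute_seqs))

# Each range contributes a given minute at most once, so the number of hits
# equals the number of rows whose range contains that Start_time
d1$Duplicates <- vapply(
  as.numeric(d1$Start_time),
  function(t) sum(all_minutes == t),
  integer(1)
)

d1$Duplicates
# 3 4 4 3 1
```

Expanding every range to minutes costs memory proportional to the total range length, so for thousands of rows a join on the range bounds is likely the more economical route.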
