Summarize data within multiple groups of a time series

Question

I have a series of observations of birds at different locations and times. The data frame looks like this:

birdID   site          ts
1       A          2013-04-15 09:29
1       A          2013-04-19 01:22
1       A          2013-04-20 23:13
1       A          2013-04-22 00:03
1       B          2013-04-22 14:02
1       B          2013-04-22 17:02
1       C          2013-04-22 14:04
1       C          2013-04-22 15:18
1       C          2013-04-23 00:54
1       A          2013-04-23 01:20
1       A          2013-04-24 23:07
1       A          2013-04-30 23:47
1       B          2013-04-30 03:51
1       B          2013-04-30 04:26
2       C          2013-04-30 04:29
2       C          2013-04-30 18:49
2       A          2013-05-01 01:03
2       A          2013-05-01 23:15
2       A          2013-05-02 00:09
2       C          2013-05-03 07:57
2       C          2013-05-04 07:21
2       C          2013-05-05 02:54
2       A          2013-05-05 03:27
2       A          2013-05-14 00:16
2       D          2013-05-14 10:00
2       D          2013-05-14 15:00

I would like to summarize the data so that it shows the first and last detection of each bird at each site, and the duration at each site, while preserving information about multiple visits to sites (i.e., if a bird went from site A > B > C > A > B, I would like to show each visit to sites A and B independently, not lump both visits together).

I am hoping to produce output like this, where the start (min_ts), end (max_ts), and duration (days) of each visit are preserved:

birdID  site      min_ts                max_ts          days
1      A      2013-04-15 09:29    2013-04-22 00:03  6.6
1      B      2013-04-22 14:02    2013-04-22 17:02  0.1
1      C      2013-04-22 14:04    2013-04-23 00:54  0.5
1      A      2013-04-23 01:20    2013-04-30 23:47  7.9
1      B      2013-04-30 03:51    2013-04-30 04:26  0.02
2      C      2013-04-30 04:29    2013-04-30 18:49  0.6
2      A      2013-05-01 01:03    2013-05-02 00:09  0.96
2      C      2013-05-03 07:57    2013-05-05 02:54  1.8
2      A      2013-05-05 03:27    2013-05-14 00:16  8.8
2      D      2013-05-14 10:00    2013-05-14 15:00  0.2

I have tried this code, which yields the correct variables but lumps all the information about a single site together instead of preserving multiple visits:

df <- df %>%
  group_by(birdID, site) %>%
  summarise(min_ts = min(ts),
            max_ts = max(ts),
            days = difftime(max_ts, min_ts, units = "days")) %>%
  arrange(birdID, min_ts)

birdID  site    min_ts               max_ts            days
1   A   2013-04-15 09:29   2013-04-30 23:47    15.6
1   B   2013-04-22 14:02   2013-04-30 4:26     7.6
1   C   2013-04-22 14:04   2013-04-23 0:54     0.5
2   C   2013-04-30 04:29   2013-05-05 2:54     4.9
2   A   2013-05-01 01:03   2013-05-14 0:16     12.9
2   D   2013-05-14 10:00   2013-05-14 15:00    0.2

I realize that grouping by site is the problem, but if I remove it as a grouping variable, the data are summarised without site information. I have also tried the following. It doesn't run, but I feel it's close to the solution:

df <- df %>% 
   group_by(birdID) %>% 
   summarize(min_ts = if_else((birdID == lag(birdID) & site != lag(site)), min(ts), NA_real_), 
             max_ts = if_else((birdID == lag(birdID) & site != lag(site)), max(ts), NA_real_), 
            min_d = min(yday(ts)),
            max_d = max(yday(ts)),
            days = max_d - min_d))

Answer 1

One possibility could be:

library(dplyr)

df %>%
 group_by(birdID, site, rleid = with(rle(site), rep(seq_along(lengths), lengths))) %>%
 summarise(min_ts = min(ts),
           max_ts = max(ts),
           days = difftime(max_ts, min_ts, units = "days")) %>%
 ungroup() %>%
 select(-rleid) %>%
 arrange(birdID, min_ts)

   birdID site  min_ts              max_ts              days           
    <int> <chr> <dttm>              <dttm>              <drtn>         
 1      1 A     2013-04-15 09:29:00 2013-04-22 00:03:00 6.60694444 days
 2      1 B     2013-04-22 14:02:00 2013-04-22 17:02:00 0.12500000 days
 3      1 C     2013-04-22 14:04:00 2013-04-23 00:54:00 0.45138889 days
 4      1 A     2013-04-23 01:20:00 2013-04-30 23:47:00 7.93541667 days
 5      1 B     2013-04-30 03:51:00 2013-04-30 04:26:00 0.02430556 days
 6      2 C     2013-04-30 04:29:00 2013-04-30 18:49:00 0.59722222 days
 7      2 A     2013-05-01 01:03:00 2013-05-02 00:09:00 0.96250000 days
 8      2 C     2013-05-03 07:57:00 2013-05-05 02:54:00 1.78958333 days
 9      2 A     2013-05-05 03:27:00 2013-05-14 00:16:00 8.86736111 days
10      2 D     2013-05-14 10:00:00 2013-05-14 15:00:00 0.20833333 days

Here it creates a rleid()-like grouping variable (a run ID that increments each time site changes between consecutive rows) and then calculates the difference between the last and first timestamp within each run.
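To see what that grouping variable looks like, here is a minimal base R sketch on a made-up site sequence (the vector below is illustrative, not the question's data). It only shows how rle() turns consecutive repeats into run IDs:

```r
# Toy site sequence following the A > B > C > A > B pattern from the question
site <- c("A", "A", "B", "B", "C", "A", "A", "B")

# rle() compresses consecutive repeats into values and run lengths
runs <- rle(site)

# Give each run its own integer ID and repeat it for every row in that run
run_id <- rep(seq_along(runs$lengths), runs$lengths)

# The second visit to A gets a different ID (4) than the first visit (1)
print(data.frame(site, run_id))
```

Because the ID changes whenever the site changes, grouping by it keeps repeat visits to the same site apart.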

Or the same using rleid() from data.table explicitly:

library(dplyr)
library(data.table)  # provides rleid()

df %>%
 group_by(birdID, site, rleid = rleid(site)) %>%
 summarise(min_ts = min(ts),
           max_ts = max(ts),
           days = difftime(max_ts, min_ts, units = "days")) %>%
 ungroup() %>%
 select(-rleid) %>%
 arrange(birdID, min_ts)
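For anyone who wants to try this without the full data set, the following is a small self-contained sketch. It makes several assumptions that the question does not state: ts is parsed as POSIXct in UTC, site is a character column (rle() rejects factors, so wrap it in as.character() if needed), and dplyr is at least version 1.0 so that the .groups argument is available. The six rows are a subset of bird 1's detections.

```r
library(dplyr)

# Subset of the question's data for bird 1; the UTC time zone is an assumption
df <- tibble(
  birdID = rep(1L, 6),
  site   = c("A", "A", "B", "B", "A", "A"),
  ts     = as.POSIXct(
    c("2013-04-15 09:29", "2013-04-22 00:03",
      "2013-04-22 14:02", "2013-04-22 17:02",
      "2013-04-23 01:20", "2013-04-30 23:47"),
    format = "%Y-%m-%d %H:%M", tz = "UTC"
  )
)

visits <- df %>%
  # run: an ID that changes whenever site changes between consecutive rows
  group_by(birdID, site,
           run = with(rle(site), rep(seq_along(lengths), lengths))) %>%
  summarise(min_ts = min(ts),
            max_ts = max(ts),
            # plain numeric days instead of a difftime, for easier rounding
            days   = as.numeric(difftime(max_ts, min_ts, units = "days")),
            .groups = "drop") %>%
  select(-run) %>%
  arrange(birdID, min_ts)

print(visits)
```

The two visits to site A come out as separate rows, with the single visit to site B between them.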

Answer 2

Another alternative is to use lag() and cumsum() to create a grouping variable.

library(dplyr)

df %>%
  group_by(birdID, group = cumsum(site != lag(site, default = first(site)))) %>%
  summarise(site = first(site),  # keep the site label for each visit
            min_ts = min(ts),
            max_ts = max(ts),
            days = difftime(max_ts, min_ts, units = "days")) %>%
  ungroup() %>%
  select(-group)

# A tibble: 10 x 5
#   birdID site  min_ts              max_ts              days           
#    <int> <chr> <dttm>              <dttm>              <drtn>         
# 1      1 A     2013-04-15 09:29:00 2013-04-22 00:03:00 6.60694444 days
# 2      1 B     2013-04-22 14:02:00 2013-04-22 17:02:00 0.12500000 days
# 3      1 C     2013-04-22 14:04:00 2013-04-23 00:54:00 0.45138889 days
# 4      1 A     2013-04-23 01:20:00 2013-04-30 23:47:00 7.93541667 days
# 5      1 B     2013-04-30 03:51:00 2013-04-30 04:26:00 0.02430556 days
# 6      2 C     2013-04-30 04:29:00 2013-04-30 18:49:00 0.59722222 days
# 7      2 A     2013-05-01 01:03:00 2013-05-02 00:09:00 0.96250000 days
# 8      2 C     2013-05-03 07:57:00 2013-05-05 02:54:00 1.78958333 days
# 9      2 A     2013-05-05 03:27:00 2013-05-14 00:16:00 8.86736111 days
#10      2 D     2013-05-14 10:00:00 2013-05-14 15:00:00 0.20833333 days
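
As a rough illustration of how the lag() and cumsum() combination builds its grouping variable, here is a sketch on a made-up site vector (not the question's data). The names changed and group are only for illustration:

```r
library(dplyr)

# Toy site sequence following the A > B > C > A > B pattern from the question
site <- c("A", "A", "B", "B", "C", "A", "A", "B")

# TRUE wherever the site differs from the previous row;
# the default makes the first comparison FALSE
changed <- site != lag(site, default = site[1])

# Running count of changes: every run of identical sites shares one number
group <- cumsum(changed)

print(data.frame(site, changed, group))
```

The counter starts at 0 and increases by one at each change of site, so it plays the same role as the run ID from rle() or rleid().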
