
R: Split observation values and aggregate them to time intervals

There are bird observations from various observation points (obs) over certain areas (name). The start and end times were recorded, and the time difference (diff_corr) was recalculated with a correction factor, so it is not simply the difftime of the start-end interval.

I now need to "split" these values into "nice" intervals (15 minutes, e.g. 10:15:00, 10:30:00, ...) and then aggregate them area-wise (name), in order to be able to plot the presence of birds on those areas in those clean 15-minute intervals.

So, to make it a little clearer: an observation might start at 10:14 and run until 10:25, so it spans the intervals 10:00-10:15 and 10:15-10:30. Its value should therefore be split and assigned to those intervals in proportion to the share of the observation that falls into each one.

In a more complicated setting, an observation might span 3 or 4 intervals, so the value has to be split accordingly there as well.

The last step would be to aggregate all observation parts per interval and plot them.

I have already searched for solutions for some days, but only found very simplistic examples where intervals were rearranged with cut and breaks. None of them showed what to do with associated values; they only produced simple frequency counts.

Example data:

structure(list(obs = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("b", 
"C2", "Dürnberg2"), class = "factor"), name = c("C2", "C2", 
"C2", "C2", "C2", "C2", "C2", "C2", "C2", "b", "981", "1627", 
"b", "b", "981", "1627", "b", "b", "b", "b"), start = structure(c(1495441500, 
1495441590, 1495441650, 1495441680, 1495447380, 1495447410, 1495447530, 
1495447560, 1495447580, 1496996580, 1496996580, 1496996580, 1496996760, 
1496996820, 1496996820, 1496996820, 1496997180, 1496997300, 1496997420, 
1496998260), class = c("POSIXct", "POSIXt"), tzone = ""), end = structure(c(1495441590, 
1495441650, 1495441680, 1495441800, 1495447410, 1495447530, 1495447560, 
1495447580, 1495447620, 1496996760, 1496996760, 1496996760, 1496996820, 
1496997180, 1496997180, 1496997180, 1496997300, 1496997420, 1496997540, 
1496998320), class = c("POSIXct", "POSIXt"), tzone = ""), diff_corr = c(1.46739130434783, 
0.978260869565217, 0.489130434782609, 1.95652173913043, 0.489130434782609, 
1.95652173913043, 0.489130434782609, 0.326086956521739, 0.652173913043478, 
2.96703296703297, 2.96703296703297, 2.96703296703297, 0.989010989010989, 
5.93406593406593, 5.93406593406593, 5.93406593406593, 1.97802197802198, 
1.97802197802198, 1.97802197802198, 0.989010989010989)), .Names = c("obs", 
"name", "start", "end", "diff_corr"), row.names = c("1", "9", 
"7", "8", "3", "2", "4", "5", "6", "13", "13.1", "13.2", "22", 
"11", "11.1", "11.2", "12", "23", "15", "16"), class = "data.frame")

PS: I have real difficulty naming my question properly, so any hints (not only on that) are highly appreciated.

New attempt at a small example: assign each value to the intervals in proportion to its overlap with them (and later sum up the parts that fall into the same interval).

start         end         value     new values in new 15-min-intervals
10:03:00      10:14:00    11        ---> 10:00:00 =  11
10:14:00      10:16:00     2        ---> 10:00:00 = 1 ; 10:15:00 = 1
10:00:00      10:35:00    40        ---> 10:00:00 = 40/35*15 ; 10:15:00 = 40/35*15 ; 10:30:00 = 40/35*5
10:15:00      10:30:00    12        ---> 10:15:00 = 12
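
The proportional split in this small example can be sketched in base R as follows. This is only an illustration under stated assumptions: split_to_bins is a hypothetical helper (it does not come from the question or the answers), every observation has end > start, the date 2017-05-22 and the UTC time zone are arbitrary choices, and bins are aligned to multiples of 15 minutes counted from the epoch.

```r
# Hypothetical helper: split one observation's value across fixed-width bins
# in proportion to the seconds the observation spends in each bin.
# Assumes end > start; start and end are given as seconds since the epoch.
split_to_bins <- function(start, end, value, width = 15 * 60) {
  s <- as.numeric(start)
  e <- as.numeric(end)
  # Left edges of all bins from the one containing `start` up to `end`
  left <- seq(floor(s / width) * width, e, by = width)
  # Seconds of the observation that fall inside each bin
  overlap <- pmin(e, left + width) - pmax(s, left)
  keep <- overlap > 0
  data.frame(
    bin   = as.POSIXct(left[keep], origin = "1970-01-01", tz = "UTC"),
    value = value * overlap[keep] / (e - s)
  )
}

# The four rows of the small example (date and time zone are assumptions)
ex <- data.frame(
  start = as.POSIXct(c("2017-05-22 10:03:00", "2017-05-22 10:14:00",
                       "2017-05-22 10:00:00", "2017-05-22 10:15:00"), tz = "UTC"),
  end   = as.POSIXct(c("2017-05-22 10:14:00", "2017-05-22 10:16:00",
                       "2017-05-22 10:35:00", "2017-05-22 10:30:00"), tz = "UTC"),
  value = c(11, 2, 40, 12)
)

# Split every observation, then stack the pieces into one data frame
pieces <- do.call(rbind, Map(split_to_bins,
                             as.numeric(ex$start), as.numeric(ex$end), ex$value))

# Sum the pieces per 15-minute bin (labelled by the bin's start time)
per_bin <- tapply(pieces$value, format(pieces$bin, "%H:%M", tz = "UTC"), sum)
print(per_bin)
```

With the real data, the same helper could be applied per row and the pieces summed by both name and bin; the splitting never changes the total, so sum(pieces$value) equals sum(ex$value).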

This is slow and clunky, but maybe it's helpful. It calculates counts and weighted diff_corr sums by name and 15-minute interval:

library(dplyr)

# df is the data frame from the question (the dput() output above).
width <- 15 * 60  # interval width in seconds

# Interval edges every 15 minutes, padded by one interval on each side.
# They start at min(df$start) - 15 min, so they are not snapped to the clock.
range <- seq.POSIXt(min(df$start) - width, max(df$end) + width, by = "15 min")

# Length of each observation in seconds
df$totalDuration <- as.numeric(difftime(df$end, df$start, units = "secs"))

out <- NULL
for (r in seq_along(range)) {
  lo <- range[r] - width  # left edge of the current interval
  hi <- range[r]          # right edge; also used as the bin label
  part <- df %>%
    # keep observations that start in, end in, or span across the interval
    filter((start >= lo & start < hi) |
             (end >= lo & end < hi) |
             (end > hi & start < hi)) %>%
    # seconds of each observation that fall inside the interval
    mutate(bin = hi,
           duration = ifelse(start >= lo & end < hi, totalDuration,
                      ifelse(start >= lo, as.numeric(difftime(hi, start, units = "secs")),
                      ifelse(end < hi, as.numeric(difftime(end, lo, units = "secs")),
                             width)))) %>%
    # weight diff_corr by the share of the observation inside the interval
    mutate(diff_corr_W = diff_corr * (duration / totalDuration)) %>%
    group_by(bin, name) %>%
    summarise(count = n(), diff_corr_sum = sum(diff_corr_W)) %>%
    ungroup()

  out <- rbind(out, part)
}


> out
# A tibble: 9 x 4
  bin                 name  count diff_corr_sum
  <dttm>              <chr> <int>         <dbl>
1 2017-05-22 04:40:00 C2        4      4.891304
2 2017-05-22 06:10:00 C2        5      3.913043
3 2017-06-09 04:25:00 1627      1      1.978022
4 2017-06-09 04:25:00 981       1      1.978022
5 2017-06-09 04:25:00 b         1      1.978022
6 2017-06-09 04:40:00 1627      2      6.923077
7 2017-06-09 04:40:00 981       2      6.923077
8 2017-06-09 04:40:00 b         6     13.846154
9 2017-06-09 04:55:00 b         1      0.989011

Here's a data.table approach, which lets you use SQL-like queries to sort and filter data and perform operations on it.

DATA

> p
    obs name               start                 end diff_corr
 1:  C2   C2 2017-05-22 04:25:00 2017-05-22 04:26:30 1.4673913
 2:  C2   C2 2017-05-22 04:26:30 2017-05-22 04:27:30 0.9782609
 3:  C2   C2 2017-05-22 04:27:30 2017-05-22 04:28:00 0.4891304
 4:  C2   C2 2017-05-22 04:28:00 2017-05-22 04:30:00 1.9565217
 5:  C2   C2 2017-05-22 06:03:00 2017-05-22 06:03:30 0.4891304
 6:  C2   C2 2017-05-22 06:03:30 2017-05-22 06:05:30 1.9565217
 7:  C2   C2 2017-05-22 06:05:30 2017-05-22 06:06:00 0.4891304
 8:  C2   C2 2017-05-22 06:06:00 2017-05-22 06:06:20 0.3260870
 9:  C2   C2 2017-05-22 06:06:20 2017-05-22 06:07:00 0.6521739
10:   b    b 2017-06-09 04:23:00 2017-06-09 04:26:00 2.9670330
11:   b  981 2017-06-09 04:23:00 2017-06-09 04:26:00 2.9670330
12:   b 1627 2017-06-09 04:23:00 2017-06-09 04:26:00 2.9670330
13:   b    b 2017-06-09 04:26:00 2017-06-09 04:27:00 0.9890110
14:   b    b 2017-06-09 04:27:00 2017-06-09 04:33:00 5.9340659
15:   b  981 2017-06-09 04:27:00 2017-06-09 04:33:00 5.9340659
16:   b 1627 2017-06-09 04:27:00 2017-06-09 04:33:00 5.9340659
17:   b    b 2017-06-09 04:33:00 2017-06-09 04:35:00 1.9780220
18:   b    b 2017-06-09 04:35:00 2017-06-09 04:37:00 1.9780220
19:   b    b 2017-06-09 04:37:00 2017-06-09 04:39:00 1.9780220
20:   b    b 2017-06-09 04:51:00 2017-06-09 04:52:00 0.9890110

CODE

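library(data.table)
library(lubridate)

# p holds the sample data from the question
p <- as.data.table(p)

# Round each start time to the nearest 15 minutes, then average diff_corr per group
p[, .(new_diff = mean(diff_corr)), .(tme_start = round_date(start, unit = "15min"))]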

OUTPUT

> p[, .(new_diff = mean(diff_corr)), .(tme_start = round_date(start, unit = "15min"))]
             tme_start  new_diff
1: 2017-05-22 04:30:00 1.2228261
2: 2017-05-22 06:00:00 0.7826087
3: 2017-06-09 04:30:00 3.3626374
4: 2017-06-09 04:45:00 0.9890110

What is data.table doing?

Since you aren't familiar with data.table, here's a very quick, elementary description of what is happening. The general form of a data.table call is:

DT[select rows, perform operations, group by] 

Where DT is the data.table name. "Select rows" is a logical condition: say you want only the observations for C2 (name), the call would be DT[name == "C2",]. No operation is performed and there is no grouping. If you want the sum of the diff_corr column over all rows with name == "C2", the call becomes DT[name == "C2", list(sum(diff_corr))]. Instead of writing list() you can use .(). The output will now have only one row and one column, called V1, which is the sum of all diff_corr where name == "C2". That column name carries little information, so we assign it a name (it can be the same as the old one): DT[name == "C2", .(diff_corr_sum = sum(diff_corr))].

Suppose you had another column called "mood", which reported the mood of the person making the observation and can take three values ("happy", "sad", "sleepy"). You could "group by" the mood: DT[name == "C2", .(diff_corr_new = sum(diff_corr)), by = .(mood)]. The output would be three rows, one per mood, with the grouping column mood and the computed column diff_corr_new.

To understand this better, try playing around with a sample dataset like mtcars. Your sample data doesn't have enough complexity to let you explore all of these functions.
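
As a rough illustration of the DT[i, j, by] form on mtcars (the choice of the cyl, hp and gear columns is arbitrary and only meant as a sketch):

```r
library(data.table)

DT <- as.data.table(mtcars)

# i only: select rows (cars with 4 cylinders), no operation, no grouping
four_cyl <- DT[cyl == 4]

# i and j: one aggregated value, returned as a one-row data.table
total_hp <- DT[cyl == 4, .(hp_sum = sum(hp))]

# i, j and by: one row per group, with the grouping column first
by_gear <- DT[cyl == 4, .(hp_sum = sum(hp)), by = .(gear)]

print(total_hp)
print(by_gear)
```

The grouped result contains the gear column plus hp_sum, and its hp_sum values add up to the single value in total_hp.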

Back to the answer - other variations

It's not clear from the question or comments whether you want to round based on start or end. I used the former, but you can change that. The example above uses mean, but you can perform any other operation you need. The other columns seem more or less redundant since they are strings and you can't do much arithmetic with them. You could, however, use them to break the results down further in the by entry (the last field in the call). Below are two examples using obs and name respectively. You can also combine all of them.

> p[, .(new_diff = mean(diff_corr)), .(tme_start = round_date(start, unit = "15min"), obs)]
             tme_start obs  new_diff
1: 2017-05-22 04:30:00  C2 1.2228261
2: 2017-05-22 06:00:00  C2 0.7826087
3: 2017-06-09 04:30:00   b 3.3626374
4: 2017-06-09 04:45:00   b 0.9890110


> p[, .(new_diff = mean(diff_corr)), .(tme_start = round_date(start, unit = "15min"), name)]
             tme_start name  new_diff
1: 2017-05-22 04:30:00   C2 1.2228261
2: 2017-05-22 06:00:00   C2 0.7826087
3: 2017-06-09 04:30:00    b 2.6373626
4: 2017-06-09 04:30:00  981 4.4505495
5: 2017-06-09 04:30:00 1627 4.4505495
6: 2017-06-09 04:45:00    b 0.9890110
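
One detail of the rounding used above: round_date() rounds to the nearest quarter-hour, which is why a start of 04:25 is labelled 04:30. If the label should instead be the start of the 15-minute interval that contains the time stamp, lubridate's floor_date() may be closer to what is wanted. A small sketch, using three start times from the sample data and assuming the UTC time zone for reproducibility:

```r
library(lubridate)

# Three start times from the sample data (time zone is an assumption)
x <- as.POSIXct(c("2017-05-22 04:25:00", "2017-05-22 06:03:00",
                  "2017-06-09 04:23:00"), tz = "UTC")

nearest <- round_date(x, unit = "15 minutes")  # nearest quarter-hour
lower   <- floor_date(x, unit = "15 minutes")  # start of the interval containing x

print(data.frame(x = x, nearest = nearest, lower = lower))
```

Either function can be dropped into the by entry of the data.table call; neither of them splits an observation across intervals.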
