R: aggregate data per day when start and end times are available

Question

I have the following problem. I have a data frame with the following structure:

        startdatetime         enddatetime  type amount
1 2019-02-01 03:35:00 2019-02-03 06:35:00 prod1  1e+03
2 2019-02-03 06:35:00 2019-02-05 09:35:00 prod1  5e+03
3 2019-02-05 09:35:00 2019-02-06 01:35:00 prod2  3e+07
4 2019-02-06 01:35:00 2019-02-06 03:35:00 prod1  1e+02

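Each row represents the amount produced within a given time span (start datetime to end datetime). Now I want to aggregate this data per day. Let us ignore the incomplete day 2019-02-01 and start from 2019-02-02. The first batch of prod1 was produced between 2019-02-01 03:35:00 and 2019-02-03 06:35:00, for a total of 1000 kg. So on 2019-02-02, for example, 24/51*1000 ≈ 470.59 of prod1 was produced, because the run lasted roughly 21 h + 24 h + 6 h = 51 h. My solution so far is based on for and while loops, but I suspect there is a faster solution based on the "lubridate" package or something else I have not found. Any suggestions? My code is below: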

#create test data set
mydata <- data.frame(startdatetime=c(as.POSIXct("2019-02-01 03:35:00"), as.POSIXct("2019-02-03 06:35:00"),as.POSIXct("2019-02-05 09:35:00"),as.POSIXct("2019-02-06 01:35:00")),
                     enddatetime  =c(as.POSIXct("2019-02-03 06:35:00"), as.POSIXct("2019-02-05 09:35:00"),as.POSIXct("2019-02-06 01:35:00"),as.POSIXct("2019-02-06 03:35:00")),
                     type=c("prod1","prod1","prod2","prod1"),
                     amount=c(1000,5000,30000000,100)) 

# take only full days into account and ignore the first and the last day
minstartday = min(mydata$startdatetime)+24*60*60
maxendday   = max(mydata$enddatetime)-24*60*60

#create a day index
timesindex <- seq(from = as.Date(format(minstartday, format = "%Y/%m/%d")), 
                  to   = as.Date(format(maxendday, format = "%Y/%m/%d")), by = "day")

# create an empty dataframe which will be filled with the production data for each day
prodperday <- data.frame(Date=as.Date(timesindex),
                         prod1=replicate(length(timesindex),0), 
                         prod2=replicate(length(timesindex),0), 
                         stringsAsFactors=FALSE) 

# loop over all entries and separate them into produced fractions per day
for (irow in 1:dim(mydata)[1]){
  timestart = mydata[irow,"startdatetime"]
  datestart = as.Date(format(timestart, format = "%Y/%m/%d"))
  timeend = timestart
  tota_run_time_in_h = (as.numeric((mydata[irow,"enddatetime"]-mydata[irow,"startdatetime"])))*24.
  while (timeend < mydata[irow,"enddatetime"]){
    timeend = min (as.POSIXct(datestart, format = "%Y/%m/%d %H:%M:%S")+23*60*60-1,
                   mydata[irow,"enddatetime"])
    tdiff = as.numeric(timeend-timestart)
    fraction_prod = (tdiff/tota_run_time_in_h)*mydata[irow,"amount"]
    if (datestart %in% prodperday$Date){
      prodperday[prodperday$Date == datestart,as.character(mydata[irow,"type"])] = 
        prodperday[prodperday$Date == datestart,as.character(mydata[irow,"type"])] + fraction_prod
    }

    timestart = timeend+1
    datestart = as.Date(format(timestart, format = "%Y/%m/%d"))
    timeend = timestart
  }
}

Result:

        Date     prod1   prod2
1 2019-02-02  470.5828       0
2 2019-02-03 1836.5741       0
3 2019-02-04 2352.9139       0
4 2019-02-05  939.5425 1126280
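The proportional split described in the question can be checked by hand for the first record. The following is a minimal sketch in base R, assuming the timestamps are interpreted as UTC (the question does not state a time zone); the variable names are illustrative only.

```r
# Worked example for the first record only (UTC is an assumption)
start  <- as.POSIXct("2019-02-01 03:35:00", tz = "UTC")
end    <- as.POSIXct("2019-02-03 06:35:00", tz = "UTC")
amount <- 1000

# Total run time of the record in hours
total_h <- as.numeric(difftime(end, start, units = "hours"))

# Midnight boundaries of the day of interest
day_start <- as.POSIXct("2019-02-02 00:00:00", tz = "UTC")
day_end   <- day_start + 24 * 60 * 60

# Hours of the record that fall inside that day
overlap_h <- as.numeric(difftime(min(end, day_end),
                                 max(start, day_start),
                                 units = "hours"))

# Amount attributed to 2019-02-02, proportional to the overlap
share <- amount * overlap_h / total_h
print(share)
```

Here the run lasts 51 hours and the whole of 2019-02-02 lies inside it, so the share is 24/51 of the 1000 kg.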

Answer 1

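The solution I propose is not perfect because it has boundary problems, but the idea of converting the production data to hourly resolution and then aggregating by day may be a good one.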

I use two libraries:

library(lubridate)
library(dplyr)

Reference times:

ref.times <- seq(from = min(mydata$startdatetime),
           to = max(mydata$enddatetime),
           by = "hour")

Build the data set at hourly resolution:

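# One row per hour covering the whole observation window
newdata <- data.frame(hour = floor_date(ref.times, unit = "hour"),
                      prod1 = 0,
                      prod2 = 0)
# Add the day column afterwards: newdata does not exist yet
# inside its own data.frame() call
newdata$day <- floor_date(newdata$hour, unit = "day")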
for(i in 1:nrow(mydata)){
  ref.times <- seq(from = mydata$startdatetime[i],
                   to = mydata$enddatetime[i],
                   by = "hour")
  n <- length(floor_date(ref.times, "hour"))
  if(mydata[i, 3] == "prod1"){
    newdata[newdata$hour %in%  floor_date(ref.times, unit = "hour"), 2] <-
      rep(mydata[i, 4] / n, n)
  }else{
    newdata[newdata$hour %in%  floor_date(ref.times, unit = "hour"), 3] <-
      rep(mydata[i, 4] / n, n)
  }
}

Aggregate by day:

newdata %>% group_by(day) %>% summarise(prod1 = sum(prod1),
                                        prod2 = sum(prod2))

Answer 2

Here is what I would do:

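You know that on the start date production runs for 24 - starttime hours. On the end date it runs for endtime hours, and every day in between obviously gets the full 24 hours. So this is easy to calculate.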
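As a quick illustration of that rule, here is a minimal sketch for the first record of the question only (1000 kg over 51 hours, starting at 03:35 and ending at 06:35 two days later). The variable names are illustrative and the 51-hour total is taken from the question.

```r
amount      <- 1000
hours_total <- 51            # 2019-02-01 03:35 to 2019-02-03 06:35
hours_start <- 3 + 35 / 60   # start time as hours since midnight
hours_end   <- 6 + 35 / 60   # end time as hours since midnight

per_day <- c(
  first  = amount * (24 - hours_start) / hours_total,  # start date
  middle = amount * 24 / hours_total,                  # full day in between
  last   = amount * hours_end / hours_total            # end date
)

print(round(per_day, 2))
print(sum(per_day))
```

The three shares add up to the original amount, which is a useful sanity check for any implementation of this rule.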

library(lubridate)
library(tidyverse)

pmap_dfr(mydata, ~ {
  # total run time of this record in hours
  hours       <- abs(as.numeric(difftime(..1, ..2, units = "hours")))
  day_seq     <- seq(as_date(..1), as_date(..2), by = "days")
  # hours since midnight at the start and at the end of the record
  hours_start <- hour(..1) + minute(..1) / 60
  hours_end   <- hour(..2) + minute(..2) / 60

  production  <- if (length(day_seq) == 1) {
    # start and end fall on the same day: the whole amount belongs to it
    ..4
  } else {
    c(
      ..4 * (24 - hours_start) / hours,
      rep(..4 * 24 / hours, length(day_seq) - 2),
      ..4 * hours_end / hours
    )
  }
  tibble(
    day = day_seq,
    amount = production,
    type = ..3
  )
}) %>%
  group_by(day, type) %>%
  summarise(amount = sum(amount)) %>%
  spread(type, amount) %>%
  replace_na(list(prod1 = 0, prod2 = 0))


# A tibble: 6 x 3
# Groups:   day [6]
  day        prod1     prod2
  <date>     <dbl>     <dbl>
1 2019-02-01  400.        0 
2 2019-02-02  471.        0 
3 2019-02-03 1837.        0 
4 2019-02-04 2353.        0 
5 2019-02-05  940. 27031250 
6 2019-02-06  100   2968750.

If you want, you can drop the first and last entries at the end.
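For comparison, the same per-day split can also be written without packages by computing, for every record and every calendar day, how many seconds of the record fall on that day. This is only a sketch of an alternative, not the method used in either answer: `daily_production` is a hypothetical helper name, and the timestamps are assumed to be in UTC.

```r
mydata <- data.frame(
  startdatetime = as.POSIXct(c("2019-02-01 03:35:00", "2019-02-03 06:35:00",
                               "2019-02-05 09:35:00", "2019-02-06 01:35:00"),
                             tz = "UTC"),
  enddatetime   = as.POSIXct(c("2019-02-03 06:35:00", "2019-02-05 09:35:00",
                               "2019-02-06 01:35:00", "2019-02-06 03:35:00"),
                             tz = "UTC"),
  type   = c("prod1", "prod1", "prod2", "prod1"),
  amount = c(1000, 5000, 30000000, 100)
)

# Hypothetical helper: amount of each product attributed to each calendar day
daily_production <- function(df) {
  days <- seq(as.Date(min(df$startdatetime), tz = "UTC"),
              as.Date(max(df$enddatetime), tz = "UTC"), by = "day")
  day_start <- as.numeric(days) * 86400   # midnight UTC, seconds since epoch
  day_end   <- day_start + 86400
  s <- as.numeric(df$startdatetime)
  e <- as.numeric(df$enddatetime)
  # overlap[i, j]: seconds of record i that fall on day j
  overlap <- pmax(outer(e, day_end, pmin) - outer(s, day_start, pmax), 0)
  # scale each row by amount / duration (vectors recycle down the rows)
  share <- overlap / (e - s) * df$amount
  out <- data.frame(day = days)
  for (tp in unique(df$type)) {
    out[[tp]] <- colSums(share[df$type == tp, , drop = FALSE])
  }
  out
}

res <- daily_production(mydata)
# Drop the incomplete first and last day, as the question asks
print(res[-c(1, nrow(res)), ])
```

Because every record is clipped against every day, a record that starts and ends on the same day needs no special case, and the per-product totals are preserved.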

Answer 1
1 2019-09-13 08:54:04

Answer 2
1 accepted 2019-09-13 09:18:42