
R: aggregate per-second data to minutes more efficiently

I have a data.table, allData, containing data for roughly every (POSIXct) second from different nights. Some nights, however, fall on the same date, since the data is collected from different people, so I have a column nightNo as an id for each distinct night.

          timestamp  nightNo    data1     data2
2018-10-19 19:15:00        1        1         7
2018-10-19 19:15:01        1        2         8
2018-10-19 19:15:02        1        3         9
2018-10-19 18:10:22        2        4        10
2018-10-19 18:10:23        2        5        11 
2018-10-19 18:10:24        2        6        12

I'd like to aggregate the data to minutes (per night), and based on this question I've come up with the following code:

library(data.table)
library(dplyr)

aggregate_minute <- function(df){
  df %>%
    group_by(timestamp = cut(timestamp, breaks = "1 min")) %>%
    summarise(data1 = mean(data1), data2 = mean(data2)) %>%
    as.data.table()
}

allData <- allData[, aggregate_minute(allData), by=nightNo]

However, my data.table is quite large and this code isn't fast enough. Is there a more efficient way to solve this problem?

allData <- data.table(timestamp = c(rep(Sys.time(), 3), rep(Sys.time() + 320, 3)), 
                     nightNo = rep(1:2, c(3, 3)),
                     data1 = 1:6,
                     data2  = 7:12)
                 timestamp nightNo data1 data2
1: 2018-06-14 10:43:11       1     1     7
2: 2018-06-14 10:43:11       1     2     8
3: 2018-06-14 10:43:11       1     3     9
4: 2018-06-14 10:48:31       2     4    10
5: 2018-06-14 10:48:31       2     5    11
6: 2018-06-14 10:48:31       2     6    12


allData[, .(data1 = mean(data1), data2 = mean(data2)), by = .(nightNo, timestamp = cut(timestamp, breaks= "1 min"))]
       nightNo           timestamp data1 data2
1:       1 2018-06-14 10:43:00     2     8
2:       2 2018-06-14 10:48:00     5    11

> system.time(replicate(500, allData[, aggregate_minute(allData), by=nightNo]))
    user  system elapsed 
    3.25    0.02    3.31 

> system.time(replicate(500, allData[, .(data1 = mean(data1), data2 = mean(data2)), by = .(nightNo, timestamp = cut(timestamp, breaks= "1 min"))]))
     user  system elapsed 
     1.02    0.04    1.06 
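If there are many data columns, the same per-minute aggregation can be written with lapply(.SD, mean), so that every column listed in .SDcols is averaged without naming each one in j. A minimal sketch reusing the example table from above (the .SDcols generalization is my addition, not from the original answer):

```r
library(data.table)

# Example table from the question: two nights, three seconds each
allData <- data.table(timestamp = c(rep(Sys.time(), 3), rep(Sys.time() + 320, 3)),
                      nightNo = rep(1:2, each = 3),
                      data1 = 1:6,
                      data2 = 7:12)

# Average every column in .SDcols, per night and per minute
result <- allData[, lapply(.SD, mean),
                  by = .(nightNo, timestamp = cut(timestamp, breaks = "1 min")),
                  .SDcols = c("data1", "data2")]
```

This produces the same two aggregated rows as the explicit mean() calls, and scales to any number of data columns by extending .SDcols.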

You can use lubridate to 'round' the dates and then use data.table to aggregate the columns.

library(data.table)  
library(lubridate)

Reproducible data:

text <- "timestamp  nightNo    data1     data2
'2018-10-19 19:15:00'        1        1         7
'2018-10-19 19:15:01'        1        2         8
'2018-10-19 19:15:02'        1        3         9
'2018-10-19 18:10:22'        2        4        10
'2018-10-19 18:10:23'        2        5        11 
'2018-10-19 18:10:24'        2        6        12"


allData <- read.table(text = text, header = TRUE, stringsAsFactors = FALSE)

Convert the data frame to a data.table:

setDT(allData)

Parse the timestamp column and floor it to the nearest minute:

allData[, timestamp := floor_date(ymd_hms(timestamp), "minutes")]
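For illustration (a made-up timestamp, not from the post), floor_date() truncates each time to the start of its minute:

```r
library(lubridate)

x <- ymd_hms("2018-10-19 19:15:42")   # parsed in UTC by default
floor_date(x, "minutes")              # 2018-10-19 19:15:00 UTC
```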

Convert the integer data columns to numeric, since := will overwrite them in place with double-valued means:

allData[, ':='(data1 = as.numeric(data1), 
               data2 = as.numeric(data2))]

Replace the data columns with their means by nightNo group:

allData[, ':='(data1 = mean(data1), 
               data2 = mean(data2)),
        by = nightNo]

The result is:

             timestamp nightNo data1 data2
1: 2018-10-19 19:15:00       1     2     8
2: 2018-10-19 19:15:00       1     2     8
3: 2018-10-19 19:15:00       1     2     8
4: 2018-10-19 18:10:00       2     5    11
5: 2018-10-19 18:10:00       2     5    11
6: 2018-10-19 18:10:00       2     5    11
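Note that the := approach keeps all six per-second rows, with the means repeated within each group. If one aggregated row per night and minute is wanted instead, a summarising j with a combined by collapses the table (a small variant, not part of the original answer):

```r
library(data.table)
library(lubridate)

# Reproducible data from the answer above
text <- "timestamp  nightNo    data1     data2
'2018-10-19 19:15:00'        1        1         7
'2018-10-19 19:15:01'        1        2         8
'2018-10-19 19:15:02'        1        3         9
'2018-10-19 18:10:22'        2        4        10
'2018-10-19 18:10:23'        2        5        11
'2018-10-19 18:10:24'        2        6        12"

allData <- read.table(text = text, header = TRUE, stringsAsFactors = FALSE)
setDT(allData)
allData[, timestamp := floor_date(ymd_hms(timestamp), "minutes")]

# One row per nightNo/minute instead of replacing values in place
result <- allData[, .(data1 = mean(data1), data2 = mean(data2)),
                  by = .(nightNo, timestamp)]
```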
