
Get the total number of observations per day and calculate a column

I have a large CSV file with a date-time column (yyyy-mm-dd hh:mm:ss). I need to get the total number of observations for each day and also calculate the "duration" column for every day. A sample is attached. What should I do?

The output of dput(head(dataset, 30)):

structure(list(started_at = c(
"2021-01-01 04:08:36.084000+00:00", "2021-01-01 04:21:25.006000+00:00",
"2021-01-01 04:21:54.089000+00:00", "2021-01-01 04:51:25.030000+00:00",
"2021-01-01 05:22:32.625000+00:00", "2021-01-01 05:54:48.758000+00:00",
"2021-01-01 06:01:14.836000+00:00", "2021-01-01 06:01:37.851000+00:00",
"2021-01-01 06:04:05.662000+00:00", "2021-01-01 06:11:18.277000+00:00",
"2021-01-01 06:33:30.347000+00:00", "2021-01-01 06:45:07.924000+00:00",
"2021-01-01 06:45:14.878000+00:00", "2021-01-01 06:50:13.093000+00:00",
"2021-01-01 07:02:30.784000+00:00", "2021-01-01 07:07:10.308000+00:00",
"2021-01-01 07:42:12.724000+00:00", "2021-01-01 08:01:29.792000+00:00",
"2021-01-01 08:54:05.033000+00:00", "2021-01-01 08:58:21.037000+00:00",
"2021-01-01 09:11:20.838000+00:00", "2021-01-01 09:41:59.418000+00:00",
"2021-01-01 09:44:07.865000+00:00", "2021-01-01 09:46:52.589000+00:00",
"2021-01-01 09:53:44.537000+00:00", "2021-01-01 09:59:50.803000+00:00",
"2021-01-01 10:15:04.057000+00:00", "2021-01-01 10:17:30.534000+00:00",
"2021-01-01 10:29:23.197000+00:00", "2021-01-01 10:38:02.688000+00:00"),
ended_at = c(
"2021-01-01 04:13:55.831000+00:00", "2021-01-01 04:38:11.797000+00:00",
"2021-01-01 04:38:08.703000+00:00", "2021-01-01 05:02:17.441000+00:00",
"2021-01-01 05:25:08.535000+00:00", "2021-01-01 06:04:06.018000+00:00",
"2021-01-01 06:07:59.274000+00:00", "2021-01-01 06:05:31.123000+00:00",
"2021-01-01 06:17:36.169000+00:00", "2021-01-01 06:20:06.537000+00:00",
"2021-01-01 06:35:02.616000+00:00", "2021-01-01 06:51:08.737000+00:00",
"2021-01-01 06:53:49.018000+00:00", "2021-01-01 06:58:53.912000+00:00",
"2021-01-01 07:22:37.883000+00:00", "2021-01-01 07:15:23.471000+00:00",
"2021-01-01 07:49:26.006000+00:00", "2021-01-01 08:19:02.049000+00:00",
"2021-01-01 09:03:10.272000+00:00", "2021-01-01 08:59:27.370000+00:00",
"2021-01-01 09:14:36.520000+00:00", "2021-01-01 09:54:56.635000+00:00",
"2021-01-01 09:50:53.671000+00:00", "2021-01-01 10:04:31.130000+00:00",
"2021-01-01 09:59:15.929000+00:00", "2021-01-01 10:08:24.381000+00:00",
"2021-01-01 10:31:29.582000+00:00", "2021-01-01 10:33:47.731000+00:00",
"2021-01-01 10:48:17.963000+00:00", "2021-01-01 11:16:05.789000+00:00"),
duration = c(319L, 1006L, 974L, 652L, 155L, 557L, 404L, 233L, 810L, 528L,
92L, 360L, 514L, 520L, 1207L, 493L, 433L, 1052L, 545L, 66L, 195L, 777L,
405L, 1058L, 331L, 513L, 985L, 977L, 1134L, 2283L)),
row.names = c(NA, 30L), class = "data.frame")

Three things:

  1. Your sample data is all from one day, so summarizing it per day is uninformative. I'll shift the second half of the rows (16 to 30) to the next day to show by-day aggregation.
  2. Your columns are strings; you need to convert them to POSIXt in order to bin them properly (by day or by hour).
  3. I'll work with started_at only for aggregation; if you have rows that span midnight, you may need to decide how their duration should be split between the two days.
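
For note 3, one way to find out whether the midnight question even applies to your data is to flag rows whose start and end fall on different calendar days. This is only a sketch: rides and spans_midnight are hypothetical names, and the two rows are made up for illustration rather than taken from the question.

```r
# Hypothetical two-row sample (not from the question's data):
# the second row starts before midnight and ends after it.
rides <- data.frame(
  started_at = as.POSIXct(c("2021-01-01 10:00:00", "2021-01-01 23:50:00"), tz = "UTC"),
  ended_at   = as.POSIXct(c("2021-01-01 10:05:00", "2021-01-02 00:10:00"), tz = "UTC")
)

# Compare the calendar day (in UTC) of the start and the end of each row.
rides$spans_midnight <-
  as.Date(rides$started_at, tz = "UTC") != as.Date(rides$ended_at, tz = "UTC")

# Rows that would need a decision about which day gets the duration.
rides[rides$spans_midnight, ]
```

If this returns no rows for your real data, grouping by started_at alone loses nothing.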

dplyr

library(dplyr)
dataset <- dataset %>%
  mutate(
    # note 2, string-to-POSIXct
    across(c(started_at, ended_at),
           ~ as.POSIXct(sub(":([0-9]+)$", "\\1", .), format = "%Y-%m-%d %H:%M:%OS%z", tz = "UTC")),
    # note 1, change some from day 1 to day 2; solely for demonstration here, do not use
    across(c(started_at, ended_at),
           ~ { .x[16:30] <- .x[16:30] + 86400; .x; })
  )
head(dataset)
#            started_at            ended_at duration
# 1 2021-01-01 04:08:36 2021-01-01 04:13:55      319
# 2 2021-01-01 04:21:25 2021-01-01 04:38:11     1006
# 3 2021-01-01 04:21:54 2021-01-01 04:38:08      974
# 4 2021-01-01 04:51:25 2021-01-01 05:02:17      652
# 5 2021-01-01 05:22:32 2021-01-01 05:25:08      155
# 6 2021-01-01 05:54:48 2021-01-01 06:04:06      557
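
To make the conversion step less opaque, here is the same sub() and as.POSIXct() call applied to a single timestamp from the question. The names raw, fixed and parsed are placeholders introduced for this sketch. The sub() call drops the colon from the trailing UTC offset so that it reads +0000, which is the form R documents for %z.

```r
# One timestamp copied from the question's sample data.
raw <- "2021-01-01 04:08:36.084000+00:00"

# Remove the last colon so the offset becomes +0000.
fixed <- sub(":([0-9]+)$", "\\1", raw)
fixed
# "2021-01-01 04:08:36.084000+0000"

# %OS reads the seconds including the fractional part, %z reads the offset.
parsed <- as.POSIXct(fixed, format = "%Y-%m-%d %H:%M:%OS%z", tz = "UTC")
format(parsed, "%Y-%m-%d %H:%M:%S")
# "2021-01-01 04:08:36"
```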

Then the grouping and aggregation:

days <- as.POSIXct(c("2021-01-01 00:00:00", "2021-01-03 00:00:00"), tz = "UTC")
days <- seq(days[1], days[2], by = "1 day")
dataset %>%
  mutate(day = days[ findInterval(started_at, days) ]) %>%
  group_by(day) %>%
  summarize(duration = sum(duration))
# # A tibble: 2 x 2
#   day                 duration
#   <dttm>                 <int>
# 1 2021-01-01 00:00:00     8331
# 2 2021-01-02 00:00:00    11247
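
The question also asks for the number of observations per day, which the summary above does not return. A minimal sketch of one way to get both values in the same summarize() call uses dplyr's n(); the toy data frame and the n_obs and per_day names are hypothetical and only stand in for the converted dataset.

```r
library(dplyr)

# Hypothetical stand-in for the converted dataset: three rows over two days.
toy <- data.frame(
  started_at = as.POSIXct(c("2021-01-01 04:08:36", "2021-01-01 09:11:20",
                            "2021-01-02 07:07:10"), tz = "UTC"),
  duration = c(319L, 195L, 493L)
)

# Same day boundaries as in the answer.
days <- seq(as.POSIXct("2021-01-01", tz = "UTC"),
            as.POSIXct("2021-01-03", tz = "UTC"), by = "1 day")

per_day <- toy %>%
  mutate(day = days[findInterval(started_at, days)]) %>%
  group_by(day) %>%
  # n() counts the rows in each group; sum() totals the duration.
  summarize(n_obs = n(), duration = sum(duration))
per_day
```

On the real data, only the summarize() line needs to change; the mutate() and group_by() steps stay as they are.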

base R

# note 2, string-to-POSIXct
dataset[1:2] <- lapply(dataset[1:2],
                       function(z) as.POSIXct(sub(":([0-9]+)$", "\\1", z), format = "%Y-%m-%d %H:%M:%OS%z", tz = "UTC"))
# note 1, change some from day 1 to day 2; solely for demonstration here, do not use
dataset[1:2] <- lapply(dataset[1:2],
                       function(z) { z[16:30] <- z[16:30] + 86400; z; })

Aggregation:

days <- as.POSIXct(c("2021-01-01 00:00:00", "2021-01-03 00:00:00"), tz = "UTC")
days <- seq(days[1], days[2], by = "1 day")
dataset$day <- days[ findInterval(dataset$started_at, days) ]
aggregate(duration ~ day, data = dataset, FUN = sum)
#          day duration
# 1 2021-01-01     8331
# 2 2021-01-02    11247
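
For the per-day observation count in base R, one possible approach is to run aggregate() a second time with FUN = length and merge the two results. This is a sketch under assumptions: toy, n_obs and per_day are hypothetical names, and the day column is written out by hand here (as a Date, for brevity) in place of the dataset$day computed above.

```r
# Hypothetical stand-in for the dataset after the day column has been added.
toy <- data.frame(
  day = as.Date(c("2021-01-01", "2021-01-01", "2021-01-02")),
  duration = c(319L, 195L, 493L)
)

# Number of rows per day.
counts <- aggregate(duration ~ day, data = toy, FUN = length)
names(counts)[2] <- "n_obs"

# Total duration per day, as in the answer.
totals <- aggregate(duration ~ day, data = toy, FUN = sum)

# One row per day with both the count and the total.
per_day <- merge(counts, totals, by = "day")
per_day
```

If only the counts are needed, table(dataset$day) is a shorter alternative.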
