
How to compute hourly and daily means by factor level from a large dataset and derive other statistics

The dataframe df1 records the datetimes at which people went to a public toilet during a specific period (let's say between "2017-06-01" and "2017-06-30"). The column ToiletZone specifies the area where the toilet was located; it is a factor with two levels: A (a party area) or B (a residential area).

Below is a reproducible example of what I have. It contains only two days to keep the example dataset small. To create df1 I first had to create 4 separate dataframes and then bind them together (I got an error when I tried to create df1 in one go). df1 has 193 rows.

library(lubridate)  # provides ymd_hms()
options(digits.secs=3)
day_1_A<- data.frame(Datetime= ymd_hms(c("2017-06-01 00:04:17.986","2017-06-01 00:17:43.456","2017-06-01 00:22:43.456","2017-06-01 00:34:43.456","2017-06-01 00:45:43.456","2017-06-01 01:15:23.275","2017-06-01 01:41:32.609","2017-06-01 02:04:17.986","2017-06-01 02:17:43.456","2017-06-01 03:15:23.275","2017-06-01 03:41:32.609","2017-06-01 04:04:17.986","2017-06-01 04:17:43.456","2017-06-01 05:15:23.275","2017-06-01 05:41:32.609","2017-06-01 06:04:17.986","2017-06-01 06:17:43.456","2017-06-01 07:15:23.275","2017-06-01 07:41:32.609","2017-06-01 08:04:17.986","2017-06-01 08:17:43.456","2017-06-01 09:15:23.275","2017-06-01 09:41:32.609","2017-06-01 10:04:17.986","2017-06-01 10:17:43.456","2017-06-01 11:15:23.275","2017-06-01 11:41:32.609","2017-06-01 12:04:17.986","2017-06-01 12:17:43.456","2017-06-01 13:15:23.275","2017-06-01 13:41:32.609","2017-06-01 14:04:17.986","2017-06-01 14:17:43.456","2017-06-01 15:17:23.275","2017-06-01 15:41:32.609","2017-06-01 16:04:17.986","2017-06-01 16:17:43.456","2017-06-01 17:15:23.275","2017-06-01 17:41:32.609","2017-06-01 18:04:17.986","2017-06-01 18:17:43.456","2017-06-01 19:15:23.275","2017-06-01 19:41:32.609","2017-06-01 20:04:17.986","2017-06-01 20:17:43.456","2017-06-01 21:15:23.275","2017-06-01 21:41:32.609","2017-06-01 22:04:17.986","2017-06-01 22:17:43.456","2017-06-01 23:15:23.275","2017-06-01 23:41:32.609")),
                 ToiletZone = c("A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A"))

day_1_B<- data.frame(Datetime= ymd_hms(c("2017-06-01 00:04:17.986","2017-06-01 00:17:43.456","2017-06-01 01:15:23.275","2017-06-01 01:41:32.609","2017-06-01 02:04:17.986","2017-06-01 02:17:43.456","2017-06-01 03:15:23.275","2017-06-01 03:41:32.609","2017-06-01 04:04:17.986","2017-06-01 04:17:43.456","2017-06-01 05:15:23.275","2017-06-01 05:41:32.609","2017-06-01 06:04:17.986","2017-06-01 06:17:43.456","2017-06-01 07:15:23.275","2017-06-01 07:41:32.609","2017-06-01 08:04:17.986","2017-06-01 08:17:43.456","2017-06-01 09:15:23.275","2017-06-01 09:41:32.609","2017-06-01 10:04:17.986","2017-06-01 10:17:43.456","2017-06-01 11:15:23.275","2017-06-01 11:41:32.609","2017-06-01 12:04:17.986","2017-06-01 12:17:43.456","2017-06-01 13:15:23.275","2017-06-01 13:41:32.609","2017-06-01 14:04:17.986","2017-06-01 14:17:43.456","2017-06-01 15:15:23.275","2017-06-01 15:41:32.609","2017-06-01 16:04:17.986","2017-06-01 16:17:43.456","2017-06-01 17:15:23.275","2017-06-01 17:41:32.609","2017-06-01 18:04:17.986","2017-06-01 18:17:43.456","2017-06-01 19:15:23.275","2017-06-01 19:41:32.609","2017-06-01 20:04:17.986","2017-06-01 20:17:43.456","2017-06-01 21:15:23.275","2017-06-01 21:41:32.609","2017-06-01 22:04:17.986","2017-06-01 22:17:43.456","2017-06-01 23:15:23.275","2017-06-01 23:41:32.609")),
                 ToiletZone = c("B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B"))

day_2_A<- data.frame(Datetime= ymd_hms(c("2017-06-02 00:17:43.456","2017-06-02 00:48:43.456","2017-06-02 01:15:23.275","2017-06-02 01:52:23.275","2017-06-02 02:04:17.986","2017-06-02 02:17:43.456","2017-06-02 03:15:23.275","2017-06-02 03:41:32.609","2017-06-02 04:04:17.986","2017-06-02 04:17:43.456","2017-06-02 05:15:23.275","2017-06-02 05:41:32.609","2017-06-02 06:04:17.986","2017-06-02 06:17:43.456","2017-06-02 07:15:23.275","2017-06-02 07:41:32.609","2017-06-02 08:04:17.986","2017-06-02 08:17:43.456","2017-06-02 09:15:23.275","2017-06-02 09:41:32.609","2017-06-02 10:04:17.986","2017-06-02 10:17:43.456","2017-06-02 11:15:23.275","2017-06-02 11:41:32.609","2017-06-02 12:04:17.986","2017-06-02 12:17:43.456","2017-06-02 13:15:23.275","2017-06-02 13:41:32.609","2017-06-02 14:04:17.986","2017-06-02 14:17:43.456","2017-06-02 15:15:23.275","2017-06-02 15:41:32.609","2017-06-02 16:04:17.986","2017-06-02 16:17:43.456","2017-06-02 17:15:23.275","2017-06-02 17:41:32.609","2017-06-02 18:04:17.986","2017-06-02 18:17:43.456","2017-06-02 19:15:23.275","2017-06-02 19:41:32.609","2017-06-02 20:04:17.986","2017-06-02 20:17:43.456","2017-06-02 21:15:23.275","2017-06-02 21:41:32.609","2017-06-02 22:04:17.986","2017-06-02 22:17:43.456","2017-06-02 23:15:23.275","2017-06-02 23:41:32.609")),
                 ToiletZone = c("A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A"))

day_2_B<- data.frame(Datetime= ymd_hms(c("2017-06-02 00:04:17.986","2017-06-02 01:15:23.275","2017-06-02 02:04:17.986","2017-06-02 02:17:43.456","2017-06-02 03:15:23.275","2017-06-02 03:41:32.609","2017-06-02 04:04:17.986","2017-06-02 04:17:43.456","2017-06-02 05:15:23.275","2017-06-02 05:41:32.609","2017-06-02 06:04:17.986","2017-06-02 06:17:43.456","2017-06-02 07:15:23.275","2017-06-02 07:41:32.609","2017-06-02 08:04:17.986","2017-06-02 08:17:43.456","2017-06-02 09:15:23.275","2017-06-02 09:41:32.609","2017-06-02 10:04:17.986","2017-06-02 10:17:43.456","2017-06-02 11:15:23.275","2017-06-02 11:41:32.609","2017-06-02 12:04:17.986","2017-06-02 12:17:43.456","2017-06-02 13:15:23.275","2017-06-02 13:41:32.609","2017-06-02 14:04:17.986","2017-06-02 14:17:43.456","2017-06-02 15:15:23.275","2017-06-02 15:41:32.609","2017-06-02 16:04:17.986","2017-06-02 16:17:43.456","2017-06-02 17:15:23.275","2017-06-02 17:41:32.609","2017-06-02 18:04:17.986","2017-06-02 18:17:43.456","2017-06-02 19:15:23.275","2017-06-02 19:41:32.609","2017-06-02 20:04:17.986","2017-06-02 20:17:43.456","2017-06-02 21:15:23.275","2017-06-02 21:41:32.609","2017-06-02 22:04:17.986","2017-06-02 22:17:43.456","2017-06-02 23:15:23.275","2017-06-02 23:41:32.609")),
                 ToiletZone = c("B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B"))


df1<- rbind(day_1_A,day_1_B,day_2_A,day_2_B)
df1

> df1
                   Datetime ToiletZone
1   2017-06-01 00:04:17.986          A
2   2017-06-01 00:17:43.455          A
3   2017-06-01 00:22:43.455          A
4   2017-06-01 00:34:43.455          A
5   2017-06-01 00:45:43.455          A
6   2017-06-01 01:15:23.275          A
.               .                    .
.               .                    .
.               .                    .
193 2017-06-02 23:41:32.608          B

For reasons I won't explain here, I need to calculate for EACH DAY and for EACH ZONE a statistic called θ, defined as the quotient of the "average hourly number of visits to the toilet during the day" (Hourly_daily_μ) and the "average hourly number of visits for the entire period of interest" (Overall_hourly_μ).

I show in a picture what I would expect from the previous example (the columns Hourly_daily_μ_A, Hourly_daily_μ_B, Overall_hourly_μ_A and Overall_hourly_μ_B are included only to clarify the calculations; the columns I really need are θ_A and θ_B): (image missing)

Why is Hourly_daily_μ_A 51/24 on 2017-06-01? Because 51 people went to the toilet in zone A that day. Dividing by 24 gives the hourly mean number of people who went to the toilet that day.

Why is Overall_hourly_μ_A the same on every day? Because it is an overall mean for the zone: the general average number of people who go to the toilet per hour. In this example, 99 people went to the toilet in zone A between 1 June and 2 June. Dividing this by the total number of hours (48 in the example) gives the overall hourly mean for zone A. It is a single value per zone.

Why is θ_A (51*48)/(24*99) on 2017-06-01? Because it is the result of dividing Hourly_daily_μ_A (51/24) by Overall_hourly_μ_A (99/48).
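The definition above can be written compactly. The notation here is an assumption introduced for illustration and is not in the original post: N_{z,d} is the number of visits in zone z on day d, and D is the number of days in the period (D = 2 in the example).

```latex
\theta_{z,d}
  = \frac{N_{z,d}/24}{\left(\sum_{d'=1}^{D} N_{z,d'}\right)/(24D)}
  = \frac{D\,N_{z,d}}{\sum_{d'=1}^{D} N_{z,d'}},
\qquad
\theta_{A,\,\text{2017-06-01}}
  = \frac{51/24}{99/48}
  = \frac{2 \cdot 51}{99}
  = \frac{34}{33} \approx 1.0303
```

The factors of 24 cancel, so θ is simply the day's count divided by the zone's mean daily count.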

Does anyone know how to do it? My dataframe is quite large so I guess that the package data.table could be a good option.

One option is to count the visits per group and then derive the expected output from those counts:

library(dplyr)
library(tidyr)
library(lubridate)
df1 %>% 
  # floor each timestamp to its calendar day
  mutate(Date = as.Date(floor_date(Datetime, "day"))) %>% 
  # number of visits per zone and day
  count(ToiletZone, Date, name = "DailyCount") %>% 
  group_by(ToiletZone) %>% 
  # daily hourly mean divided by the zone's overall hourly mean
  mutate(Theta = (DailyCount / 24) / (sum(DailyCount) / (n() * 24))) %>% 
  ungroup() %>% 
  select(ToiletZone, Date, Theta) %>% 
  # one column per zone
  spread(ToiletZone, Theta)

I think you only need to floor your dates to a day unit and then you can use it for grouping. With data.table:

library(data.table)
library(lubridate)  # provides floor_date()

setDT(df1)

df1[, Date := floor_date(Datetime, "day")]

daily <- df1[, .(DailyCount = .N, DailyAvg = .N / 24), by = .(ToiletZone, Date)]
overall <- daily[, .(Total = sum(DailyCount) / (.N * 24)), by = .(ToiletZone)]

overall[daily, .(ToiletZone, Date, Theta = DailyAvg / Total), on = "ToiletZone"]
   ToiletZone       Date     Theta
1:          A 2017-06-01 1.0303030
2:          B 2017-06-01 1.0212766
3:          A 2017-06-02 0.9696970
4:          B 2017-06-02 0.9787234
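
For readers without data.table, the same "floor to the day, then group" idea can be sketched in base R. The toy data frame `visits` below is made up for illustration and is not the question's df1. `as.Date()` stands in for `floor_date(..., "day")`, and the helper column `n` is likewise an assumption of this sketch.

```r
# Hypothetical toy data (not df1): five visits over two days in two zones
visits <- data.frame(
  Datetime = as.POSIXct(c("2017-06-01 00:04:17", "2017-06-01 13:15:23",
                          "2017-06-01 23:41:32", "2017-06-02 00:17:43",
                          "2017-06-02 08:04:17"), tz = "UTC"),
  ToiletZone = c("A", "A", "B", "A", "B")
)

# Dropping the time of day is the base-R counterpart of flooring to "day"
visits$Date <- as.Date(visits$Datetime, tz = "UTC")

# Visits per zone and day: sum a column of ones within each group
visits$n <- 1L
daily <- aggregate(n ~ ToiletZone + Date, data = visits, FUN = sum)
names(daily)[names(daily) == "n"] <- "DailyCount"

# Overall hourly mean per zone: total visits / (days with visits * 24)
overall <- tapply(daily$DailyCount, daily$ToiletZone,
                  function(x) sum(x) / (length(x) * 24))

# Theta: the day's hourly mean divided by the zone's overall hourly mean
daily$Theta <- as.numeric(
  (daily$DailyCount / 24) / overall[as.character(daily$ToiletZone)]
)
daily
```

As in the data.table version, a day on which a zone has no visits at all does not appear in `daily`, so it is not counted in the denominator. If such days can occur in the real data, the number of days would have to be supplied explicitly.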

And hourly would be similar, just change floor_date and adjust some denominators:

df1[, Date := floor_date(Datetime, "hour")]

hourly <- df1[, .(HourlyCount = .N), by = .(ToiletZone, Date)]
overall <- hourly[, .(Total = sum(HourlyCount) / .N), by = "ToiletZone"]

ans <- overall[hourly, .(ToiletZone, Date, Theta = HourlyCount / Total), on = "ToiletZone"]

BTW, the last line of each snippet is a join. You can think of it as a left join with daily (or, in the second snippet, hourly) as the left-hand table.
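
To make the left-join reading concrete, here is a minimal base-R sketch that uses `merge()` in place of the data.table join syntax. The two small tables are typed in by hand from the example's counts (51 and 48 visits in zone A, 48 and 46 in zone B) rather than computed from df1, and the name `joined` is introduced only for this illustration.

```r
# Per-day hourly means and per-zone overall hourly means, entered by hand
daily <- data.frame(
  ToiletZone = c("A", "B", "A", "B"),
  Date = as.Date(c("2017-06-01", "2017-06-01", "2017-06-02", "2017-06-02")),
  DailyAvg = c(51, 48, 48, 46) / 24
)
overall <- data.frame(ToiletZone = c("A", "B"), Total = c(99, 94) / 48)

# all.x = TRUE keeps every row of `daily`, the left-hand table
joined <- merge(daily, overall, by = "ToiletZone", all.x = TRUE)
joined$Theta <- joined$DailyAvg / joined$Total

# merge() sorts by the key, so restore the day-by-day order
joined <- joined[order(joined$Date, joined$ToiletZone),
                 c("ToiletZone", "Date", "Theta")]
joined
```

Each row of `daily` picks up the single `Total` for its zone, which is what `overall[daily, ..., on = "ToiletZone"]` does in the answer above.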
