The dataframe df1 summarizes the datetimes at which people have gone to a public toilet during a specific period of time (let's say between "2017-06-01" and "2017-06-30"). The column ToiletZone specifies the area where the toilet was placed; it is a factor with two levels: A (a party area) or B (a residential area).
Below is a reproducible example of what I have. It contains only two days to keep the example dataset small. To create df1 I first had to create 4 separate dataframes and then bind them (I got an error when I tried to create df1 at once). df1 has 193 rows.
library(lubridate) # provides ymd_hms()
options(digits.secs=3)
day_1_A<- data.frame(Datetime= ymd_hms(c("2017-06-01 00:04:17.986","2017-06-01 00:17:43.456","2017-06-01 00:22:43.456","2017-06-01 00:34:43.456","2017-06-01 00:45:43.456","2017-06-01 01:15:23.275","2017-06-01 01:41:32.609","2017-06-01 02:04:17.986","2017-06-01 02:17:43.456","2017-06-01 03:15:23.275","2017-06-01 03:41:32.609","2017-06-01 04:04:17.986","2017-06-01 04:17:43.456","2017-06-01 05:15:23.275","2017-06-01 05:41:32.609","2017-06-01 06:04:17.986","2017-06-01 06:17:43.456","2017-06-01 07:15:23.275","2017-06-01 07:41:32.609","2017-06-01 08:04:17.986","2017-06-01 08:17:43.456","2017-06-01 09:15:23.275","2017-06-01 09:41:32.609","2017-06-01 10:04:17.986","2017-06-01 10:17:43.456","2017-06-01 11:15:23.275","2017-06-01 11:41:32.609","2017-06-01 12:04:17.986","2017-06-01 12:17:43.456","2017-06-01 13:15:23.275","2017-06-01 13:41:32.609","2017-06-01 14:04:17.986","2017-06-01 14:17:43.456","2017-06-01 15:17:23.275","2017-06-01 15:41:32.609","2017-06-01 16:04:17.986","2017-06-01 16:17:43.456","2017-06-01 17:15:23.275","2017-06-01 17:41:32.609","2017-06-01 18:04:17.986","2017-06-01 18:17:43.456","2017-06-01 19:15:23.275","2017-06-01 19:41:32.609","2017-06-01 20:04:17.986","2017-06-01 20:17:43.456","2017-06-01 21:15:23.275","2017-06-01 21:41:32.609","2017-06-01 22:04:17.986","2017-06-01 22:17:43.456","2017-06-01 23:15:23.275","2017-06-01 23:41:32.609")),
ToiletZone = c("A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A"))
day_1_B<- data.frame(Datetime= ymd_hms(c("2017-06-01 00:04:17.986","2017-06-01 00:17:43.456","2017-06-01 01:15:23.275","2017-06-01 01:41:32.609","2017-06-01 02:04:17.986","2017-06-01 02:17:43.456","2017-06-01 03:15:23.275","2017-06-01 03:41:32.609","2017-06-01 04:04:17.986","2017-06-01 04:17:43.456","2017-06-01 05:15:23.275","2017-06-01 05:41:32.609","2017-06-01 06:04:17.986","2017-06-01 06:17:43.456","2017-06-01 07:15:23.275","2017-06-01 07:41:32.609","2017-06-01 08:04:17.986","2017-06-01 08:17:43.456","2017-06-01 09:15:23.275","2017-06-01 09:41:32.609","2017-06-01 10:04:17.986","2017-06-01 10:17:43.456","2017-06-01 11:15:23.275","2017-06-01 11:41:32.609","2017-06-01 12:04:17.986","2017-06-01 12:17:43.456","2017-06-01 13:15:23.275","2017-06-01 13:41:32.609","2017-06-01 14:04:17.986","2017-06-01 14:17:43.456","2017-06-01 15:15:23.275","2017-06-01 15:41:32.609","2017-06-01 16:04:17.986","2017-06-01 16:17:43.456","2017-06-01 17:15:23.275","2017-06-01 17:41:32.609","2017-06-01 18:04:17.986","2017-06-01 18:17:43.456","2017-06-01 19:15:23.275","2017-06-01 19:41:32.609","2017-06-01 20:04:17.986","2017-06-01 20:17:43.456","2017-06-01 21:15:23.275","2017-06-01 21:41:32.609","2017-06-01 22:04:17.986","2017-06-01 22:17:43.456","2017-06-01 23:15:23.275","2017-06-01 23:41:32.609")),
ToiletZone = c("B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B"))
day_2_A<- data.frame(Datetime= ymd_hms(c("2017-06-02 00:17:43.456","2017-06-02 00:48:43.456","2017-06-02 01:15:23.275","2017-06-02 01:52:23.275","2017-06-02 02:04:17.986","2017-06-02 02:17:43.456","2017-06-02 03:15:23.275","2017-06-02 03:41:32.609","2017-06-02 04:04:17.986","2017-06-02 04:17:43.456","2017-06-02 05:15:23.275","2017-06-02 05:41:32.609","2017-06-02 06:04:17.986","2017-06-02 06:17:43.456","2017-06-02 07:15:23.275","2017-06-02 07:41:32.609","2017-06-02 08:04:17.986","2017-06-02 08:17:43.456","2017-06-02 09:15:23.275","2017-06-02 09:41:32.609","2017-06-02 10:04:17.986","2017-06-02 10:17:43.456","2017-06-02 11:15:23.275","2017-06-02 11:41:32.609","2017-06-02 12:04:17.986","2017-06-02 12:17:43.456","2017-06-02 13:15:23.275","2017-06-02 13:41:32.609","2017-06-02 14:04:17.986","2017-06-02 14:17:43.456","2017-06-02 15:15:23.275","2017-06-02 15:41:32.609","2017-06-02 16:04:17.986","2017-06-02 16:17:43.456","2017-06-02 17:15:23.275","2017-06-02 17:41:32.609","2017-06-02 18:04:17.986","2017-06-02 18:17:43.456","2017-06-02 19:15:23.275","2017-06-02 19:41:32.609","2017-06-02 20:04:17.986","2017-06-02 20:17:43.456","2017-06-02 21:15:23.275","2017-06-02 21:41:32.609","2017-06-02 22:04:17.986","2017-06-02 22:17:43.456","2017-06-02 23:15:23.275","2017-06-02 23:41:32.609")),
ToiletZone = c("A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A"))
day_2_B<- data.frame(Datetime= ymd_hms(c("2017-06-02 00:04:17.986","2017-06-02 01:15:23.275","2017-06-02 02:04:17.986","2017-06-02 02:17:43.456","2017-06-02 03:15:23.275","2017-06-02 03:41:32.609","2017-06-02 04:04:17.986","2017-06-02 04:17:43.456","2017-06-02 05:15:23.275","2017-06-02 05:41:32.609","2017-06-02 06:04:17.986","2017-06-02 06:17:43.456","2017-06-02 07:15:23.275","2017-06-02 07:41:32.609","2017-06-02 08:04:17.986","2017-06-02 08:17:43.456","2017-06-02 09:15:23.275","2017-06-02 09:41:32.609","2017-06-02 10:04:17.986","2017-06-02 10:17:43.456","2017-06-02 11:15:23.275","2017-06-02 11:41:32.609","2017-06-02 12:04:17.986","2017-06-02 12:17:43.456","2017-06-02 13:15:23.275","2017-06-02 13:41:32.609","2017-06-02 14:04:17.986","2017-06-02 14:17:43.456","2017-06-02 15:15:23.275","2017-06-02 15:41:32.609","2017-06-02 16:04:17.986","2017-06-02 16:17:43.456","2017-06-02 17:15:23.275","2017-06-02 17:41:32.609","2017-06-02 18:04:17.986","2017-06-02 18:17:43.456","2017-06-02 19:15:23.275","2017-06-02 19:41:32.609","2017-06-02 20:04:17.986","2017-06-02 20:17:43.456","2017-06-02 21:15:23.275","2017-06-02 21:41:32.609","2017-06-02 22:04:17.986","2017-06-02 22:17:43.456","2017-06-02 23:15:23.275","2017-06-02 23:41:32.609")),
ToiletZone = c("B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B"))
df1<- rbind(day_1_A,day_1_B,day_2_A,day_2_B)
df1
> df1
Datetime ToiletZone
1 2017-06-01 00:04:17.986 A
2 2017-06-01 00:17:43.455 A
3 2017-06-01 00:22:43.455 A
4 2017-06-01 00:34:43.455 A
5 2017-06-01 00:45:43.455 A
6 2017-06-01 01:15:23.275 A
. . .
. . .
. . .
193 2017-06-02 23:41:32.608 B
For reasons I won't explain here, I need to calculate, for EACH DAY and for EACH ZONE, a statistic called θ, defined as the quotient of the "average hourly number of visits to the toilet during the day" (Hourly_daily_μ) divided by the "average hourly number of visits for the entire period of interest" (Overall_hourly_μ).
I show in a picture what I would expect from the previous example (the columns Hourly_daily_μ_A, Hourly_daily_μ_B, Overall_hourly_μ_A and Overall_hourly_μ_B are included to clarify the calculations. The columns that I really need are θ_A and θ_B):
Why is Hourly_daily_μ_A 51/24 on 2017-06-01? Because 51 people went to the toilet in Zone A on this day. Hence, if we divide by 24 we get the hourly mean of people that went to the toilet this day.
Why is Overall_hourly_μ_A the same on every day? Because it is an overall mean for the zone. Here we want to know the general average number of people that go to the toilet per hour. In this example, we know that 99 people went to the toilet in Zone A between the 1st and the 2nd of June. So we divide this by the total number of hours (48 hours in the example) and we get the overall hourly mean of people that go to the toilet in Zone A. It is a single value for each zone.
Why is θ_A (51*48)/(24*99) on 2017-06-01? Because it is the result of dividing Hourly_daily_μ_A (51/24) by Overall_hourly_μ_A (99/48).
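The arithmetic above can be sketched in base R starting from the daily counts alone. This is only a check of the definition, not a solution for the full dataframe: the counts (51 and 48 visits for zone A, 48 and 46 for zone B) are the per-day row counts of the four example dataframes, and the names visits, hourly_daily_mu, overall_hourly_mu and theta are just illustrative.

```r
# Daily visit counts from the example (rows = days, columns = zones)
visits <- matrix(c(51, 48,
                   48, 46),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("2017-06-01", "2017-06-02"), c("A", "B")))

hours_per_day <- 24
n_days <- nrow(visits)

# Hourly_daily_mu: average visits per hour on each day, per zone
hourly_daily_mu <- visits / hours_per_day

# Overall_hourly_mu: average visits per hour over the whole period, per zone
overall_hourly_mu <- colSums(visits) / (n_days * hours_per_day)

# theta: divide each zone's column by that zone's overall hourly mean
theta <- sweep(hourly_daily_mu, 2, overall_hourly_mu, "/")
round(theta, 4)
# theta["2017-06-01", "A"] equals (51 * 48) / (24 * 99), about 1.0303
```

Note that the factor of 24 cancels, so θ is simply the day's count divided by the zone's mean daily count.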
Does anyone know how to do it? My dataframe is quite large, so I guess that the package data.table could be a good option.
One option is to group the data, count the visits per group, and then do some calculations to get the expected output:
library(dplyr)
library(tidyr)
library(lubridate)
df1 %>%
  # one row per zone and day, holding the number of visits
  mutate(Date = as.Date(floor_date(Datetime, "day"))) %>%
  group_by(ToiletZone, Date) %>%
  summarise(DailyCount = n()) %>%
  # per zone: daily hourly average divided by the overall hourly average
  # (assumes every day of the period has at least one visit in each zone)
  group_by(ToiletZone) %>%
  mutate(Theta = (DailyCount / 24) / (sum(DailyCount) / (n() * 24))) %>%
  ungroup() %>%
  select(-DailyCount) %>%
  spread(ToiletZone, Theta)
I think you only need to floor your dates to a day unit and then you can use that for grouping. With data.table:
library(data.table)
library(lubridate) # for floor_date()
setDT(df1)
df1[, Date := floor_date(Datetime, "day")]
daily <- df1[, .(DailyCount = .N, DailyAvg = .N / 24), by = .(ToiletZone, Date)]
overall <- daily[, .(Total = sum(DailyCount) / (.N * 24)), by = .(ToiletZone)]
overall[daily, .(ToiletZone, Date, Theta = DailyAvg / Total), on = "ToiletZone"]
ToiletZone Date Theta
1: A 2017-06-01 1.0303030
2: B 2017-06-01 1.0212766
3: A 2017-06-02 0.9696970
4: B 2017-06-02 0.9787234
And hourly would be similar; just change the floor_date unit and adjust some denominators:
df1[, Date := floor_date(Datetime, "hour")]
hourly <- df1[, .(HourlyCount = .N), by = .(ToiletZone, Date)]
overall <- hourly[, .(Total = sum(HourlyCount) / .N), by = "ToiletZone"]
ans <- overall[hourly, .(ToiletZone, Date, Theta = HourlyCount / Total), on = "ToiletZone"]
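To see why changing the unit is enough, here is a minimal sketch of what floor_date does to a single timestamp from the example data (the variable names x, by_hour and by_day are only illustrative). Every visit within the same hour, or the same day, is mapped to the same value, which is what makes the floored column usable as a grouping key.

```r
library(lubridate)

# A single timestamp taken from the example data
x <- ymd_hms("2017-06-01 15:17:23.275")

by_hour <- floor_date(x, "hour")  # start of the hour, 15:00:00 on 2017-06-01
by_day  <- floor_date(x, "day")   # start of the day, 00:00:00 on 2017-06-01

# Both comparisons are TRUE
by_hour == ymd_hms("2017-06-01 15:00:00")
by_day == ymd_hms("2017-06-01 00:00:00")
```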
BTW, the last line of each snippet (overall[daily, ...] and overall[hourly, ...]) is a join; you can think of them as left joins with, respectively, daily and hourly as the left-hand table.
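A small sketch of that join behaviour, using made-up toy tables rather than the question's data (the values and the extra zone "C" are hypothetical, added only to show what happens to an unmatched row). In X[Y, on = ...], every row of Y is kept and looked up in X:

```r
library(data.table)

# Hypothetical toy tables, not the question's data
overall <- data.table(ToiletZone = c("A", "B"), Total = c(2, 4))
daily <- data.table(ToiletZone = c("A", "A", "B", "C"),
                    DailyAvg = c(1, 3, 4, 5))

# Every row of daily is kept; Total is looked up in overall by ToiletZone.
# Zone "C" has no match in overall, so its Total is NA.
res <- overall[daily, on = "ToiletZone"]
res
```

The result has as many rows as daily, which is why the per-zone value can be divided into each per-day (or per-hour) row directly inside j.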