How to extract from a large dataset hourly daily means by level from a factor variable and estimate other statistics
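The data frame df1 records the datetimes at which people entered public toilets during a given period (for example, between "2017-06-01" and "2017-06-30"). The ToiletZone column indicates the zone where the toilet is located and has two levels: A (party zone) or B (residential zone).
Below is a reproducible example of what I have. It contains only two days to keep the sample dataset small. To create df1 I first had to build four separate data frames and then bind them together (I got an error when I tried to create df1 in one go). df1 has 193 rows.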
library(lubridate)  # provides ymd_hms()
options(digits.secs=3)
day_1_A<- data.frame(Datetime= ymd_hms(c("2017-06-01 00:04:17.986","2017-06-01 00:17:43.456","2017-06-01 00:22:43.456","2017-06-01 00:34:43.456","2017-06-01 00:45:43.456","2017-06-01 01:15:23.275","2017-06-01 01:41:32.609","2017-06-01 02:04:17.986","2017-06-01 02:17:43.456","2017-06-01 03:15:23.275","2017-06-01 03:41:32.609","2017-06-01 04:04:17.986","2017-06-01 04:17:43.456","2017-06-01 05:15:23.275","2017-06-01 05:41:32.609","2017-06-01 06:04:17.986","2017-06-01 06:17:43.456","2017-06-01 07:15:23.275","2017-06-01 07:41:32.609","2017-06-01 08:04:17.986","2017-06-01 08:17:43.456","2017-06-01 09:15:23.275","2017-06-01 09:41:32.609","2017-06-01 10:04:17.986","2017-06-01 10:17:43.456","2017-06-01 11:15:23.275","2017-06-01 11:41:32.609","2017-06-01 12:04:17.986","2017-06-01 12:17:43.456","2017-06-01 13:15:23.275","2017-06-01 13:41:32.609","2017-06-01 14:04:17.986","2017-06-01 14:17:43.456","2017-06-01 15:17:23.275","2017-06-01 15:41:32.609","2017-06-01 16:04:17.986","2017-06-01 16:17:43.456","2017-06-01 17:15:23.275","2017-06-01 17:41:32.609","2017-06-01 18:04:17.986","2017-06-01 18:17:43.456","2017-06-01 19:15:23.275","2017-06-01 19:41:32.609","2017-06-01 20:04:17.986","2017-06-01 20:17:43.456","2017-06-01 21:15:23.275","2017-06-01 21:41:32.609","2017-06-01 22:04:17.986","2017-06-01 22:17:43.456","2017-06-01 23:15:23.275","2017-06-01 23:41:32.609")),
ToiletZone = c("A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A"))
day_1_B<- data.frame(Datetime= ymd_hms(c("2017-06-01 00:04:17.986","2017-06-01 00:17:43.456","2017-06-01 01:15:23.275","2017-06-01 01:41:32.609","2017-06-01 02:04:17.986","2017-06-01 02:17:43.456","2017-06-01 03:15:23.275","2017-06-01 03:41:32.609","2017-06-01 04:04:17.986","2017-06-01 04:17:43.456","2017-06-01 05:15:23.275","2017-06-01 05:41:32.609","2017-06-01 06:04:17.986","2017-06-01 06:17:43.456","2017-06-01 07:15:23.275","2017-06-01 07:41:32.609","2017-06-01 08:04:17.986","2017-06-01 08:17:43.456","2017-06-01 09:15:23.275","2017-06-01 09:41:32.609","2017-06-01 10:04:17.986","2017-06-01 10:17:43.456","2017-06-01 11:15:23.275","2017-06-01 11:41:32.609","2017-06-01 12:04:17.986","2017-06-01 12:17:43.456","2017-06-01 13:15:23.275","2017-06-01 13:41:32.609","2017-06-01 14:04:17.986","2017-06-01 14:17:43.456","2017-06-01 15:15:23.275","2017-06-01 15:41:32.609","2017-06-01 16:04:17.986","2017-06-01 16:17:43.456","2017-06-01 17:15:23.275","2017-06-01 17:41:32.609","2017-06-01 18:04:17.986","2017-06-01 18:17:43.456","2017-06-01 19:15:23.275","2017-06-01 19:41:32.609","2017-06-01 20:04:17.986","2017-06-01 20:17:43.456","2017-06-01 21:15:23.275","2017-06-01 21:41:32.609","2017-06-01 22:04:17.986","2017-06-01 22:17:43.456","2017-06-01 23:15:23.275","2017-06-01 23:41:32.609")),
ToiletZone = c("B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B"))
day_2_A<- data.frame(Datetime= ymd_hms(c("2017-06-02 00:17:43.456","2017-06-02 00:48:43.456","2017-06-02 01:15:23.275","2017-06-02 01:52:23.275","2017-06-02 02:04:17.986","2017-06-02 02:17:43.456","2017-06-02 03:15:23.275","2017-06-02 03:41:32.609","2017-06-02 04:04:17.986","2017-06-02 04:17:43.456","2017-06-02 05:15:23.275","2017-06-02 05:41:32.609","2017-06-02 06:04:17.986","2017-06-02 06:17:43.456","2017-06-02 07:15:23.275","2017-06-02 07:41:32.609","2017-06-02 08:04:17.986","2017-06-02 08:17:43.456","2017-06-02 09:15:23.275","2017-06-02 09:41:32.609","2017-06-02 10:04:17.986","2017-06-02 10:17:43.456","2017-06-02 11:15:23.275","2017-06-02 11:41:32.609","2017-06-02 12:04:17.986","2017-06-02 12:17:43.456","2017-06-02 13:15:23.275","2017-06-02 13:41:32.609","2017-06-02 14:04:17.986","2017-06-02 14:17:43.456","2017-06-02 15:15:23.275","2017-06-02 15:41:32.609","2017-06-02 16:04:17.986","2017-06-02 16:17:43.456","2017-06-02 17:15:23.275","2017-06-02 17:41:32.609","2017-06-02 18:04:17.986","2017-06-02 18:17:43.456","2017-06-02 19:15:23.275","2017-06-02 19:41:32.609","2017-06-02 20:04:17.986","2017-06-02 20:17:43.456","2017-06-02 21:15:23.275","2017-06-02 21:41:32.609","2017-06-02 22:04:17.986","2017-06-02 22:17:43.456","2017-06-02 23:15:23.275","2017-06-02 23:41:32.609")),
ToiletZone = c("A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A"))
day_2_B<- data.frame(Datetime= ymd_hms(c("2017-06-02 00:04:17.986","2017-06-02 01:15:23.275","2017-06-02 02:04:17.986","2017-06-02 02:17:43.456","2017-06-02 03:15:23.275","2017-06-02 03:41:32.609","2017-06-02 04:04:17.986","2017-06-02 04:17:43.456","2017-06-02 05:15:23.275","2017-06-02 05:41:32.609","2017-06-02 06:04:17.986","2017-06-02 06:17:43.456","2017-06-02 07:15:23.275","2017-06-02 07:41:32.609","2017-06-02 08:04:17.986","2017-06-02 08:17:43.456","2017-06-02 09:15:23.275","2017-06-02 09:41:32.609","2017-06-02 10:04:17.986","2017-06-02 10:17:43.456","2017-06-02 11:15:23.275","2017-06-02 11:41:32.609","2017-06-02 12:04:17.986","2017-06-02 12:17:43.456","2017-06-02 13:15:23.275","2017-06-02 13:41:32.609","2017-06-02 14:04:17.986","2017-06-02 14:17:43.456","2017-06-02 15:15:23.275","2017-06-02 15:41:32.609","2017-06-02 16:04:17.986","2017-06-02 16:17:43.456","2017-06-02 17:15:23.275","2017-06-02 17:41:32.609","2017-06-02 18:04:17.986","2017-06-02 18:17:43.456","2017-06-02 19:15:23.275","2017-06-02 19:41:32.609","2017-06-02 20:04:17.986","2017-06-02 20:17:43.456","2017-06-02 21:15:23.275","2017-06-02 21:41:32.609","2017-06-02 22:04:17.986","2017-06-02 22:17:43.456","2017-06-02 23:15:23.275","2017-06-02 23:41:32.609")),
ToiletZone = c("B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B"))
df1<- rbind(day_1_A,day_1_B,day_2_A,day_2_B)
df1
> df1
Datetime ToiletZone
1 2017-06-01 00:04:17.986 A
2 2017-06-01 00:17:43.455 A
3 2017-06-01 00:22:43.455 A
4 2017-06-01 00:34:43.455 A
5 2017-06-01 00:45:43.455 A
6 2017-06-01 01:15:23.275 A
. . .
. . .
. . .
193 2017-06-02 23:41:32.608 B
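For reasons I won't go into here, I need to compute, for each day and each zone, a statistic called θ. It is defined as the quotient of the "mean number of toilet visits per hour on that day" (Hourly_daily_μ) divided by the "mean number of visits per hour over the whole period of interest" (Overall_hourly_μ).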
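The image shows what I expect for the example above (the columns Hourly_daily_μ_A, Hourly_daily_μ_B, Overall_hourly_μ_A and Overall_hourly_μ_B are included only to make the calculation clear; the columns I actually need are θ_A and θ_B):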
Why is Hourly_daily_μ_A equal to 51/24 on June 1st? Because 51 people used the toilet that day. Dividing by the 24 hours of the day gives the hourly mean number of people using the toilet on that day.
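Why is Overall_hourly_μ_A the same on different days within a zone? Because it is the overall mean for that zone. Here we want to know how many people use the toilet per hour on average. In this example, 99 people used the toilet in zone A between June 1st and June 2nd. Dividing by the total number of hours (48 in this example) gives the overall hourly mean for zone A. There is a single value per zone.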
Why is θ_A equal to (51 * 48)/(24 * 99)? Because it is the result of dividing Hourly_daily_μ_A (51/24) by Overall_hourly_μ_A (99/48).
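As a quick check of the arithmetic for zone A, here is a minimal base R sketch. The counts come from the example (51 visits on June 1st, 99 over both days, so 48 on June 2nd); the variable names are only illustrative and do not appear in the data.

```r
# Zone A visit counts from the example: 51 on June 1st, 99 over both days
daily_count_A <- c("2017-06-01" = 51, "2017-06-02" = 99 - 51)
hours_per_day <- 24
total_hours <- hours_per_day * length(daily_count_A)   # 48 hours in two days

# Hourly mean for each day: 51/24 and 48/24
hourly_daily_mu_A <- daily_count_A / hours_per_day
# Overall hourly mean for the zone: 99/48
overall_hourly_mu_A <- sum(daily_count_A) / total_hours

# theta is the ratio of the two means, one value per day
theta_A <- hourly_daily_mu_A / overall_hourly_mu_A
print(round(theta_A, 7))
```

The first element is (51 * 48)/(24 * 99), roughly 1.03, and the second is roughly 0.97. With equal-length days the θ values of a zone average to 1.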
Does anyone know how to do this? My data is very large, so I think the data.table package might be a good option.
One option is to count the visits per group and then do some arithmetic on those counts to get the expected output:
library(dplyr)
library(tidyr)
library(lubridate)
df1 %>%
  # truncate each timestamp to its calendar day
  mutate(Date = as.Date(floor_date(Datetime, "day"))) %>%
  # number of visits per zone and day
  count(ToiletZone, Date, name = "DailyCount") %>%
  group_by(ToiletZone) %>%
  # daily hourly mean divided by the zone's overall hourly mean
  mutate(Theta = (DailyCount / 24) / (sum(DailyCount) / (n() * 24))) %>%
  ungroup() %>%
  select(ToiletZone, Date, Theta) %>%
  # one Theta column per zone
  spread(ToiletZone, Theta)
I think you only need to truncate the datetime to day units, and then you can group by it. With data.table:
library(data.table)
setDT(df1)  # convert df1 to a data.table by reference
df1[, Date := floor_date(Datetime, "day")]
daily <- df1[, .(DailyCount = .N, DailyAvg = .N / 24), by = .(ToiletZone, Date)]
overall <- daily[, .(Total = sum(DailyCount) / (.N * 24)), by = .(ToiletZone)]
overall[daily, .(ToiletZone, Date, Theta = DailyAvg / Total), on = "ToiletZone"]
ToiletZone Date Theta
1: A 2017-06-01 1.0303030
2: B 2017-06-01 1.0212766
3: A 2017-06-02 0.9696970
4: B 2017-06-02 0.9787234
The hourly version is similar: just change the unit passed to floor_date and adjust some of the denominators:
df1[, Date := floor_date(Datetime, "hour")]
hourly <- df1[, .(HourlyCount = .N), by = .(ToiletZone, Date)]
overall <- hourly[, .(Total = sum(HourlyCount) / .N), by = "ToiletZone"]
ans <- overall[hourly, .(ToiletZone, Date, Theta = HourlyCount / Total), on = "ToiletZone"]
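One thing to keep in mind with this approach: .N in the overall step counts only the zone-hour groups that actually appear in the data. In the example every hour has at least one visit, so the denominator is the full 48 hours, but an hour with no visits would simply be missing from hourly. The base R sketch below uses made-up counts (not taken from df1) to show how the denominator shifts in that case.

```r
# Hypothetical visit counts for four consecutive hours in one zone;
# the third hour has no visits (these numbers are NOT from df1)
hourly_count <- c(5, 2, 0, 2)

# A group-by on existing rows never sees the empty hour
observed <- hourly_count[hourly_count > 0]

mean_observed <- mean(observed)       # averages over 3 observed hours
mean_all_hours <- mean(hourly_count)  # averages over all 4 hours

# Theta relative to the mean over all hours, including the empty one
theta_all_hours <- hourly_count / mean_all_hours
print(c(mean_observed = mean_observed, mean_all_hours = mean_all_hours))
```

If empty hours can occur in the real data, dividing by the known number of hours in the period instead of .N would be one way to keep the overall mean on the intended scale.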
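By the way, the last line of each data.table snippet is a join. You can think of it as a left join, with daily and hourly respectively acting as the left table.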
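To make the left-join reading concrete, here is a small base R analogue using merge() with all.x = TRUE instead of the data.table syntax. The toy tables mimic the shape of daily and overall. The zone A counts are from the question; the zone B counts (48 and 46) are inferred from the 193-row total and the Theta values printed above, so treat them as an assumption.

```r
# Toy stand-ins for `daily` and `overall` (zone B counts are inferred, see above)
daily <- data.frame(
  ToiletZone = c("A", "B", "A", "B"),
  Date = as.Date(c("2017-06-01", "2017-06-01", "2017-06-02", "2017-06-02")),
  DailyAvg = c(51, 48, 48, 46) / 24
)
overall <- data.frame(
  ToiletZone = c("A", "B"),
  Total = c(99, 94) / 48
)

# Left join: every row of `daily` is kept and picks up its zone's Total
joined <- merge(daily, overall, by = "ToiletZone", all.x = TRUE)
joined$Theta <- joined$DailyAvg / joined$Total

# merge() sorts by the key, so reorder by date for display
print(joined[order(joined$Date, joined$ToiletZone), c("ToiletZone", "Date", "Theta")])
```

Each of the four daily rows survives the join, which is the same behaviour the answer relies on when it writes overall[daily, ..., on = "ToiletZone"].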