How to calculate the median for groups separately in R
A small example of the data:
df=structure(list(Dt = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L,
9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 21L,
22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L, 34L,
35L, 36L, 37L, 38L, 39L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L,
10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L,
23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L, 34L, 35L,
36L, 37L, 38L, 39L), .Label = c("2018-02-20 00:00:00.000", "2018-02-21 00:00:00.000",
"2018-02-22 00:00:00.000", "2018-02-23 00:00:00.000", "2018-02-24 00:00:00.000",
"2018-02-25 00:00:00.000", "2018-02-26 00:00:00.000", "2018-02-27 00:00:00.000",
"2018-02-28 00:00:00.000", "2018-03-01 00:00:00.000", "2018-03-02 00:00:00.000",
"2018-03-03 00:00:00.000", "2018-03-04 00:00:00.000", "2018-03-05 00:00:00.000",
"2018-03-06 00:00:00.000", "2018-03-07 00:00:00.000", "2018-03-08 00:00:00.000",
"2018-03-09 00:00:00.000", "2018-03-10 00:00:00.000", "2018-03-11 00:00:00.000",
"2018-03-12 00:00:00.000", "2018-03-13 00:00:00.000", "2018-03-14 00:00:00.000",
"2018-03-15 00:00:00.000", "2018-03-16 00:00:00.000", "2018-03-17 00:00:00.000",
"2018-03-18 00:00:00.000", "2018-03-19 00:00:00.000", "2018-03-20 00:00:00.000",
"2018-03-21 00:00:00.000", "2018-03-22 00:00:00.000", "2018-03-23 00:00:00.000",
"2018-03-24 00:00:00.000", "2018-03-25 00:00:00.000", "2018-03-26 00:00:00.000",
"2018-03-27 00:00:00.000", "2018-03-28 00:00:00.000", "2018-03-29 00:00:00.000",
"2018-03-30 00:00:00.000"), class = "factor"), ItemRelation = c(158043L,
158043L, 158043L, 158043L, 158043L, 158043L, 158043L, 158043L,
158043L, 158043L, 158043L, 158043L, 158043L, 158043L, 158043L,
158043L, 158043L, 158043L, 158043L, 158043L, 158043L, 158043L,
158043L, 158043L, 158043L, 158043L, 158043L, 158043L, 158043L,
158043L, 158043L, 158043L, 158043L, 158043L, 158043L, 158043L,
158043L, 158043L, 158043L, 234L, 234L, 234L, 234L, 234L, 234L,
234L, 234L, 234L, 234L, 234L, 234L, 234L, 234L, 234L, 234L, 234L,
234L, 234L, 234L, 234L, 234L, 234L, 234L, 234L, 234L, 234L, 234L,
234L, 234L, 234L, 234L, 234L, 234L, 234L, 234L, 234L, 234L, 234L
), stuff = c(200L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 3600L,
0L, 0L, 0L, 0L, 700L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 1000L, 2600L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 400L, 700L,
200L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 3600L, 0L, 0L, 0L,
0L, 700L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1000L,
2600L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 400L, 700L), num = c(1459L,
1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 1459L,
1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 1459L,
1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 1459L,
1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 1459L,
1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 1459L,
1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 1459L,
1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 1459L,
1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 1459L, 1459L,
1459L, 1459L, 1459L, 1459L, 1459L), year = c(2018L, 2018L, 2018L,
2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L,
2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L,
2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L,
2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L,
2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L,
2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L,
2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L,
2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L,
2018L, 2018L, 2018L), action = c(0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 1L, 1L, 1L, 1L)), .Names = c("Dt", "ItemRelation",
"stuff", "num", "year", "action"), class = "data.frame", row.names = c(NA,
-78L))
Now, for each group defined by ItemRelation + num + year, I need to calculate the median. If I use this solution:
# df with action 0 and stuff > 0
v <- df$stuff[intersect(which(df$action == 0),
which(df$stuff > 0))]
# df with action 1 and stuff > 0
w <- df$stuff[intersect(which(df$action == 1),
which(df$stuff > 0))]
# calculating the median of v for the last 5 observations
l <- length(v)
m0 <- median(v[(l-4):l]) # taking the median of the last 5 observations
# computing the final difference
m <- median(w) - m0
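it computes a single median across all groups at once, but I need the median for each group separately. How can I do that?
This is the expected output: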
ItemRelation num year value
158043 1459 2018 45
158043 234 2018 67
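Post edited. Note that these values are not real and the actual medians will be different; I only want to show the shape of the output.
The action column has only two values, 0 and 1. I need the median of stuff for the action 1 category and the median of stuff for the action 0 category, where the latter uses only the last five nonzero values that come before the action 1 category. In other words, from the zero action category I take only the last 5 nonzero observations, rather than computing the median over all values of the zero category. In our case, these are: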
200
3600
700
1000
2600
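Then the median of the zero category is subtracted from the median of the one category.
The number of nonzero stuff observations in the zero action category can range from 0 to 10. If there are 10 nonzero values in the zero category, take the last five. If there are only 1, 2, 3, 4 or 5 nonzero values, subtract the median of however many there actually are. If there are only zeros and no nonzero values, then subtract just 0.
So the code must compute the zero-category median from only the last 5 observations that precede the one category.
Note that the values in the zero action category are not necessarily 0; they can be other values.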
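The rule described above can be sketched for a single group as a small base R helper. The function name `median_diff` and its arguments are made up for illustration, and the sketch assumes that rows are already in date order and that all action 0 rows come before the action 1 rows, as in the sample data.

```r
# Hypothetical helper: median of the nonzero action == 1 values minus the
# median of the last (up to) five nonzero action == 0 values.
median_diff <- function(stuff, action) {
  ones  <- stuff[action == 1 & stuff > 0]
  zeros <- tail(stuff[action == 0 & stuff > 0], 5)
  # With no nonzero action-0 values, subtract 0, as the question asks.
  m0 <- if (length(zeros) == 0) 0 else median(zeros)
  median(ones) - m0
}

# Nonzero values from the sample data for one group, plus one zero row.
stuff  <- c(200, 0, 3600, 700, 1000, 2600, 400, 700)
action <- c(0,   0, 0,    0,   0,    0,    1,   1)
result <- median_diff(stuff, action)
print(result)  # -450, i.e. median(400, 700) - median(200, 3600, 700, 1000, 2600)
```

If a group has no nonzero action 1 values, `median()` of an empty vector gives `NA`, so the helper returns `NA` for that group.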
The easiest way is to use group_by and summarize from the dplyr package:
library(dplyr)
# median of all values in each group
medians <- df %>%
  group_by(ItemRelation, num, year) %>%
  summarize(med = median(stuff, na.rm = TRUE))
# median of nonzero values in each group
medians <- df %>%
  filter(stuff > 0) %>%
  group_by(ItemRelation, num, year) %>%
  summarize(med = median(stuff, na.rm = TRUE))
# difference between the nonzero medians of action 1 and action 0 in each group
# (this uses all nonzero action-0 values, not only the last five)
median_diffs <- df %>%
  filter(stuff > 0) %>%
  group_by(ItemRelation, num, year) %>%
  summarize(med_diff = median(stuff[action == 1]) - median(stuff[action == 0]))
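For readers who would rather not depend on a package, the same per-group idea can be sketched in base R with `split()` and `lapply()`. The data frame `toy` below is invented for illustration (it reuses only the nonzero values from the question), and the output column name `value` is borrowed from the expected output; like the code above, this sketch does not restrict the action 0 values to the last five.

```r
# Toy data (hypothetical): two groups with the same columns as the question's df.
toy <- data.frame(
  ItemRelation = rep(c(158043L, 234L), each = 8),
  num    = 1459L,
  year   = 2018L,
  stuff  = rep(c(200L, 0L, 3600L, 700L, 1000L, 2600L, 400L, 700L), times = 2),
  action = rep(c(0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L), times = 2)
)

# Split rows by the three grouping columns; drop = TRUE skips empty combinations.
groups <- split(toy, list(toy$ItemRelation, toy$num, toy$year), drop = TRUE)

# One output row per group: median of nonzero action-1 values minus
# median of nonzero action-0 values.
per_group <- lapply(groups, function(g) {
  nz <- g[g$stuff > 0, ]
  data.frame(
    ItemRelation = g$ItemRelation[1],
    num   = g$num[1],
    year  = g$year[1],
    value = median(nz$stuff[nz$action == 1]) - median(nz$stuff[nz$action == 0])
  )
})
out <- do.call(rbind, per_group)
rownames(out) <- NULL
print(out)
```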
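One solution can be achieved with dplyr by following the steps below. See the comments in the code for an explanation of the approach.
Note: as it stands, the sample data from the OP does not seem very meaningful.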
library(dplyr)
df %>% filter(stuff > 0) %>% # First keep only rows with stuff > 0, which are the ones of interest
group_by(ItemRelation, num, year) %>%
mutate(m = median(stuff[action==1]),
m0 = median(tail(stuff[action==0], 5))) %>% # Calculate m and m0 for all rows
filter(action == 1) %>% # Now keep only rows with action == 1
mutate(m = m-m0) %>%
select(-Dt,-m0,-action)
# # A tibble: 4 x 5
# # Groups: ItemRelation, num, year [2]
# ItemRelation stuff num year m
# <int> <int> <int> <int> <dbl>
# 1 158043 400 1459 2018 -450
# 2 158043 700 1459 2018 -450
# 3 234 400 1459 2018 -450
# 4 234 700 1459 2018 -450
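The output above repeats the same difference on every action == 1 row. If one row per group is wanted, as in the question's expected output, a variant with `summarise()` is one option. This is only a sketch: the data frame `toy` is invented, the column name `value` is taken from the expected output, the fallback to 0 follows the question's rule for groups without nonzero action 0 values, and the `.groups` argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Toy data (hypothetical): the first group has six nonzero action-0 values,
# the second group has none.
toy <- data.frame(
  ItemRelation = c(rep(158043L, 9), rep(234L, 4)),
  num    = 1459L,
  year   = 2018L,
  stuff  = c(100L, 200L, 0L, 3600L, 700L, 1000L, 2600L, 400L, 700L,
             0L, 0L, 500L, 300L),
  action = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L,
             0L, 0L, 1L, 1L)
)

out <- toy %>%
  filter(stuff > 0) %>%                        # nonzero values only
  group_by(ItemRelation, num, year) %>%
  summarise(
    m  = median(stuff[action == 1]),           # median of the action-1 values
    n0 = sum(action == 0),                     # how many nonzero action-0 values
    m0 = if (n0 > 0) median(tail(stuff[action == 0], 5)) else 0,
    .groups = "drop"
  ) %>%
  mutate(value = m - m0) %>%
  select(ItemRelation, num, year, value)

print(out)
```

For the first toy group the last five nonzero action 0 values are 200, 3600, 700, 1000 and 2600 (median 1000), so `value` is 550 - 1000 = -450; for the second group nothing is subtracted and `value` is 400.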