Add a new grouping variable in dplyr
# A tibble: 42 x 5
Effective_Date Gender Location n freq
<date> <chr> <chr> <int> <dbl>
1 2017-01-01 Female India 281 0.351
2 2017-01-01 Female US 2446 0.542
3 2017-02-01 Female India 285 0.349
4 2017-02-01 Female US 2494 0.543
5 2017-03-01 Female India 293 0.353
6 2017-03-01 Female US 2494 0.542
7 2017-04-01 Female India 292 0.350
8 2017-04-01 Female US 2475 0.542
9 2017-05-01 Female India 272 0.337
10 2017-05-01 Female US 2493 0.540
Given the table above, I would like to add one row per Effective_Date that holds the mean of freq. How would I go about this? I tried:
tbl %>%
  group_by(Effective_Date) %>%
  mutate(Gender = 'Female', Location = 'All', freq_all = mean(freq)) %>%
  bind_rows(female, .) %>%
  ungroup() %>%
  arrange(Effective_Date)
But this gives me a lot of duplicated rows.
The ideal result would look like this:
# A tibble: 42 x 5
Effective_Date Gender Location n freq
<date> <chr> <chr> <int> <dbl>
1 2017-01-01 Female India 281 0.351
2 2017-01-01 Female US 2446 0.542
3 2017-01-01 Female All NA 0.447
4 etc etc etc etc
This will work for the specific example you provided:
df = read.table(text = "
Effective_Date Gender Location n freq
1 2017-01-01 Female India 281 0.351
2 2017-01-01 Female US 2446 0.542
3 2017-02-01 Female India 285 0.349
4 2017-02-01 Female US 2494 0.543
", header=T)
library(dplyr)
df %>%
  group_by(Effective_Date) %>%
  summarise(freq = mean(freq)) %>%
  mutate(Gender = "Female",
         Location = "all",
         n = NA) %>%
  bind_rows(df) %>%
  arrange(Effective_Date)
# # A tibble: 6 x 5
# Effective_Date Gender Location n freq
# <fct> <chr> <chr> <int> <dbl>
# 1 2017-01-01 Female all NA 0.446
# 2 2017-01-01 Female India 281 0.351
# 3 2017-01-01 Female US 2446 0.542
# 4 2017-02-01 Female all NA 0.446
# 5 2017-02-01 Female India 285 0.349
# 6 2017-02-01 Female US 2494 0.543
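As a side note on why the attempt in the question produced extra rows: mutate() keeps every input row, whereas summarise() collapses each group to a single row. Here is a minimal sketch of the difference, assuming the four Female rows above are typed in by hand (the names via_mutate and via_summarise are only for illustration):

```r
library(dplyr)

# Four rows copied from the example data above
df <- data.frame(
  Effective_Date = c("2017-01-01", "2017-01-01", "2017-02-01", "2017-02-01"),
  Gender   = "Female",
  Location = c("India", "US", "India", "US"),
  n        = c(281L, 2446L, 285L, 2494L),
  freq     = c(0.351, 0.542, 0.349, 0.543),
  stringsAsFactors = FALSE
)

# mutate() returns one row per input row, so each date still has two rows
via_mutate <- df %>%
  group_by(Effective_Date) %>%
  mutate(Location = "All", freq = mean(freq)) %>%
  ungroup()

# summarise() returns one row per group
via_summarise <- df %>%
  group_by(Effective_Date) %>%
  summarise(freq = mean(freq))

nrow(via_mutate)     # 4
nrow(via_summarise)  # 2
```

Binding the mutate() result back onto the original data therefore doubles the row count, which is presumably where the duplicates came from.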
This will work for a more general case, where the Gender column contains both Female and Male:
df = read.table(text = "
Effective_Date Gender Location n freq
1 2017-01-01 Female India 281 0.351
2 2017-01-01 Female US 2446 0.542
3 2017-02-01 Female India 285 0.349
4 2017-02-01 Female US 2494 0.543
5 2017-01-01 Male India 556 0.386
6 2017-01-01 Male US 1123 0.668
7 2017-02-01 Male India 449 0.389
8 2017-02-01 Male US 2237 0.511
", header=T)
library(dplyr)
df %>%
  group_by(Effective_Date, Gender) %>%
  summarise(freq = mean(freq)) %>%
  ungroup() %>%
  mutate(Location = "all",
         n = NA) %>%
  bind_rows(df) %>%
  arrange(Effective_Date, Gender)
# # A tibble: 12 x 5
# Effective_Date Gender freq Location n
# <fct> <fct> <dbl> <chr> <int>
# 1 2017-01-01 Female 0.446 all NA
# 2 2017-01-01 Female 0.351 India 281
# 3 2017-01-01 Female 0.542 US 2446
# 4 2017-01-01 Male 0.527 all NA
# 5 2017-01-01 Male 0.386 India 556
# 6 2017-01-01 Male 0.668 US 1123
# 7 2017-02-01 Female 0.446 all NA
# 8 2017-02-01 Female 0.349 India 285
# 9 2017-02-01 Female 0.543 US 2494
#10 2017-02-01 Male 0.45 all NA
#11 2017-02-01 Male 0.389 India 449
#12 2017-02-01 Male 0.511 US 2237
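The output above puts the summary row first in each group and leaves the columns in a different order from the question's desired result. If the "All" row should come last, one option is to sort on a logical key as well. This is only a sketch: the helper names totals and result are made up here, and the label "All" is taken from the question.

```r
library(dplyr)

df <- data.frame(
  Effective_Date = rep(c("2017-01-01", "2017-01-01", "2017-02-01", "2017-02-01"), 2),
  Gender   = rep(c("Female", "Male"), each = 4),
  Location = rep(c("India", "US"), 4),
  n        = c(281L, 2446L, 285L, 2494L, 556L, 1123L, 449L, 2237L),
  freq     = c(0.351, 0.542, 0.349, 0.543, 0.386, 0.668, 0.389, 0.511),
  stringsAsFactors = FALSE
)

# One summary row per date and gender
totals <- df %>%
  group_by(Effective_Date, Gender) %>%
  summarise(freq = mean(freq)) %>%
  ungroup() %>%
  mutate(Location = "All", n = NA_integer_)

result <- bind_rows(df, totals) %>%
  # FALSE sorts before TRUE, so the "All" row lands last within each group
  arrange(Effective_Date, Gender, Location == "All", Location) %>%
  # restore the column order used in the question
  select(Effective_Date, Gender, Location, n, freq)

result
```

Using NA_integer_ rather than a bare NA keeps n an integer column from the start, although bind_rows() would also coerce a logical NA.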
data.table has a function for this:
library(data.table)
setDT(df)
res = groupingsets(df, by=c("Effective_Date", "Gender", "Location"),
  sets=list(
    c("Effective_Date", "Gender"),
    c("Effective_Date", "Gender", "Location")
  ), j = .(n = sum(n), freq = mean(freq))
)[order(Effective_Date, Gender, Location, na.last=TRUE)]
Effective_Date Gender Location n freq
1: 2017-01-01 Female India 281 0.3510
2: 2017-01-01 Female US 2446 0.5420
3: 2017-01-01 Female <NA> 2727 0.4465
4: 2017-02-01 Female India 285 0.3490
5: 2017-02-01 Female US 2494 0.5430
6: 2017-02-01 Female <NA> 2779 0.4460
So you are grouping at two levels, one of which leaves out Location. If you want to show "All" instead of NA, there's res[is.na(Location), Location := "All"][].
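(It seems like weighted.mean(freq, n) should be used here rather than mean(freq). This also includes the count n for the summary rows, since leaving it out seems odd and would be tedious.)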
A little shorter:
myby = c("Effective_Date", "Gender", "Location")
groupingsets(df,
  j = .(n = sum(n), freq = mean(freq)),
  by=myby, sets=list(myby, head(myby, -1))
)[, setorderv(.SD, myby, na.last=TRUE)]
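To illustrate the weighted.mean(freq, n) remark above, here is a small dplyr sketch of the weighted summary rows. It assumes that freq is a per-location proportion and n the matching count, which the thread does not state explicitly; weighted_totals is a made-up name.

```r
library(dplyr)

df <- data.frame(
  Effective_Date = c("2017-01-01", "2017-01-01", "2017-02-01", "2017-02-01"),
  Gender   = "Female",
  Location = c("India", "US", "India", "US"),
  n        = c(281L, 2446L, 285L, 2494L),
  freq     = c(0.351, 0.542, 0.349, 0.543),
  stringsAsFactors = FALSE
)

weighted_totals <- df %>%
  group_by(Effective_Date, Gender) %>%
  # freq must be computed before n is overwritten: summarise() evaluates
  # its arguments in order, and later ones see columns created earlier
  summarise(freq = weighted.mean(freq, n), n = sum(n)) %>%
  ungroup() %>%
  mutate(Location = "All")

weighted_totals
```

For 2017-01-01 the weighted value is about 0.522, compared with 0.4465 for the plain mean, because the US rows carry far more observations than the India rows.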