Add a new grouping variable in dplyr

# A tibble: 42 x 5
   Effective_Date Gender Location     n  freq
   <date>         <chr>  <chr>    <int> <dbl>
 1 2017-01-01     Female India      281 0.351
 2 2017-01-01     Female US        2446 0.542
 3 2017-02-01     Female India      285 0.349
 4 2017-02-01     Female US        2494 0.543
 5 2017-03-01     Female India      293 0.353
 6 2017-03-01     Female US        2494 0.542
 7 2017-04-01     Female India      292 0.350
 8 2017-04-01     Female US        2475 0.542
 9 2017-05-01     Female India      272 0.337
10 2017-05-01     Female US        2493 0.540

Given the table above, how would I add one row per effective date that holds the average freq? I tried:

tbl %>% 
  group_by(Effective_Date) %>% 
  mutate(Gender = 'Female',Location='All',freq_all = mean(freq)) %>% 
  bind_rows(female,.) %>% 
  ungroup() %>% 
  arrange(Effective_Date)

But this gives me a lot of duplicated rows.

The ideal result would look like this:

 # A tibble: 42 x 5
       Effective_Date Gender Location     n  freq
       <date>         <chr>  <chr>    <int> <dbl>
     1 2017-01-01     Female India      281 0.351
     2 2017-01-01     Female US        2446 0.542
     3 2017-01-01     Female All         NA 0.447
     4 etc etc etc etc
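
The duplicates in the attempt above most likely come from mutate(), which keeps every input row, whereas summarise() collapses each group to a single row. A minimal sketch of the difference, using a hypothetical data frame named toy (an assumption for illustration, not the asker's data):

```r
library(dplyr)

# Hypothetical toy data mirroring some of the question's columns
toy <- tibble(
  Effective_Date = as.Date(c("2017-01-01", "2017-01-01",
                             "2017-02-01", "2017-02-01")),
  Location = c("India", "US", "India", "US"),
  freq = c(0.351, 0.542, 0.349, 0.543)
)

# mutate() returns one row per input row, so each date still has two rows
mutated <- toy %>%
  group_by(Effective_Date) %>%
  mutate(freq_all = mean(freq)) %>%
  ungroup()

# summarise() collapses each group to one row
summarised <- toy %>%
  group_by(Effective_Date) %>%
  summarise(freq = mean(freq))

nrow(mutated)     # 4
nrow(summarised)  # 2
```

Binding the mutate() result back onto the original table therefore doubles the rows instead of adding one row per date.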

This will work for the specific example you provided:

df = read.table(text = "
Effective_Date Gender Location     n  freq
1 2017-01-01     Female India      281 0.351
2 2017-01-01     Female US        2446 0.542
3 2017-02-01     Female India      285 0.349
4 2017-02-01     Female US        2494 0.543
", header = TRUE)

library(dplyr)

df %>%
  group_by(Effective_Date) %>%
  summarise(freq = mean(freq)) %>%
  mutate(Gender = "Female",
         Location = "all",
         n = NA) %>%
  bind_rows(df) %>%
  arrange(Effective_Date)

# # A tibble: 6 x 5
#   Effective_Date Gender Location     n  freq
#   <fct>          <chr>  <chr>    <int> <dbl>
# 1 2017-01-01     Female all         NA 0.446
# 2 2017-01-01     Female India      281 0.351
# 3 2017-01-01     Female US        2446 0.542
# 4 2017-02-01     Female all         NA 0.446
# 5 2017-02-01     Female India      285 0.349
# 6 2017-02-01     Female US        2494 0.543

And this works for the more general case, where the Gender column contains both Female and Male:

df = read.table(text = "
Effective_Date Gender Location     n  freq
1 2017-01-01     Female India      281 0.351
2 2017-01-01     Female US        2446 0.542
3 2017-02-01     Female India      285 0.349
4 2017-02-01     Female US        2494 0.543
5 2017-01-01     Male India      556 0.386
6 2017-01-01     Male US        1123 0.668
7 2017-02-01     Male India      449 0.389
8 2017-02-01     Male US        2237 0.511
", header = TRUE)

library(dplyr)

df %>%
  group_by(Effective_Date, Gender) %>%
  summarise(freq = mean(freq)) %>%
  ungroup() %>%
  mutate(Location = "all",
         n = NA) %>%
  bind_rows(df) %>%
  arrange(Effective_Date, Gender) 

# # A tibble: 12 x 5
#   Effective_Date Gender  freq Location     n
#   <fct>          <fct>  <dbl> <chr>    <int>
# 1 2017-01-01     Female 0.446 all         NA
# 2 2017-01-01     Female 0.351 India      281
# 3 2017-01-01     Female 0.542 US        2446
# 4 2017-01-01     Male   0.527 all         NA
# 5 2017-01-01     Male   0.386 India      556
# 6 2017-01-01     Male   0.668 US        1123
# 7 2017-02-01     Female 0.446 all         NA
# 8 2017-02-01     Female 0.349 India      285
# 9 2017-02-01     Female 0.543 US        2494
#10 2017-02-01     Male   0.45  all         NA
#11 2017-02-01     Male   0.389 India      449
#12 2017-02-01     Male   0.511 US        2237
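
In the output above the summary row sorts first within each group, while the question's desired result has it last. One possible way to get that order is to sort on a logical key as well; the sketch below assumes dplyr 1.0.0 or later (for the .groups argument) and uses the label "All" as in the question:

```r
library(dplyr)

df <- read.table(text = "
Effective_Date Gender Location n freq
2017-01-01 Female India 281 0.351
2017-01-01 Female US 2446 0.542
2017-02-01 Female India 285 0.349
2017-02-01 Female US 2494 0.543
", header = TRUE, stringsAsFactors = FALSE)

res <- df %>%
  group_by(Effective_Date, Gender) %>%
  summarise(freq = mean(freq), .groups = "drop") %>%
  mutate(Location = "All", n = NA_integer_) %>%
  # put the original rows first, the summary rows second
  bind_rows(df, .) %>%
  # FALSE sorts before TRUE, so "All" lands last within each date and gender
  arrange(Effective_Date, Gender, Location == "All") %>%
  select(Effective_Date, Gender, Location, n, freq)

res
```

The final select() only restores the column order of the original table.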

data.table has a function for this:

library(data.table)
setDT(df)

res = groupingsets(df, by=c("Effective_Date", "Gender", "Location"), 
  sets=list(
    c("Effective_Date", "Gender"), 
    c("Effective_Date", "Gender", "Location")
  ), j = .(n = sum(n), freq = mean(freq))
)[order(Effective_Date, Gender, Location, na.last=TRUE)]

   Effective_Date Gender Location    n   freq
1:     2017-01-01 Female    India  281 0.3510
2:     2017-01-01 Female       US 2446 0.5420
3:     2017-01-01 Female     <NA> 2727 0.4465
4:     2017-02-01 Female    India  285 0.3490
5:     2017-02-01 Female       US 2494 0.5430
6:     2017-02-01 Female     <NA> 2779 0.4460

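So you are grouping at two levels, the second of which leaves out Location. To show "All" instead of NA, use res[is.na(Location), Location := "All"][] (this assumes Location is a character column rather than a factor).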

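(It seems that weighted.mean(freq, n) should be used here instead of mean(freq). This version also includes the count n in the "All" rows, since leaving it out seemed odd and would be tedious to do.)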
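
As a rough illustration of the weighted.mean(freq, n) suggestion, here is a dplyr sketch on the Female-only sample data. The object name totals is hypothetical, and whether n is the right weight depends on how freq was computed, which the question does not state:

```r
library(dplyr)

df <- read.table(text = "
Effective_Date Gender Location n freq
2017-01-01 Female India 281 0.351
2017-01-01 Female US 2446 0.542
2017-02-01 Female India 285 0.349
2017-02-01 Female US 2494 0.543
", header = TRUE, stringsAsFactors = FALSE)

totals <- df %>%
  group_by(Effective_Date, Gender) %>%
  summarise(
    # compute freq first: once n is overwritten below, later
    # expressions in this call would see the summed n instead
    freq = weighted.mean(freq, n),
    n = sum(n),
    .groups = "drop"
  ) %>%
  mutate(Location = "All")

totals
```

Because the US rows have far larger n, the weighted value sits much closer to the US freq than the plain mean does.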

A bit shorter:

myby = c("Effective_Date", "Gender", "Location")
groupingsets(df, 
  j = .(n = sum(n), freq = mean(freq)), 
  by=myby, sets=list(myby, head(myby, -1))
)[, setorderv(.SD, myby, na.last=TRUE)]
