How do I group by in R where some rows (that are subsets of others) belong to multiple groups?

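Here is my data. I want to group the rows by date, ID1 and ID2. Rows with an ID3 are grouped as subsets of these, and each subset should also include the rows that match on the first and second IDs but have no ID3 (row 5 below counts toward both pqr123 and hfz123). I also want to sum the stats and generate n().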

      date     ID1     ID2     ID3 stat1 stat2 stat3
1 12-03-07  abc123  wxy456  pqr123    10    20     30  
2 12-03-07  abc123  wxy456  pqr123    20    40     60  
3 10-04-07  bcd456  wxy456  hgf356    10    20     40  
4 12-03-07  abc123  wxy456  hfz123    30    60     90  
5 12-03-07  abc123  wxy456   <NA>     40    50     70  

Desired output

date      ID1     ID2     ID3     n  stat1  stat2  stat3
12-03-07  abc123  wxy456  pqr123  3     70    110    160
10-04-07  bcd456  wxy456  hgf356  1     10     20     40
12-03-07  abc123  wxy456  hfz123  2     40     50     70

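There may be a more elegant solution, but I solved this with dplyr's group_by / summarise (as in Adam Quek's code), then joined the NA and non-NA results and computed the means.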

        library(dplyr)

        # Summarize: one row per (date, ID1, ID2, ID3), with NA kept as its own ID3 group
        df <- df %>% group_by(date, ID1, ID2, ID3) %>% summarise(n=n(), stat1=sum(stat1), stat2=sum(stat2), stat3=sum(stat3))

        # Select instances where NA
        dfNA <- df %>% filter(is.na(ID3))

        # Select instances where no NA
        df1 <- df %>% filter(!is.na(ID3))

        # Join these on all three grouping keys (.x = rows with an ID3, .y = rows with NA)
        dfBig <- df1 %>% full_join(dfNA, by = c("date", "ID1", "ID2")) %>% 
          subset(select= c("ID1", "ID2", "ID3.x", "date", "n.x", "n.y", "stat1.x", "stat1.y", "stat2.x", "stat2.y", "stat3.x", "stat3.y"))

        # Replace <NA>s by 0 (each column is tested against itself)
        dfBig$stat1.x[is.na(dfBig$stat1.x)] <- 0
        dfBig$stat1.y[is.na(dfBig$stat1.y)] <- 0
        dfBig$stat2.x[is.na(dfBig$stat2.x)] <- 0
        dfBig$stat2.y[is.na(dfBig$stat2.y)] <- 0
        dfBig$stat3.x[is.na(dfBig$stat3.x)] <- 0
        dfBig$stat3.y[is.na(dfBig$stat3.y)] <- 0
        dfBig$n.x[is.na(dfBig$n.x)] <- 0
        dfBig$n.y[is.na(dfBig$n.y)] <- 0

        # Compute Mean stats and Rename Columns
        # (n-weighted average of the .x and .y values; this is a true mean only
        #  if the summarise step used mean() rather than sum())
        dfBig$stat1Mean <- (dfBig$stat1.x * dfBig$n.x + dfBig$stat1.y * dfBig$n.y) / (dfBig$n.x + dfBig$n.y)
        dfBig$stat2Mean <- (dfBig$stat2.x * dfBig$n.x + dfBig$stat2.y * dfBig$n.y) / (dfBig$n.x + dfBig$n.y)
        dfBig$stat3Mean <- (dfBig$stat3.x * dfBig$n.x + dfBig$stat3.y * dfBig$n.y) / (dfBig$n.x + dfBig$n.y)
        dfBig$n2 <- dfBig$n.x + dfBig$n.y
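
For comparison, here is a shorter sketch of the same split-and-join idea that produces summed stats and counts directly. It rests on two assumptions that the question does not spell out: a row whose ID3 is NA counts toward every non-NA ID3 group sharing its date, ID1 and ID2, and the stats are to be summed rather than averaged. The names `known`, `unknown`, `result` and the `.na` column suffix are made up for illustration. Under this reading the hfz123 row comes out as n = 2 with 70, 110, 160 (30+40, 60+50, 90+70), which differs from the 40, 50, 70 shown for it in the question's desired output.

```r
library(dplyr)

# Sample data from the question (NA marks a missing ID3)
df <- data.frame(
  date  = c("12-03-07", "12-03-07", "10-04-07", "12-03-07", "12-03-07"),
  ID1   = c("abc123", "abc123", "bcd456", "abc123", "abc123"),
  ID2   = c("wxy456", "wxy456", "wxy456", "wxy456", "wxy456"),
  ID3   = c("pqr123", "pqr123", "hgf356", "hfz123", NA),
  stat1 = c(10, 20, 10, 30, 40),
  stat2 = c(20, 40, 20, 60, 50),
  stat3 = c(30, 60, 40, 90, 70),
  stringsAsFactors = FALSE
)

# Totals for rows with a known ID3: one row per (date, ID1, ID2, ID3)
known <- df %>%
  filter(!is.na(ID3)) %>%
  group_by(date, ID1, ID2, ID3) %>%
  summarise(n = n(), stat1 = sum(stat1), stat2 = sum(stat2), stat3 = sum(stat3)) %>%
  ungroup()

# Totals for rows with a missing ID3: one row per (date, ID1, ID2)
# (the ".na" suffix is an illustrative naming choice)
unknown <- df %>%
  filter(is.na(ID3)) %>%
  group_by(date, ID1, ID2) %>%
  summarise(n.na = n(), stat1.na = sum(stat1), stat2.na = sum(stat2),
            stat3.na = sum(stat3)) %>%
  ungroup()

# Attach the NA totals to every matching known group, then add them in;
# coalesce() turns "no NA rows for this group" into a zero contribution
result <- known %>%
  left_join(unknown, by = c("date", "ID1", "ID2")) %>%
  mutate(
    n     = n     + coalesce(n.na, 0L),
    stat1 = stat1 + coalesce(stat1.na, 0),
    stat2 = stat2 + coalesce(stat2.na, 0),
    stat3 = stat3 + coalesce(stat3.na, 0)
  ) %>%
  select(date, ID1, ID2, ID3, n, stat1, stat2, stat3)

print(result)
```

Because the NA totals are joined on `date`, `ID1` and `ID2` only, a single NA row is added to each ID3 subset of its group, which is how one row can belong to several groups. Groups with no NA rows are left unchanged.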

