Aggregating subsets of a data frame in R

Question

I have a data frame ds:

CountyID  ZipCode   Value1    Value2    Value3 ...   Value25
   1        1         0        etc        etc          etc
   2        1         3       
   3        1         0       
   4        1         1       
   5        2         2       
   6        3         3       
   7        4         7
   8        4         2       
   9        5         1       
   10       6         0

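and would like to aggregate it by ds$ZipCode, setting ds$CountyID to the dominant county, i.e. the county with the highest ds$Value1 within each zip code. For the example above, the result would look like this: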

CountyID  ZipCode   Value1    Value2    Value3 ...   Value25
   2        1         4        etc        etc          etc
   5        2         2       
   6        3         3       
   7        4         9       
   9        5         1       
   10       6         0

Every ValueX column is the sum of that column, grouped by ZipCode.

Over the past few days I have tried a number of different strategies, but none of them has worked. The best I have come up with is

#initialize the dataframe
ds_temp = data.frame()

#loop through each subset based on unique zipcodes
for (zip in unique(ds$ZipCode)) {

    sub <- subset(ds, ds$ZipCode == zip)                                           
    len <- length(sub)                                                             
    maxIndex <- which.max(sub$Value1)                          

    #do the aggregation  
    row <- aggregate(sub[3:27], FUN=sum, by=list(                                  
        CountyID = rep(sub$CountyID[maxIndex], len),                           
        ZipCode = sub$ZipCode))                

    rbind(ds_temp, row)                                                            
}                                                                                  

ds <- ds_temp

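I can't test this on the real data, but with a dummy dataset (such as the one above) I keep getting the error "arguments must have same length". I have experimented with rep() and with fixed vectors (e.g. c(1,2,3,4)), but no matter what I do the error persists. Occasionally I also get an error to the effect of

cannot subset data of type 'closure'.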

Any ideas? I have also tried working with data.frame(), ddply(), data.table(), dcast(), etc.

Answer 1

You can try this:

data.frame(aggregate(df[,3:27], by=list(df$ZipCode), sum),
  CountyID = unlist(lapply(split(df, df$ZipCode), 
    function(x) x$CountyID[which.max(x$Value1)])))

Fully reproducible sample data:

df<-read.table(text="
CountyID  ZipCode   Value1    
   1        1         0   
   2        1         3       
   3        1         0       
   4        1         1       
   5        2         2       
   6        3         3       
   7        4         7
   8        4         2       
   9        5         1       
   10       6         0", header=TRUE)

data.frame(aggregate(df[,3], by=list(df$ZipCode), sum),
  CountyID = unlist(lapply(split(df, df$ZipCode), 
    function(x) x$CountyID[which.max(x$Value1)])))

#  Group.1 x CountyID
#1       1 4        2
#2       2 2        5
#3       3 3        6
#4       4 9        7
#5       5 1        9
#6       6 0       10
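
The same idea can be extended to several value columns while keeping the original column names and order. The following is only a sketch under assumptions: Value2 is made-up data, and value_cols, sums, dominant and result are illustrative names that do not appear in the question.

```r
# Hypothetical sample data: the question's Value1 plus a made-up Value2 column
df <- data.frame(
  CountyID = 1:10,
  ZipCode  = c(1, 1, 1, 1, 2, 3, 4, 4, 5, 6),
  Value1   = c(0, 3, 0, 1, 2, 3, 7, 2, 1, 0),
  Value2   = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
)

# Sum every Value column per zip code; naming the list element keeps
# the grouping column called ZipCode instead of Group.1
value_cols <- grep("^Value", names(df), value = TRUE)
sums <- aggregate(df[value_cols], by = list(ZipCode = df$ZipCode), FUN = sum)

# County with the highest Value1 within each zip code (first one on ties)
dominant <- sapply(split(df, df$ZipCode),
                   function(x) x$CountyID[which.max(x$Value1)])

# split() and aggregate() both order groups by sorted ZipCode, so rows line up
result <- data.frame(CountyID = unname(dominant), sums)
result
```

As a side note on the loop in the question: length() applied to a data frame returns the number of columns, not rows, so rep(..., len) there produces a vector whose length generally differs from that of sub$ZipCode, which is what triggers "arguments must have same length". The result of rbind(ds_temp, row) is also never assigned back to ds_temp.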

Answer 2

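In response to your comment on Frank's answer: you can preserve the column names by using the formula method of aggregate. Using Frank's data df, that would be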

> cbind(aggregate(Value1 ~ ZipCode, df, sum), 
        CountyID = sapply(split(df, df$ZipCode), function(x) {
            with(x, CountyID[Value1 == max(Value1)]) }))
#   ZipCode Value1 CountyID
# 1       1      4        2
# 2       2      2        5
# 3       3      3        6
# 4       4      9        7
# 5       5      1        9
# 6       6      0       10
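
Two details may matter on the real data. First, Value1 == max(Value1) selects every county tied for the maximum, so a zip code with a tie yields more than one CountyID; which.max() always picks a single row. Second, the formula `. ~ ZipCode` sums all remaining columns at once, provided CountyID is dropped first. A minimal sketch, assuming made-up data with a deliberate tie (Value2, sums, dominant and result are illustrative only):

```r
# Hypothetical data with a tie in zip code 1 (counties 2 and 4 both have Value1 = 3)
df <- data.frame(
  CountyID = 1:6,
  ZipCode  = c(1, 1, 1, 1, 2, 2),
  Value1   = c(0, 3, 0, 3, 2, 5),
  Value2   = c(1, 2, 3, 4, 5, 6)
)

# "." stands for every column other than ZipCode, so drop CountyID first
sums <- aggregate(. ~ ZipCode, data = df[names(df) != "CountyID"], FUN = sum)

# which.max() returns a single index even when several rows share the maximum
dominant <- sapply(split(df, df$ZipCode),
                   function(x) x$CountyID[which.max(x$Value1)])

result <- cbind(CountyID = unname(dominant), sums)
result
```

Note that the formula method of aggregate drops rows containing NA by default (na.action = na.omit), so sums can differ from the by= version if the real data has missing values.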
