Group/bin/bucket data in R and get count per bucket and sum of values per bucket

I wish to bucket/group/bin data:

C1             C2       C3
49488.01172    0.0512   54000
268221.1563    0.0128   34399
34775.96094    0.0128   54444
13046.98047    0.07241  61000
2121699.75     0.00453  78921
71155.09375    0.0181   13794
1369809.875    0.00453  12312
750            0.2048   43451
44943.82813    0.0362   49871
85585.04688    0.0362   18947
31090.10938    0.0362   13401
68550.40625    0.0181   14345

I want to bucket it by C2 values, but I wish to define the buckets myself, e.g. <=0.005, <=0.010, <=0.014, etc. As you can see, the buckets will have uneven widths. I want the count of C1 per bucket as well as the sum of C1 for every bucket.

I don't know where to begin, as I am a fairly new user of R. Is anyone willing to help me figure out the code, or direct me to an example that will work for my needs?

EDIT: added another column, C3. I need the sum of C3 per bucket as well, alongside the sum and count of C1 per bucket.

From the comments, "C2" seems to be a "character" column with % as a suffix. Before creating a group, remove the % using sub and convert to "numeric" (as.numeric). The variable "group" is created (transform(df, ...)) by using the function cut with the breaks (group buckets/intervals) and labels (the desired group labels) arguments. By default cut builds intervals that are closed on the right, and here each label names the lower bound of its interval, so the bucket labelled 0.014 holds every value above 0.014. Once the group variable is created, the sum of "C1" by "group" and the "count" of elements within each "group" can be computed using aggregate from base R. Merging the result with the full set of factor levels keeps buckets that contain no rows (they show up as NA).

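# Strip the "%" suffix, convert to numeric, then bin into right-closed intervals.
# Each label names the lower bound of its interval.
df1 <- transform(df, group = cut(as.numeric(sub('[%]', '', C2)),
    breaks = c(-Inf, 0.005, 0.010, 0.014, Inf),
    labels = c('<0.005', 0.005, 0.01, 0.014)))

# Count and sum of C1 per bucket; do.call(data.frame, ...) flattens the matrix column.
res <- do.call(data.frame, aggregate(C1 ~ group, df1,
        FUN = function(x) c(Count = length(x), Sum = sum(x))))

# Merge with all factor levels so that empty buckets appear (as NA).
dNew <- data.frame(group = levels(df1$group))
merge(res, dNew, all = TRUE)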
 #   group C1.Count    C1.Sum
 #1 <0.005        2 3491509.6
 #2  0.005       NA        NA
 #3   0.01        2  302997.1
 #4  0.014        8  364609.5
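
As a cross-check on the bucketing above, here is a minimal base R sketch that omits labels so that cut names each level after its interval, then uses table and tapply instead of aggregate plus merge. It assumes C2 is already numeric; the names bucket, counts, sums and summary_tbl are illustrative and not part of the original answer.

```r
# Sample data from the question (C2 taken as plain numeric here).
df <- data.frame(
  C1 = c(49488.01172, 268221.1563, 34775.96094, 13046.98047, 2121699.75,
         71155.09375, 1369809.875, 750, 44943.82813, 85585.04688,
         31090.10938, 68550.40625),
  C2 = c(0.0512, 0.0128, 0.0128, 0.07241, 0.00453, 0.0181, 0.00453,
         0.2048, 0.0362, 0.0362, 0.0362, 0.0181)
)

# Without `labels`, cut() names each level after its interval.
# right = TRUE (the default) makes every interval closed on the right.
bucket <- cut(df$C2, breaks = c(-Inf, 0.005, 0.010, 0.014, Inf))

# table() counts every level, including buckets with no rows.
counts <- table(bucket)

# tapply() returns one value per level; empty buckets come back as NA.
sums <- tapply(df$C1, bucket, sum)

# Combine both summaries into one data frame (hypothetical name).
summary_tbl <- data.frame(bucket = names(counts),
                          Count  = as.vector(counts),
                          Sum    = as.vector(sums))
print(summary_tbl)
```

Because table and tapply both iterate over the factor levels, the empty bucket is reported without a separate merge step.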

Or you can use data.table. setDT converts the data.frame to a data.table. Specify the grouping variable with by= and create the two summary variables "Count" and "Sum" within list(). .N gives the count of elements within each "group".

 library(data.table)
  setDT(df1)[, list(Count=.N, Sum=sum(C1)), by=group][]

Or use dplyr. The %>% operator pipes the left-hand side into the function on the right-hand side, chaining the steps together. Use group_by to specify the "group" variable, and then use summarise_each or summarise to get the count and sum of the column concerned. summarise_each is useful when there is more than one column to summarise. (In current dplyr releases, summarise_each and funs are deprecated in favour of across.)

 library(dplyr)
 df1 %>%
      group_by(group) %>% 
      summarise_each(funs(n(), Sum=sum(.)), C1)
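
For readers on a recent dplyr, the following is a rough equivalent written with summarise and across. It is a sketch rather than the original author's code: it assumes dplyr 1.0 or later, assumes C2 is already numeric, and the output column names (Count, C1_Sum) are chosen here for illustration.

```r
library(dplyr)

# Sample data from the question (C2 taken as plain numeric here).
df <- data.frame(
  C1 = c(49488.01172, 268221.1563, 34775.96094, 13046.98047, 2121699.75,
         71155.09375, 1369809.875, 750, 44943.82813, 85585.04688,
         31090.10938, 68550.40625),
  C2 = c(0.0512, 0.0128, 0.0128, 0.07241, 0.00453, 0.0181, 0.00453,
         0.2048, 0.0362, 0.0362, 0.0362, 0.0181)
)

# Same breaks and labels as in the answer above.
df1 <- df %>%
  mutate(group = cut(C2, breaks = c(-Inf, 0.005, 0.010, 0.014, Inf),
                     labels = c("<0.005", "0.005", "0.01", "0.014")))

res <- df1 %>%
  group_by(group, .drop = FALSE) %>%  # keep buckets that have no rows
  summarise(Count = n(),
            across(C1, sum, .names = "{.col}_Sum"),
            .groups = "drop")
print(res)
```

With .drop = FALSE the empty bucket is kept with a count of 0 and a sum of 0 (the sum of an empty vector), rather than the NA produced by the merge in the base R version.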

Update

Using the new dataset df (with a numeric C2 and the extra column C3)

df1 <- transform(df, group=cut(C2,  breaks=c(-Inf,0.005, 0.010, 0.014, Inf),
                             labels=c('<0.005', 0.005, 0.01, 0.014)))

res <- do.call(data.frame,aggregate(cbind(C1,C3)~group, df1, 
       FUN=function(x) c(Count=length(x), Sum=sum(x))))
res
#  group C1.Count    C1.Sum C3.Count C3.Sum
#1 <0.005        2 3491509.6        2  91233
#2   0.01        2  302997.1        2  88843
#3  0.014        8  364609.5        8 268809

You can then do the merge as detailed above.

The dplyr approach is the same, except that the additional variable is specified.

 df1%>%
      group_by(group) %>%
       summarise_each(funs(n(), Sum=sum(.)), C1, C3)
 #Source: local data frame [3 x 5]

 #  group C1_n C3_n    C1_Sum C3_Sum
 #1 <0.005    2    2 3491509.6  91233
 #2   0.01    2    2  302997.1  88843
 #3  0.014    8    8  364609.5 268809
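
The data.table approach shown earlier can be extended to C3 in the same way. This is a sketch under the assumption that C2 is numeric and C3 is present, as in the question's table; the names dt, C1_Sum and C3_Sum are illustrative.

```r
library(data.table)

# Sample data from the question, including the extra column C3.
dt <- data.table(
  C1 = c(49488.01172, 268221.1563, 34775.96094, 13046.98047, 2121699.75,
         71155.09375, 1369809.875, 750, 44943.82813, 85585.04688,
         31090.10938, 68550.40625),
  C2 = c(0.0512, 0.0128, 0.0128, 0.07241, 0.00453, 0.0181, 0.00453,
         0.2048, 0.0362, 0.0362, 0.0362, 0.0181),
  C3 = c(54000, 34399, 54444, 61000, 78921, 13794, 12312, 43451,
         49871, 18947, 13401, 14345)
)

# Add the bucket column by reference, with the same breaks and labels as above.
dt[, group := cut(C2, breaks = c(-Inf, 0.005, 0.010, 0.014, Inf),
                  labels = c("<0.005", "0.005", "0.01", "0.014"))]

# One row per non-empty bucket: row count plus the sums of C1 and C3.
res <- dt[, .(Count = .N, C1_Sum = sum(C1), C3_Sum = sum(C3)), keyby = group]
print(res)
```

As with aggregate, buckets without rows do not appear in this result; merging against the factor levels would add them back.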

data (for the first example, where C2 is a character column with a % suffix)

df <-structure(list(C1 = c(49488.01172, 268221.1563, 34775.96094, 
13046.98047, 2121699.75, 71155.09375, 1369809.875, 750, 44943.82813, 
85585.04688, 31090.10938, 68550.40625), C2 = c("0.0512%", "0.0128%", 
"0.0128%", "0.07241%", "0.00453%", "0.0181%", "0.00453%", "0.2048%", 
"0.0362%", "0.0362%", "0.0362%", "0.0181%")), .Names = c("C1", 
"C2"), row.names = c(NA, -12L), class = "data.frame")
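
To see what the sub/as.numeric step does to values like these, here is a small standalone sketch. The vectors C2_raw and C2_num are illustrative names. Note that the answer above buckets the numbers exactly as written; whether a value such as "0.0512%" should also be divided by 100 depends on what the column means, which the question does not say.

```r
# A few of the raw C2 strings from the data above.
C2_raw <- c("0.0512%", "0.0128%", "0.00453%")

# fixed = TRUE treats "%" as a literal character rather than a regex.
C2_num <- as.numeric(sub("%", "", C2_raw, fixed = TRUE))

print(C2_num)
print(is.numeric(C2_num))
```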
