
Extracting summary data from various cells in R

Here is the data:

data <-data.frame(
    "start"= c("go",NA,NA,NA,"go",NA,"go"),
    "number"= c(31,32,1,29,61,17,72),
    "info"= c("c","k","s","u","b","i","n"))

   start number info
1    go     31    c
2  <NA>     32    k
3  <NA>      1    s
4  <NA>     29    u
5    go     61    b
6  <NA>     17    i
7    go     72    n

I want to produce a summary table that prints the info from each row where start is "go".

However, I want the number column to be summed over every row from a "go" row (inclusive) up to, but not including, the next "go", so that the result looks as follows:

final <- data.frame(
"start"=c("go","go","go"),
"number"=c(93,78,72),
"info"=c("c","b","n"))

   start number info
1    go     93    c
2    go     78    b
3    go     72    n

Thanks for your help.

One strategy in base R is to do the subsetting and the summation in separate operations and then combine the results. Here, cbind is enough for the combining step, since the two results are constructed to line up row for row.

cbind(data[!is.na(data$start), c(1, 3)],
      sum=aggregate(number ~ cumsum(!is.na(start)), data=data, sum)[,2])
  start info sum
1    go    c  93
5    go    b  78
7    go    n  72

I use !is.na to select the proper rows, which works in this example. If start can hold other non-NA values that you want excluded, you can use !is.na(data$start) & data$start == "go" instead. aggregate performs the summation; the grouping term in its formula, cumsum(!is.na(start)), applies the same test and then takes a cumulative sum of the result, so the group counter increases by one at every "go" row.
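To make the grouping key concrete, here is a small base R sketch. The "stop" value in row 3 is hypothetical (it is not in the question's data) and is only there to show how the loose and strict tests differ; tapply is used in place of aggregate purely to keep the example short.

```r
# Question's numbers, with a hypothetical "stop" marker added in row 3.
start  <- c("go", NA, "stop", NA, "go", NA, "go")
number <- c(31, 32, 1, 29, 61, 17, 72)

# Loose key: any non-NA value opens a new group, so "stop" splits the first block.
loose <- cumsum(!is.na(start))
print(loose)

# Strict key: only a "go" row opens a new group.
is_go  <- !is.na(start) & start == "go"
strict <- cumsum(is_go)
print(strict)

# Sum number within each strict group.
sums <- tapply(number, strict, sum)
print(sums)
```

With the strict key the three sums match the table requested in the question, whereas the loose key would produce four groups.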

You could use dplyr:

data <- data.frame(
  start = c("go", NA, NA, NA, "go", NA, "go"),
  number = c(31, 32, 1, 29, 61, 17, 72),
  info = c("c", "k", "s", "u", "b", "i", "n"),
  stringsAsFactors = FALSE)

library(dplyr)
data$group = cumsum(!is.na(data$start))
data %>% group_by(group) %>% summarize(n=sum(number), info=info[1])

Output:

  group     n  info
1     1    93     c
2     2    78     b
3     3    72     n

Optionally, you could add

 %>% mutate(start="go") %>% select(-group)

to get to your requested output, but I am not sure that actually adds value. Hope this helps!
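For reference, the pieces above can be assembled into a single pipeline along these lines. This is a sketch that assumes dplyr is installed; naming the summed column number (rather than n) and reordering the columns with select are my choices to match the table in the question, not part of the original answer.

```r
library(dplyr)

data <- data.frame(
  start  = c("go", NA, NA, NA, "go", NA, "go"),
  number = c(31, 32, 1, 29, 61, 17, 72),
  info   = c("c", "k", "s", "u", "b", "i", "n"),
  stringsAsFactors = FALSE
)

result <- data %>%
  mutate(group = cumsum(!is.na(start))) %>%  # counter goes up by 1 at each "go"
  group_by(group) %>%
  summarize(number = sum(number),            # total per block
            info = info[1]) %>%              # info from the "go" row
  mutate(start = "go") %>%
  select(start, number, info)                # drops the helper group column

print(result)
```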

Here is an option using data.table:

library(data.table)
setDT(data)[, .(start = start[!is.na(start)], n = sum(number), 
     info = info[1]), .(grp = cumsum(!is.na(start)))][, grp := NULL][]
#   start  n info
#1:    go 93    c
#2:    go 78    b
#3:    go 72    n
