How to aggregate a data.frame when group names are presented in different rows

I have a data.frame like this:

df=data.frame(
grp=c("group1","s1","s2","s3","s4","s5","group2","s6","s7","s8","group2","s9","s10","group3","s11","s12","s13","s14"),
gname=c("gene1",0.00,0.05,0.01,0.01,0.01,"gene1",0.063,0.005,0.015,"gene2",0.07,0.00,"gene3",0.046,0.007,0.011,0.012),
score=c(0.989003844,NA,NA,NA,NA,NA,0.988334014,NA,NA,NA,0.983461712,NA,NA,0.982339339,NA,NA,NA,NA)
)

> df
      grp gname      score
1  group1 gene1 0.9890038
2      s1     0        NA
3      s2  0.05        NA
4      s3  0.01        NA
5      s4  0.01        NA
6      s5  0.01        NA
7  group2 gene1 0.9883340
8      s6 0.063        NA
9      s7 0.005        NA
10     s8 0.015        NA
11 group2 gene2 0.9834617
12     s9  0.07        NA
13    s10     0        NA
14 group3 gene3 0.9823393
15    s11 0.046        NA
16    s12 0.007        NA
17    s13 0.011        NA
18    s14 0.012        NA

Based on the group and gene names, the df can be divided into 4 sections. The following picture shows these 4 sections.

[image]

I want to aggregate the df within each section to find the max of df$score and the length of df$grp (the number of sample rows under each group row), based on columns df$grp and df$gname. The following df shows the expected result.

grp     gname   max.score   length
group1  gene1   0.989003844   5
group2  gene1   0.988334014   3
group2  gene2   0.983461712   2
group3  gene3   0.982339339   4

The following picture shows how the result is obtained.

[image]

How can I perform aggregate(score~grp+gname,df,max) and aggregate(grp~grp+gname,df,length) for each section and save the results in a data.frame?

If you know that each group starts with a non-missing score, followed by missing values, then a combination of cumsum/is.na and tapply will do the trick.

Start by creating an aggregation variable f.

f <- cumsum(!is.na(df$score))
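
To see why this works, it may help to look at f itself. The snippet below is only an illustration: it copies the score column from the question into a standalone vector (score is a name chosen here, not part of the answer) and prints the resulting section counter.

```r
# Scores from the question's df: one non-missing value at the start of each section
score <- c(0.989003844, NA, NA, NA, NA, NA, 0.988334014, NA, NA, NA,
           0.983461712, NA, NA, 0.982339339, NA, NA, NA, NA)

# !is.na(score) is TRUE only on the "group" rows; cumsum turns those
# TRUEs into a running counter that stays constant within a section
f <- cumsum(!is.na(score))
f
# [1] 1 1 1 1 1 1 2 2 2 2 3 3 3 4 4 4 4 4
```

Every row in the same section gets the same value of f, so f can be used as a grouping variable.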

Now check the length of each group. In the output, the top row of numbers holds the values of the "names" attribute and the bottom row holds the lengths. These lengths include the "group*" row, so subtract 1 in the final data frame.

tapply(f, f, length)
#1 2 3 4 
#6 4 3 5 

Create the result the question asks for.

result <- cbind(df[!is.na(df$score), ], length = tapply(f, f, length) - 1)

result
#      grp gname     score length
#1  group1 gene1 0.9890038      5
#7  group2 gene1 0.9883340      3
#11 group2 gene2 0.9834617      2
#14 group3 gene3 0.9823393      4

If you further want consecutive row names,

row.names(result) <- NULL
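
Since the question asks about aggregate specifically, the same result can probably be reached with it too. The sketch below is not part of the original answer. It relies on the same assumption (only the first row of each section has a non-missing score), and the names section, labels, max_score and n_rows are made up for illustration.

```r
df <- data.frame(
  grp = c("group1","s1","s2","s3","s4","s5","group2","s6","s7","s8",
          "group2","s9","s10","group3","s11","s12","s13","s14"),
  gname = c("gene1",0.00,0.05,0.01,0.01,0.01,"gene1",0.063,0.005,0.015,
            "gene2",0.07,0.00,"gene3",0.046,0.007,0.011,0.012),
  score = c(0.989003844,NA,NA,NA,NA,NA,0.988334014,NA,NA,NA,
            0.983461712,NA,NA,0.982339339,NA,NA,NA,NA),
  stringsAsFactors = FALSE
)

# Section id: increases by 1 at every row with a non-missing score
df$section <- cumsum(!is.na(df$score))

# The first row of each section carries the grp and gname labels
labels <- df[!is.na(df$score), c("section", "grp", "gname")]

# The formula interface drops rows with NA score by default (na.action = na.omit),
# so max only sees the one non-missing score per section
max_score <- aggregate(score ~ section, data = df, FUN = max)
names(max_score)[2] <- "max.score"

# Rows per section, minus 1 to exclude the "group" row itself
n_rows <- aggregate(grp ~ section, data = df, FUN = function(x) length(x) - 1)
names(n_rows)[2] <- "length"

# Combine the three pieces and drop the helper column
result <- merge(merge(labels, max_score, by = "section"), n_rows, by = "section")
result$section <- NULL
result
```

Grouping by the section id rather than by grp and gname directly is what keeps the sample rows (s1, s2, ...) attached to their group row.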

An option with tidyverse

library(dplyr)
df %>% 
  group_by(grp1 = cumsum(grepl("group", grp))) %>%
  mutate(length = n() -1) %>%
  slice(1) %>%
  ungroup %>%
  select(-grp1)
# A tibble: 4 x 4
#  grp    gname score length
#  <fct>  <fct> <dbl>  <dbl>
#1 group1 gene1 0.989      5
#2 group2 gene1 0.988      3
#3 group2 gene2 0.983      2
#4 group3 gene3 0.982      4
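
A variant of the pipeline above, sketched here with summarise instead of mutate and slice, also names the score column max.score as in the expected result. It assumes, like the code above, that every section starts with a row whose grp contains "group". The names section and out are chosen for illustration only.

```r
library(dplyr)

df <- data.frame(
  grp = c("group1","s1","s2","s3","s4","s5","group2","s6","s7","s8",
          "group2","s9","s10","group3","s11","s12","s13","s14"),
  gname = c("gene1",0.00,0.05,0.01,0.01,0.01,"gene1",0.063,0.005,0.015,
            "gene2",0.07,0.00,"gene3",0.046,0.007,0.011,0.012),
  score = c(0.989003844,NA,NA,NA,NA,NA,0.988334014,NA,NA,NA,
            0.983461712,NA,NA,0.982339339,NA,NA,NA,NA),
  stringsAsFactors = FALSE
)

out <- df %>%
  # section id: increases by 1 at every "group" row
  group_by(section = cumsum(grepl("group", grp))) %>%
  summarise(
    grp = first(grp),                     # label from the section's first row
    gname = first(gname),
    max.score = max(score, na.rm = TRUE), # ignore the NA scores of sample rows
    length = n() - 1                      # exclude the "group" row itself
  ) %>%
  select(-section)
out
```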
