I have a data.frame like this
df=data.frame(
grp=c("group1","s1","s2","s3","s4","s5","group2","s6","s7","s8","group2","s9","s10","group3","s11","s12","s13","s14"),
gname=c("gene1",0.00,0.05,0.01,0.01,0.01,"gene1",0.063,0.005,0.015,"gene2",0.07,0.00,"gene3",0.046,0.007,0.011,0.012),
score=c(0.989003844,NA,NA,NA,NA,NA,0.988334014,NA,NA,NA,0.983461712,NA,NA,0.982339339,NA,NA,NA,NA)
)
> df
grp gname score
1 group1 gene1 0.9890038
2 s1 0 NA
3 s2 0.05 NA
4 s3 0.01 NA
5 s4 0.01 NA
6 s5 0.01 NA
7 group2 gene1 0.9883340
8 s6 0.063 NA
9 s7 0.005 NA
10 s8 0.015 NA
11 group2 gene2 0.9834617
12 s9 0.07 NA
13 s10 0 NA
14 group3 gene3 0.9823393
15 s11 0.046 NA
16 s12 0.007 NA
17 s13 0.011 NA
18 s14 0.012 NA
Based on the group and gene names, the df can be divided into 4 sections. The following picture shows these 4 sections.
For each section, I want to aggregate the df to find the max of df$score and the length of df$grp, based on the columns df$grp and df$gname. The following df shows the expected result.
grp gname max.score length
group1 gene1 0.989003844 5
group2 gene1 0.988334014 3
group2 gene2 0.983461712 2
group3 gene3 0.982339339 4
The following picture shows how the result is obtained.
How can I perform aggregate(score~grp+gname,df,max) and aggregate(grp~grp+gname,df,length) for each section and save the results in a data.frame?
If you know that each group starts with a non-missing score, followed by missing values, then a combination of cumsum/is.na and tapply will do the trick.
Start by creating an aggregation variable f.
f <- cumsum(!is.na(df$score))
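To make the grouping variable concrete, here is a small sketch that rebuilds only the score column from the question and prints f. It assumes, as stated above, that every section starts with exactly one non-missing score.

```r
# Score column from the question: one value per section header, NA for samples
score <- c(0.989003844, NA, NA, NA, NA, NA,
           0.988334014, NA, NA, NA,
           0.983461712, NA, NA,
           0.982339339, NA, NA, NA, NA)

# !is.na(score) is TRUE only on header rows, so the running sum
# increases by one at the start of each section
f <- cumsum(!is.na(score))
f
# [1] 1 1 1 1 1 1 2 2 2 2 3 3 3 4 4 4 4 4
```

Every row in the same section shares one value of f, which is what lets tapply count rows per section.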
Next, check the length of each section. In the output, the top row shows the values of the "names" attribute (the values of f) and the bottom row shows the lengths. These lengths include the "group*" header row, so subtract 1 in the final data frame.
tapply(f, f, length)
#1 2 3 4
#6 4 3 5
Create the result the question asks for.
result <- cbind(df[!is.na(df$score), ], length = tapply(f, f, length) - 1)
result
# grp gname score length
#1 group1 gene1 0.9890038 5
#7 group2 gene1 0.9883340 3
#11 group2 gene2 0.9834617 2
#14 group3 gene3 0.9823393 4
If you also want consecutive row names, reset them:
row.names(result) <- NULL
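The result above takes each score directly from the header row, which is enough for this data. As a hedged sketch of the more literal reading of the question, the max and the sample count can also be computed per section with tapply. The names sec, heads and res are illustrative, and stringsAsFactors = FALSE is an assumption made here to keep the columns as plain character vectors.

```r
# Rebuild the example data from the question
df <- data.frame(
  grp = c("group1", "s1", "s2", "s3", "s4", "s5", "group2", "s6", "s7", "s8",
          "group2", "s9", "s10", "group3", "s11", "s12", "s13", "s14"),
  gname = c("gene1", 0.00, 0.05, 0.01, 0.01, 0.01, "gene1", 0.063, 0.005, 0.015,
            "gene2", 0.07, 0.00, "gene3", 0.046, 0.007, 0.011, 0.012),
  score = c(0.989003844, NA, NA, NA, NA, NA, 0.988334014, NA, NA, NA,
            0.983461712, NA, NA, 0.982339339, NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Section id: increases by one at every row that carries a score
sec <- cumsum(!is.na(df$score))

# Header rows supply the grp and gname labels of each section
heads <- df[!is.na(df$score), c("grp", "gname")]

# Per-section max of score (ignoring NA) and number of sample rows (the NA rows)
res <- data.frame(
  heads,
  max.score = as.vector(tapply(df$score, sec, max, na.rm = TRUE)),
  length    = as.vector(tapply(df$score, sec, function(x) sum(is.na(x))))
)
rownames(res) <- NULL
res
```

Counting the NA rows gives the number of samples directly, so no subtraction of the header row is needed in this variant.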
An option with tidyverse
library(dplyr)
df %>%
group_by(grp1 = cumsum(grepl("group", grp))) %>%
mutate(length = n() -1) %>%
slice(1) %>%
ungroup %>%
select(-grp1)
# A tibble: 4 x 4
# grp gname score length
# <fct> <fct> <dbl> <dbl>
#1 group1 gene1 0.989 5
#2 group2 gene1 0.988 3
#3 group2 gene2 0.983 2
#4 group3 gene3 0.982 4
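The grouping key in this version marks the start of a section by the "group" prefix in grp rather than by a non-missing score. A minimal base R check, using only the two relevant columns from the question, suggests that both keys coincide on the example data. The names key_by_name and key_by_score are illustrative.

```r
# The two columns from the question that can mark the start of a section
grp <- c("group1", "s1", "s2", "s3", "s4", "s5", "group2", "s6", "s7", "s8",
         "group2", "s9", "s10", "group3", "s11", "s12", "s13", "s14")
score <- c(0.989003844, NA, NA, NA, NA, NA, 0.988334014, NA, NA, NA,
           0.983461712, NA, NA, 0.982339339, NA, NA, NA, NA)

# Key used in the dplyr pipeline: a new section at every "group*" row
key_by_name <- cumsum(grepl("group", grp))

# Key used in the base R answer: a new section at every non-missing score
key_by_score <- cumsum(!is.na(score))

# Check whether the two keys agree row by row
identical(key_by_name, key_by_score)
```

The name-based key does not depend on the score column being complete, but it does assume that sample names never contain the string "group".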