
Combine rows with partially duplicated information

I have some data frames that look like this:

df1 <- data.frame(chr=c(3,3,3,1,1),start=c(15,52,17,1,80),end=c(52,68,18,15,92),strand=c("+","+","+","-","-"),item=c("A","A","B","C","C"))
        chr start end strand item
1        3    15  52      +     A
2        3    52  68      +     A
3        3    17  18      +     B
4        1     1  15      -     C
5        1    80  92      -     C

Items A and C can have two or more different starts and ends, but the remaining columns are identical within each group. Is there a way to concatenate the start and end information like this?

        chr start   end strand item
1        3 15,52 52,68      +     A
2        3    17    18      +     B
3        1  1,80 15,92      -     C

Thanks for your help!

We can group by 'chr', 'strand', and 'item', then collapse the 'start' and 'end' values within each group using toString, which is equivalent to paste(., collapse=", ").

library(dplyr)
df1 %>%
    group_by(chr, strand, item) %>% 
    summarise(across(c(start, end), toString), .groups = 'drop') %>%
    arrange(item)

Output:

# A tibble: 3 x 5
#    chr strand item  start  end   
#  <dbl> <chr>  <chr> <chr>  <chr> 
#1     3 +      A     15, 52 52, 68
#2     3 +      B     17     18    
#3     1 -      C     1, 80  15, 92
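
Note that toString separates values with a comma and a space, whereas the desired output in the question has no space. A minimal sketch of the difference, using a helper named collapse_comma that is hypothetical and introduced here only for illustration:

```r
# toString() is a thin wrapper around paste(x, collapse = ", ")
x <- c(15, 52)
with_space <- toString(x)              # separator is ", "

# Hypothetical helper: same idea, but with a bare comma as the separator
collapse_comma <- function(v) paste(v, collapse = ",")
without_space <- collapse_comma(x)     # separator is ","

print(with_space)
print(without_space)
```

If the compact form is preferred in the dplyr pipeline, it should be possible to pass ~ paste(.x, collapse = ",") to across() in place of toString.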

Or, in base R, with aggregate:

aggregate(cbind(start, end) ~ ., df1, toString)
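
For reference, here is a self-contained sketch of the base R approach. It rebuilds the example data under the name df1 and, as an assumption about the preferred format, swaps toString for a paste call with a bare comma so the values match the question's desired output:

```r
# Example data from the question
df1 <- data.frame(
  chr    = c(3, 3, 3, 1, 1),
  start  = c(15, 52, 17, 1, 80),
  end    = c(52, 68, 18, 15, 92),
  strand = c("+", "+", "+", "-", "-"),
  item   = c("A", "A", "B", "C", "C")
)

# cbind(start, end) on the left-hand side selects the columns to collapse;
# the dot on the right-hand side means "group by every other column"
res <- aggregate(cbind(start, end) ~ ., data = df1,
                 FUN = function(v) paste(v, collapse = ","))

print(res)
```

The collapsed start and end columns are character vectors, so they would need to be split again (for example with strsplit) before any numeric work.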

