
R: grouping data by numeric ranges in a column

I am trying to group data by ranges of the numbers in a column. I have tried different versions of group_by, cut, group, etc., but I have not been able to get it to work. I have a lot of data that looks like this:

  position variants

     3      snv
     5      snv
    12      snv
    17      mnv
    22 deletion
    27      snv
    33      snv
    35      snv
    42      snv
    46      mnv
    50      snv
    53 deletion
    60      snv
    62      snv
    65      snv
    70      snv

The data can be reproduced with:

variants <- c(rep("snv", 3), "mnv", "deletion", rep("snv", 4), "mnv", "snv", "deletion", rep("snv", 4))
variants
position <- c(3, 5, 12, 17, 22, 27, 33, 35, 42, 46, 50, 53, 60, 62, 65, 70)
position
patient1 <- data.frame(position, variants)
patient1

I would like to be able to group the data something like this:

group  tally
1-10    2snv
11-20   1snv 1mnv
21-30   1deletion 1snv
31-40   2snv 
etc

so that I can run further downstream analysis. I would also like to be able to change the group width, for example to groups of 5 (1-5, 6-10, ...) or 2 (1-2, 3-4, ...). Thank you very much.

Here is a pure base R solution. You can, of course, replace the intermediate variables with the calls that create them:

variants <- c(rep("snv", 3),rep("mnv", 1),rep("deletion", 1),rep("snv", 4), "mnv", rep("snv"), "deletion", rep("snv", 4))
position = c(3, 5, 12, 17, 22, 27, 33, 35, 42, 46, 50, 53, 60, 62, 65, 70)
patient1 = data.frame(position, variants)

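# Round the top break up to a multiple of 10 so the largest position is always binned
breaks = seq(0, ceiling(max(position) / 10) * 10, by = 10)
labels = cut(position, breaks)   # right-closed bins: (0,10], (10,20], ...
groups = split(patient1, labels)
# For each bin, paste the counts and the variant names into one string
lapply(groups, function(x) {
  tb = table(x$variants)
  paste(tb, names(tb), collapse = ", ")
})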
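
The question also asks for an adjustable group width. A minimal sketch of one way to do that in base R follows. The helper name tally_by_bin and its width argument are hypothetical, introduced here for illustration, and the "1-10" style labels assume integer positions starting at 1.

```r
# Hypothetical helper (not from the original answers): tally variants per
# position bin of a chosen width.
tally_by_bin <- function(df, width = 10) {
  # Round the upper break up to the next multiple of `width` so the largest
  # position always falls inside the last bin.
  upper <- ceiling(max(df$position) / width) * width
  breaks <- seq(0, upper, by = width)
  # Labels such as "1-10", "11-20", ... (assumes integer positions >= 1)
  labels <- paste(head(breaks, -1) + 1, tail(breaks, -1), sep = "-")
  bins <- cut(df$position, breaks = breaks, labels = labels)
  # One string per non-empty bin, e.g. "1 mnv 1 snv"
  sapply(split(as.character(df$variants), bins, drop = TRUE), function(v) {
    tb <- table(v)
    paste(tb, names(tb), collapse = " ")
  })
}

position <- c(3, 5, 12, 17, 22, 27, 33, 35, 42, 46, 50, 53, 60, 62, 65, 70)
variants <- c(rep("snv", 3), "mnv", "deletion", rep("snv", 4), "mnv", "snv",
              "deletion", rep("snv", 4))
patient1 <- data.frame(position, variants)

tally_by_bin(patient1, width = 10)  # bins 1-10, 11-20, ...
tally_by_bin(patient1, width = 5)   # bins 1-5, 6-10, ...
```

The result is a named character vector with one entry per non-empty bin, so empty ranges are simply omitted.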

We can use the tidyverse to do a group-by operation. Create a grouping column of ranges with cut, count the rows for each combination of range and 'variants' with summarise, then paste the counts and variant names together in a second summarise:

library(dplyr)
patient1 %>% 
   # Breaks at 0, 10, ..., 100 give right-closed bins (0,10], (10,20], ..., i.e. 1-10, 11-20, ...
   group_by(group = cut(position, breaks = seq(0, 100, by = 10)), variants) %>%
   summarise(n = n()) %>%
   summarise(tally = paste(n, variants, collapse=' ', sep=""))

NOTE: Another option is findInterval, which bins values much like cut but returns a numeric index for each interval instead of a labelled factor.
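
To illustrate the difference, here is a small sketch on made-up positions (the values 3, 10, 11 and 20 are chosen only to show the boundary behaviour and are not from the question). By default cut uses right-closed intervals while findInterval uses left-closed ones, so values that sit exactly on a break are assigned differently unless left.open = TRUE is passed.

```r
# Illustrative positions, picked to sit on and around the breaks
position <- c(3, 10, 11, 20)
breaks <- seq(0, 20, by = 10)

# cut() returns a factor with interval labels; bins are right-closed by default
cut_labels <- as.character(cut(position, breaks))
cut_labels

# findInterval() returns integer indices; bins are left-closed by default,
# so 10 and 20 land in the next bin up
idx_default <- findInterval(position, breaks)
idx_default

# left.open = TRUE makes findInterval() use (0,10], (10,20] like cut()
idx_left_open <- findInterval(position, breaks, left.open = TRUE)
idx_left_open
```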

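In base R, you can create a group column using findInterval, making one group for every 10 positions. We can then use aggregate to combine the count of each variant with its name into one string per group. Note that findInterval uses left-closed intervals by default ([0,10), [10,20), ...), so a position that is an exact multiple of 10 falls into the next group up; this is why position 70 forms its own group 8 in the output below. Passing left.open = TRUE gives (0,10], (10,20], ... instead.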

patient1$group <- with(patient1, findInterval(position, (seq(0, max(position), 10))))

aggregate(variants~group, patient1, function(x) {
  tb <- table(x)
  paste(tb, names(tb), collapse = ' ')
})

#  group         variants
#1     1            2 snv
#2     2      1 mnv 1 snv
#3     3 1 deletion 1 snv
#4     4            2 snv
#5     5      1 mnv 1 snv
#6     6 1 deletion 1 snv
#7     7            3 snv
#8     8            1 snv
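
For downstream analysis, numeric counts may be easier to work with than pasted strings. One possible sketch, using only base R, cross-tabulates the bins against the variant types with table. The break range 0 to 70 is an assumption that fits this sample data and would need adjusting for other positions.

```r
position <- c(3, 5, 12, 17, 22, 27, 33, 35, 42, 46, 50, 53, 60, 62, 65, 70)
variants <- c(rep("snv", 3), "mnv", "deletion", rep("snv", 4), "mnv", "snv",
              "deletion", rep("snv", 4))

# Right-closed bins (0,10], (10,20], ..., (60,70]
bins <- cut(position, breaks = seq(0, 70, by = 10))

# Rows are bins, columns are variant types, cells are counts
counts <- table(bin = bins, variant = variants)
counts

# Individual counts can be looked up by bin label and variant name
counts["(40,50]", "snv"]

# Long format (one row per bin/variant pair) with a Freq column
counts_long <- as.data.frame(counts)
head(counts_long)
```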
