Hierarchical sorting of a table in R

I have a table which looks like this but with many more entries:

ID     Gene     Tier     Consequence   
1314   ABC      TIER1    missense  
1314   PKD1     TIER1    frameshift 
1314   PKD1     TIER1    stop_gain 
6245   BJD      TIER1    splice_site_variant 
7631   PKD2     TIER1    missense
7631   PKD2     TIER1    non_coding
5336   PKD1     TIER3    missense
1399   PKD1     TIER2    non_coding

I would like to keep one row per ID, chosen by a hierarchy of consequences: stop_gain > frameshift > splice_site_variant > missense > non_coding. In reality there are roughly 10 types of "Consequence" in a hierarchical order.

Desired outcome:

ID Gene Tier Consequence
1314 PKD1 TIER1 stop_gain
6245 BJD  TIER1 splice_site_variant
7631 PKD2 TIER1 missense
5336 PKD1 TIER3 missense
1399 PKD1 TIER2 non_coding

I thought about converting the consequences to numbers and using those, but was wondering whether there is a way of doing this with the text alone. I work on an HPC in an airlocked environment, so solutions using base R would be preferable.

Many thanks for your time

You can do this in base R and stick to using text if you convert Consequence to an ordered factor:

df$Consequence <- ordered(df$Consequence, 
                          levels = rev(c("stop_gain", "frameshift", 
                                         "splice_site_variant", 
                                         "missense", "non_coding")))

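To illustrate why the ordered factor is enough, here is a minimal sketch on a toy vector (the names `severity`, `x`, and `bad` are made up for this example). Because the levels are passed least severe first via `rev()`, the most severe consequence compares as the largest value. It also shows a pitfall worth checking for: a label missing from `levels` silently becomes `NA`.

```r
# Hierarchy written most severe first; rev() makes the most severe the highest level
severity <- c("stop_gain", "frameshift", "splice_site_variant",
              "missense", "non_coding")

# Toy vector, not the full table from the question
x <- ordered(c("missense", "stop_gain", "non_coding"), levels = rev(severity))

x[1] < x[2]             # TRUE: missense ranks below stop_gain
as.character(max(x))    # "stop_gain"

# A label that is not in `levels` becomes NA, so misspelled labels are worth checking
bad <- ordered("non_coding_mutation", levels = rev(severity))
is.na(bad)              # TRUE
```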
Then you can get the maximum in each group in a number of ways. For example, using the split-apply-bind approach:

do.call(rbind, lapply(split(df, df$ID), function(x) x[which.max(x$Consequence),]))
#>        ID Gene  Tier         Consequence
#> 1314 1314 PKD1 TIER1           stop_gain
#> 1399 1399 PKD1 TIER2          non_coding
#> 5336 5336 PKD1 TIER3            missense
#> 6245 6245  BJD TIER1 splice_site_variant
#> 7631 7631 PKD2 TIER1            missense
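
As a possible alternative that stays in base R and leaves Consequence as plain text, the sketch below ranks each row with `match()` against the hierarchy, sorts by ID and rank, and keeps the first row per ID with `duplicated()`. The names `severity`, `sev_rank`, `sorted`, and `best` are illustrative, and the data frame is rebuilt from the example in the question.

```r
# Example data from the question, with Consequence kept as character
df <- data.frame(
  ID = c(1314L, 1314L, 1314L, 6245L, 7631L, 7631L, 5336L, 1399L),
  Gene = c("ABC", "PKD1", "PKD1", "BJD", "PKD2", "PKD2", "PKD1", "PKD1"),
  Tier = c("TIER1", "TIER1", "TIER1", "TIER1", "TIER1", "TIER1", "TIER3", "TIER2"),
  Consequence = c("missense", "frameshift", "stop_gain", "splice_site_variant",
                  "missense", "non_coding", "missense", "non_coding"),
  stringsAsFactors = FALSE
)

# Hierarchy, most severe first (extend with the remaining consequence types)
severity <- c("stop_gain", "frameshift", "splice_site_variant",
              "missense", "non_coding")

# Position in the hierarchy: 1 = most severe; NA means the label is not listed
sev_rank <- match(df$Consequence, severity)
stopifnot(!anyNA(sev_rank))

# Sort by ID, then by severity, and keep the first row of each ID
sorted <- df[order(df$ID, sev_rank), ]
best <- sorted[!duplicated(sorted$ID), ]
best
```

Unlike a filter on the minimum rank, this always returns exactly one row per ID, even when two rows of the same ID share the most severe consequence.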

I would suggest using a factor to build a rank from your criteria and then using filter() from dplyr to subset. Here is the code. The logic is similar to @AllanCameron's solution, but I converted the factor to numeric and then filtered:

library(dplyr)
# Define the order, most severe first
vord <- c('stop_gain', 'frameshift', 'splice_site_variant', 'missense', 'non_coding')
# Convert Consequence to an ordered factor
df$Consequence <- factor(df$Consequence, levels = vord, ordered = TRUE)
# Numeric rank: 1 = most severe
df$Index <- as.numeric(df$Consequence)
# Keep the most severe row(s) per ID; rows tied on the minimum are all kept
df %>% group_by(ID) %>% filter(Index == min(Index)) %>% select(-Index)
 

Output:

# A tibble: 5 x 4
# Groups:   ID [5]
     ID Gene  Tier  Consequence        
  <int> <chr> <chr> <ord>              
1  1314 PKD1  TIER1 stop_gain          
2  6245 BJD   TIER1 splice_site_variant
3  7631 PKD2  TIER1 missense           
4  5336 PKD1  TIER3 missense           
5  1399 PKD1  TIER2 non_coding  
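
Since the question asks for base R, here is a rough base R counterpart of the grouped filter above, as a sketch rather than a drop-in replacement. It uses `ave()` to compute the best rank within each ID and, like the dplyr version, keeps rows in their original order. The names `severity`, `sev_rank`, `keep`, and `res` are illustrative, and the final `duplicated()` step is an added assumption so that ties yield a single row per ID.

```r
# Example data from the question, with Consequence kept as character
df <- data.frame(
  ID = c(1314L, 1314L, 1314L, 6245L, 7631L, 7631L, 5336L, 1399L),
  Gene = c("ABC", "PKD1", "PKD1", "BJD", "PKD2", "PKD2", "PKD1", "PKD1"),
  Tier = c("TIER1", "TIER1", "TIER1", "TIER1", "TIER1", "TIER1", "TIER3", "TIER2"),
  Consequence = c("missense", "frameshift", "stop_gain", "splice_site_variant",
                  "missense", "non_coding", "missense", "non_coding"),
  stringsAsFactors = FALSE
)

# Hierarchy, most severe first; rank 1 = most severe
severity <- c("stop_gain", "frameshift", "splice_site_variant",
              "missense", "non_coding")
sev_rank <- match(df$Consequence, severity)

# TRUE where a row holds the best (lowest) rank within its ID
keep <- sev_rank == ave(sev_rank, df$ID, FUN = min)

# Subset in original row order, then drop any tied rows after the first per ID
res <- df[keep, ]
res <- res[!duplicated(res$ID), ]
res
```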

Some data used:

#Data
df <- structure(list(ID = c(1314L, 1314L, 1314L, 6245L, 7631L, 7631L, 
5336L, 1399L), Gene = c("ABC", "PKD1", "PKD1", "BJD", "PKD2", 
"PKD2", "PKD1", "PKD1"), Tier = c("TIER1", "TIER1", "TIER1", 
"TIER1", "TIER1", "TIER1", "TIER3", "TIER2"), Consequence = structure(c(4L, 
2L, 1L, 3L, 4L, 5L, 4L, 5L), .Label = c("stop_gain", "frameshift", 
"splice_site_variant", "missense", "non_coding"), class = c("ordered", 
"factor")), Index = c(4, 2, 1, 3, 4, 5, 4, 5)), row.names = c(NA, 
-8L), class = "data.frame")
