I have a table which looks like this but with many more entries:
ID Gene Tier Consequence
1314 ABC TIER1 missense
1314 PKD1 TIER1 frameshift
1314 PKD1 TIER1 stop_gain
6245 BJD TIER1 splice_site_variant
7631 PKD2 TIER1 missense
7631 PKD2 TIER1 non_coding
5336 PKD1 TIER3 missense
1399 PKD1 TIER2 non_coding
I would like to keep one row per ID, chosen by the consequence hierarchy: stop_gain > frameshift > splice_site_variant > missense > non_coding. In reality there are roughly 10 types of "Consequence" in a hierarchical order.
Desired outcome:
ID Gene Tier Consequence
1314 PKD1 TIER1 stop_gain
6245 BJD TIER1 splice_site_variant
7631 PKD2 TIER1 missense
5336 PKD1 TIER3 missense
1399 PKD1 TIER2 non_coding
I thought about turning the consequences into numbers and then using those, but was wondering if there was a way of doing this with the text alone. I work on an HPC in an airlock environment, so solutions using base R would be preferable.
Many thanks for your time
You can do this in base R and stick to using text if you convert Consequence
to an ordered factor:
df$Consequence <- ordered(df$Consequence,
levels = rev(c("stop_gain", "frameshift",
"splice_site_variant",
"missense", "non_coding")))
Because the levels are reversed, the most severe consequence is the highest level, so you want the maximum in each group. You can get it in a number of ways. For example, using the split-apply-bind approach:
do.call(rbind, lapply(split(df, df$ID), function(x) x[which.max(x$Consequence),]))
#> ID Gene Tier Consequence
#> 1314 1314 PKD1 TIER1 stop_gain
#> 1399 1399 PKD1 TIER2 non_coding
#> 5336 5336 PKD1 TIER3 missense
#> 6245 6245 BJD TIER1 splice_site_variant
#> 7631 7631 PKD2 TIER1 missense
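If you would rather not convert the column at all, a minimal base R sketch is to rank the text with match() against the hierarchy, sort by ID and rank, and keep the first row per ID. The vector name severity and the rebuilt df below are illustrative assumptions, not part of the original post; the real hierarchy would list all of your roughly 10 consequence types.

```r
# Example data rebuilt from the question's table (Consequence kept as plain text)
df <- data.frame(
  ID = c(1314L, 1314L, 1314L, 6245L, 7631L, 7631L, 5336L, 1399L),
  Gene = c("ABC", "PKD1", "PKD1", "BJD", "PKD2", "PKD2", "PKD1", "PKD1"),
  Tier = c("TIER1", "TIER1", "TIER1", "TIER1", "TIER1", "TIER1", "TIER3", "TIER2"),
  Consequence = c("missense", "frameshift", "stop_gain", "splice_site_variant",
                  "missense", "non_coding", "missense", "non_coding"),
  stringsAsFactors = FALSE
)

# Hierarchy, most severe first (assumed name: severity)
severity <- c("stop_gain", "frameshift", "splice_site_variant",
              "missense", "non_coding")

# Position of each label in the hierarchy: 1 = most severe, NA = unknown label
rnk <- match(df$Consequence, severity)

# Sort by ID, then by rank (NA ranks sort last), and keep the first row per ID
sorted <- df[order(df$ID, rnk), ]
result <- sorted[!duplicated(sorted$ID), ]
result
```

Unlike the factor approach, this leaves Consequence as character. If two rows of an ID tie on the most severe consequence, only the first one in the sorted order is kept.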
I would suggest using a factor to build a rank based on your criteria and then using filter() from dplyr to subset. Here is the code. The logic is similar to @AllanCameron's solution, but I transformed the factor to numeric and then filtered:
library(dplyr)
#Data
#Define order
vord <- c('stop_gain','frameshift','splice_site_variant','missense','non_coding')
#Format data
df$Consequence <- factor(df$Consequence, levels = vord, ordered = TRUE)
#Compute index
df$Index <- as.numeric(df$Consequence)
#Filter
df %>% group_by(ID) %>% filter(Index==min(Index)) %>% select(-Index)
Output:
# A tibble: 5 x 4
# Groups: ID [5]
ID Gene Tier Consequence
<int> <chr> <chr> <ord>
1 1314 PKD1 TIER1 stop_gain
2 6245 BJD TIER1 splice_site_variant
3 7631 PKD2 TIER1 missense
4 5336 PKD1 TIER3 missense
5 1399 PKD1 TIER2 non_coding
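Since the question asks for base R, the same group-wise filter can be sketched without dplyr by using ave() to compute the per-ID minimum index. This is only a sketch under assumptions: the df is rebuilt here from the question's table so the block runs on its own, and idx and keep are illustrative names.

```r
# Example data rebuilt from the question's table
df <- data.frame(
  ID = c(1314L, 1314L, 1314L, 6245L, 7631L, 7631L, 5336L, 1399L),
  Gene = c("ABC", "PKD1", "PKD1", "BJD", "PKD2", "PKD2", "PKD1", "PKD1"),
  Tier = c("TIER1", "TIER1", "TIER1", "TIER1", "TIER1", "TIER1", "TIER3", "TIER2"),
  Consequence = c("missense", "frameshift", "stop_gain", "splice_site_variant",
                  "missense", "non_coding", "missense", "non_coding"),
  stringsAsFactors = FALSE
)

# Define order, most severe first
vord <- c("stop_gain", "frameshift", "splice_site_variant", "missense", "non_coding")

# Compute index: 1 = most severe
idx <- match(df$Consequence, vord)

# Keep rows whose index equals the minimum index within their ID;
# which() drops rows whose label is missing from vord
keep <- which(idx == ave(idx, df$ID, FUN = min))
base_result <- df[keep, ]
base_result
```

Like filter(Index == min(Index)), this keeps every row that ties for the most severe consequence within an ID, and it preserves the original row order.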
Some data used:
#Data
df <- structure(list(ID = c(1314L, 1314L, 1314L, 6245L, 7631L, 7631L,
5336L, 1399L), Gene = c("ABC", "PKD1", "PKD1", "BJD", "PKD2",
"PKD2", "PKD1", "PKD1"), Tier = c("TIER1", "TIER1", "TIER1",
"TIER1", "TIER1", "TIER1", "TIER3", "TIER2"), Consequence = structure(c(4L,
2L, 1L, 3L, 4L, 5L, 4L, 5L), .Label = c("stop_gain", "frameshift",
"splice_site_variant", "missense", "non_coding"), class = c("ordered",
"factor")), Index = c(4, 2, 1, 3, 4, 5, 4, 5)), row.names = c(NA,
-8L), class = "data.frame")