Keep rows in a data frame that share the same value in one column in R
Consider the following data frame:
Gene <- c("PNKD;TMBIM1", "PNKD", "PKHD1", "PKHD1", "SCN1A", "RBMX", "RBMX", "MUC4", "CASKIN;TRAF7", "CASKIN", "LIFR")
Score <- c(0.9, 0.2, 0.5, 0.6, 0.1, 0.985, 0.238, 0.65, 0.9, 0.66, 0.6)
df <- data.frame(Gene, Score)
df
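I want to select the rows of this data frame whose "Gene" column contains a string that also appears in another row. I want the following output: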
Gene <- c("PNKD;TMBIM1", "PNKD", "PKHD1", "PKHD1", "RBMX", "RBMX","CASKIN;TRAF7", "CASKIN")
Score <- c(0.9, 0.2, 0.5, 0.6, 0.985, 0.238, 0.9, 0.66)
df <- data.frame(Gene, Score)
df
Do you mean something like the following?
subset(
df,
grepl(
paste0(subset(data.frame(table(unlist(strsplit(Gene, ";")))), Freq > 1)$Var1, collapse = "|"),
Gene
)
)
This gives:
           Gene Score
1   PNKD;TMBIM1 0.900
2          PNKD 0.200
3         PKHD1 0.500
4         PKHD1 0.600
6          RBMX 0.985
7          RBMX 0.238
9  CASKIN;TRAF7 0.900
10       CASKIN 0.660
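One caveat with the `grepl()` approach is that it matches substrings, so a hypothetical symbol such as "PNKD2" would also match the pattern "PNKD". A minimal sketch of an exact-match variant in base R follows, assuming the symbols are always separated by ";" with no surrounding spaces. The names `tokens`, `counts`, `shared`, `keep` and `result` are illustrative only.

```r
# Rebuild the example data from the question
df <- data.frame(
  Gene = c("PNKD;TMBIM1", "PNKD", "PKHD1", "PKHD1", "SCN1A", "RBMX",
           "RBMX", "MUC4", "CASKIN;TRAF7", "CASKIN", "LIFR"),
  Score = c(0.9, 0.2, 0.5, 0.6, 0.1, 0.985, 0.238, 0.65, 0.9, 0.66, 0.6),
  stringsAsFactors = FALSE
)

# Split every entry into its individual gene symbols
tokens <- strsplit(df$Gene, ";", fixed = TRUE)

# Count in how many rows each symbol occurs
# (unique() keeps a symbol repeated within one row from being counted twice)
counts <- table(unlist(lapply(tokens, unique)))
shared <- names(counts)[counts > 1]

# Keep a row if at least one of its symbols also occurs in another row
keep <- vapply(tokens, function(tk) any(tk %in% shared), logical(1))
result <- df[keep, ]
result
```

On the example data this should select the same rows as the answer above. The difference only matters when one symbol is a prefix or substring of another.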
I don't think this is the best way to handle it, but using base R:
map <- unique(df[colSums(sapply(df[,1], function(x) grepl(x,df[,1])))>1,1])
do.call(rbind,lapply(map,function(x) df[grepl(x,df[,1]),]))
gives:
           Gene Score
1   PNKD;TMBIM1 0.900
2          PNKD 0.200
3         PKHD1 0.500
4         PKHD1 0.600
6          RBMX 0.985
7          RBMX 0.238
9  CASKIN;TRAF7 0.900
10       CASKIN 0.660
Using tidyverse
After numbering the rows, you can do the following:
library(tidyverse)
df$gene_num = seq.int(nrow(df))
df_keep <- df %>%
separate_rows(Gene, sep = ";") %>%
group_by(Gene) %>%
filter(n() > 1) %>%
pull(gene_num)
df[df_keep, c("Gene", "Score")]
Output:
           Gene Score
1   PNKD;TMBIM1 0.900
2          PNKD 0.200
3         PKHD1 0.500
4         PKHD1 0.600
6          RBMX 0.985
7          RBMX 0.238
9  CASKIN;TRAF7 0.900
10       CASKIN 0.660
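For readers without the tidyverse installed, the same steps (number the rows, expand to one symbol per row, count each symbol, keep the row numbers of repeated symbols) can be sketched in base R. This is only an approximation of what the pipeline does, not its actual implementation, and the names `parts`, `long`, `keep_rows` and `kept` are made up for illustration.

```r
# Rebuild the example data from the question
df <- data.frame(
  Gene = c("PNKD;TMBIM1", "PNKD", "PKHD1", "PKHD1", "SCN1A", "RBMX",
           "RBMX", "MUC4", "CASKIN;TRAF7", "CASKIN", "LIFR"),
  Score = c(0.9, 0.2, 0.5, 0.6, 0.1, 0.985, 0.238, 0.65, 0.9, 0.66, 0.6),
  stringsAsFactors = FALSE
)

# Expand to long format: one row per (original row number, symbol),
# roughly what separate_rows() produces after numbering the rows
parts <- strsplit(df$Gene, ";", fixed = TRUE)
long <- data.frame(
  gene_num = rep(seq_len(nrow(df)), lengths(parts)),
  Gene = unlist(parts),
  stringsAsFactors = FALSE
)

# Group size per symbol, analogous to group_by(Gene) followed by n()
long$n <- ave(seq_along(long$Gene), long$Gene, FUN = length)

# Original row numbers whose symbol occurs more than once
keep_rows <- unique(long$gene_num[long$n > 1])
kept <- df[keep_rows, c("Gene", "Score")]
kept
```

Like the pipeline above, this counts occurrences rather than distinct rows, so an entry that repeats one symbol (for example a hypothetical "A;A") would be kept on its own.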