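Remove duplicated rows based on two columns of type “character” in R
I have a data frame and want to remove rows that have duplicated strings in two columns (named "Up" and "Down"). Rows whose string value is duplicated in only one of the two columns should not be removed. Among the duplicated rows, I want to keep the one with the highest value in another column (named "Fold"). In addition, the "Name" column needs some character replacements, as shown below.
From this: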
ID Name Fold Up Down
1 mRNA_splicing(5) 3.2 a,b,c,d,e f,g,h,i
2 mRNA_processing(7) 3.1 a,b,c,d,e f,g,h,i
3 adherens_junctions(5) 2.6 k,l,m p,q,r,s,t,u
4 glucose_transport(4) 3.4 d,j,n o,p,v,w,z
5 hexose_transport(2) 3.5 d,j,n o,p,v,w,y,z
I would like to get this:
ID Name Fold Up Down
1 mRNA splicing 3.2 a,b,c,d,e f,g,h,i
2 adherens junctions 2.6 k,l,m p,q,r,s,t,u
3 glucose transport 3.4 d,j,n o,p,v,w,z
4 hexose transport 3.5 d,j,n o,p,v,w,y,z
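As for a function to remove the duplicated rows, neither duplicated nor unique seems to work for character columns, so what should I do here? I would appreciate an elegant solution.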
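A data.table solution:
library(data.table)
# df is the data frame defined in the Data section below
dt <- as.data.table(df)
# Within each (Up, Down) pair, keep the row(s) with the highest Fold
dt <- dt[dt[, .I[Fold == max(Fold)], by=list(Up, Down)]$V1]
# Strip the trailing "(n)" and replace underscores with spaces
dt[["Name"]] <- gsub("_", " ", sub("\\(.*?\\)$", "", dt[["Name"]]))
dt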
ID Name Fold Up Down
1: 1 mRNA splicing 3.2 a,b,c,d,e f,g,h,i
2: 3 adherens junctions 2.6 k,l,m p,q,r,s,t,u
3: 4 glucose transport 3.4 d,j,n o,p,v,w,z
4: 5 hexose transport 3.5 d,j,n o,p,v,w,y,z
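The regular expression used for the Name cleanup above can be checked in isolation. The following is a minimal base R sketch; the sample names are made up for illustration (they include a two-digit count and a name with no count at all). It also shows why trimming a fixed number of trailing characters is more fragile than matching the parenthesised suffix.

```r
# Hypothetical sample names, not taken from the question's data
nm <- c("mRNA_splicing(5)", "hexose_transport(12)", "no_count")

# Remove a trailing "(...)" if present, then turn underscores into spaces
clean <- gsub("_", " ", sub("\\(.*?\\)$", "", nm))
print(clean)
# "mRNA splicing" "hexose transport" "no count"

# Trimming exactly three characters only works for single-digit counts
fixed <- sub(".{3}$", "", nm)
print(fixed)
# "mRNA_splicing" "hexose_transport(" "no_co"
```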
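Using dplyr + stringr (edited to incorporate tmfmnk's suggestion):
library(dplyr)
library(stringr)
df %>%
  group_by(Up, Down) %>%
  # Keep the single row with the highest Fold in each (Up, Down) group
  slice(which.max(Fold)) %>%
  ungroup() %>%
  # Drop the "(n)" suffix, then replace underscores with spaces
  mutate(Name = str_replace_all(str_remove(Name, "\\(.*?\\)"), "_", " "))
Output:
# A tibble: 4 x 5
ID Name Fold Up Down
<int> <chr> <dbl> <chr> <chr>
1 1 mRNA splicing 3.2 a,b,c,d,e f,g,h,i
2 5 hexose transport 3.5 d,j,n o,p,v,w,y,z
3 4 glucose transport 3.4 d,j,n o,p,v,w,z
4 3 adherens junctions 2.6 k,l,m p,q,r,s,t,u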
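One difference between the answers is worth noting: selecting with which.max(Fold) keeps exactly one row per group, whereas a condition such as Fold == max(Fold) (as in the data.table answer) keeps every row tied for the maximum. The sketch below illustrates this with dplyr on a small made-up data frame (df_ties is a hypothetical name, and the tie is introduced on purpose; the question's data has no ties).

```r
library(dplyr)

# Hypothetical data: rows 1 and 2 share the same (Up, Down) pair AND the same Fold
df_ties <- data.frame(
  ID   = 1:3,
  Fold = c(3.2, 3.2, 2.6),
  Up   = c("a", "a", "k"),
  Down = c("f", "f", "p")
)

# which.max() returns the first maximum only, so one row per group survives
one_per_group <- df_ties %>%
  group_by(Up, Down) %>%
  slice(which.max(Fold)) %>%
  ungroup()

# Comparing against max() keeps all tied rows
all_ties <- df_ties %>%
  group_by(Up, Down) %>%
  filter(Fold == max(Fold)) %>%
  ungroup()

print(nrow(one_per_group))  # 2
print(nrow(all_ties))       # 3
```

If ties can occur in real data and only one row per pair is wanted, the which.max() form (or an explicit tie-breaking rule) is the safer choice.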
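A base R solution:
# Sort by Fold (highest first) so the first row of each (Up, Down) pair is the one to keep
df <- df[order(df$Fold, decreasing = TRUE), ]
# Drop rows whose (Up, Down) pair has already been seen
df <- df[!duplicated(df[c("Up", "Down")]), ]
# Strip the trailing "(n)" and replace underscores with spaces
df$Name <- gsub("_", " ", sub("\\(.*?\\)$", "", df$Name))
# Restore the original row order
df <- df[order(df$ID), ]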
Data
df <- read.table(text = "
ID Name Fold Up Down
1 mRNA_splicing(5) 3.2 a,b,c,d,e f,g,h,i
2 mRNA_processing(7) 3.1 a,b,c,d,e f,g,h,i
3 adherens_junctions(5) 2.6 k,l,m p,q,r,s,t,u
4 glucose_transport(4) 3.4 d,j,n o,p,v,w,z
5 hexose_transport(2) 3.5 d,j,n o,p,v,w,y,z
", header = TRUE)
df$Name <- as.character(df$Name)
Output
ID Name Fold Up Down
1 1 mRNA splicing 3.2 a,b,c,d,e f,g,h,i
3 3 adherens junctions 2.6 k,l,m p,q,r,s,t,u
4 4 glucose transport 3.4 d,j,n o,p,v,w,z
5 5 hexose transport 3.5 d,j,n o,p,v,w,y,z
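A note on the duplicate check itself: duplicated() works on character columns, but the test has to be applied to the (Up, Down) pair rather than to each column separately. Combining duplicated(Up) & duplicated(Down) can flag a row whose two values each appeared earlier in different rows, even though the pair is new. A minimal sketch with a made-up data frame (toy is a hypothetical name):

```r
# Hypothetical data: row 3 reuses "a" from row 1 and "y" from row 2,
# but the pair ("a", "y") has not occurred before
toy <- data.frame(
  Up   = c("a", "b", "a"),
  Down = c("x", "y", "y"),
  stringsAsFactors = FALSE
)

# Column-by-column check: wrongly marks row 3 as a duplicate
per_column <- duplicated(toy$Up) & duplicated(toy$Down)
print(per_column)  # FALSE FALSE  TRUE

# Pair-wise check: no row is a duplicate
per_pair <- duplicated(toy[c("Up", "Down")])
print(per_pair)    # FALSE FALSE FALSE
```

On the question's five rows both forms happen to give the same result, which is why the difference is easy to miss.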