R: Find string matches across multiple columns and choose the rightmost matching column
I have a problem that I can't find a solution to. Here is some example data:
df<-data.frame(ID1=c("A10","B73","B73","D20"),
ID2=c(NA,"B4","C05","D100"),
ID3=c(NA,"B20","C30","D41"),
ID4=c(NA,NA,"B40","D0"),
ID5=c(NA,NA,NA,"D10"),
Score=c(15,376,102,30))
>df
ID1 ID2 ID3 ID4 ID5 Score
1 A10 <NA> <NA> <NA> <NA> 15
2 B73 B4 B20 <NA> <NA> 376
3 B73 C05 C30 B40 <NA> 102
4 D20 D100 D41 D0 D10 30
I also have data with different ID numbers that match some of the IDs in df, together with a matching Score. It looks like this:
df_match<-data.frame(ID_Match=c("A10","B4","B20","E20","A355","D0","C30"),
Score_Match=c(30,55,200,120,113,23,98))
>df_match
ID_Match Score_Match
1 A10 30
2 B4 55
3 B20 200
4 E20 120
5 A355 113
6 D0 23
7 C30 98
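What I want is for R to search df for ID matches and, if there is a match, put the matching ID and Score in new columns. If a row contains multiple ID matches, the match from the rightmost column should be chosen. So it would look like this: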
> df_Final
ID1 ID2 ID3 ID4 ID5 Score ID_Match Score_Match
1 A10 <NA> <NA> <NA> <NA> 15 A10 30
2 B73 B4 B20 <NA> <NA> 376 B20 200
3 B73 C05 C30 B40 <NA> 102 C30 98
4 D20 D100 D41 D0 D10 30 D0 23
I found the following answers:
IDColumns <- 1:5
d <- df[,IDColumns] == "ID"
or
df$Check <- (rowSums(df[,startsWith(names(df),"ID")]=="ID") >= 1)
But most of the answers I found only search for matches on one specific string. Can someone help me?
A match matrix is a useful first step.
MX <- t(apply(df[, -6], 1, function(x) x %in% df_match$ID_Match))
# [,1] [,2] [,3] [,4] [,5]
# [1,] TRUE FALSE FALSE FALSE FALSE
# [2,] FALSE TRUE TRUE FALSE FALSE
# [3,] FALSE FALSE TRUE FALSE FALSE
# [4,] FALSE FALSE FALSE TRUE FALSE
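The same match matrix can probably be built without apply() and t(), since %in% is already vectorized. This is only a sketch: id_cols and ids are names introduced here for illustration, and it assumes every ID column starts with "ID".

```r
df <- data.frame(ID1 = c("A10", "B73", "B73", "D20"),
                 ID2 = c(NA, "B4", "C05", "D100"),
                 ID3 = c(NA, "B20", "C30", "D41"),
                 ID4 = c(NA, NA, "B40", "D0"),
                 ID5 = c(NA, NA, NA, "D10"),
                 Score = c(15, 376, 102, 30))
df_match <- data.frame(ID_Match = c("A10", "B4", "B20", "E20", "A355", "D0", "C30"),
                       Score_Match = c(30, 55, 200, 120, 113, 23, 98))

# Assumption: the ID columns are exactly those whose name starts with "ID"
id_cols <- grep("^ID", names(df), value = TRUE)
ids <- as.matrix(df[id_cols])

# %in% flattens the matrix column by column; matrix() restores the shape
MX <- matrix(ids %in% df_match$ID_Match,
             nrow = nrow(ids),
             dimnames = list(NULL, id_cols))
MX
```

Selecting the columns by name also avoids relying on Score being column 6.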
Now we want the "rightmost column". sum() tells us how many matches a row has, and tail(which(x), 1) picks the last of them.
idx <- apply(MX, 1, function(x) {
if (sum(x) > 1)
tail(which(x == TRUE), 1)
else if (sum(x) == 1)
which(x == TRUE)
else NA
})
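As an alternative to the apply() loop, base R's max.col() with ties.method = "last" should give the same index in one call. This is a sketch on a hand-written match matrix; the fifth row, with no match at all, is added here only to show that such rows need explicit handling.

```r
# Logical match matrix: rows 1-4 as above, row 5 is a made-up no-match row
MX <- rbind(c(TRUE,  FALSE, FALSE, FALSE, FALSE),
            c(FALSE, TRUE,  TRUE,  FALSE, FALSE),
            c(FALSE, FALSE, TRUE,  FALSE, FALSE),
            c(FALSE, FALSE, FALSE, TRUE,  FALSE),
            c(FALSE, FALSE, FALSE, FALSE, FALSE))

# MX + 0 turns TRUE/FALSE into 1/0; the last maximum per row is the rightmost TRUE
idx <- max.col(MX + 0, ties.method = "last")

# A row with no TRUE is an all-zero tie, so max.col() would report the last column
idx[rowSums(MX) == 0] <- NA
idx
```

Without the rowSums() guard, a row with no match would silently point at the last column.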
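Finally, cbind() the corresponding rows of df_match. The lookup uses match(), which keeps the row order of df, whereas %in% would return the hits in df_match order.
ids <- as.matrix(df[, -6])[cbind(seq_len(nrow(df)), idx)]
res <- cbind(df, df_match[match(ids, df_match$ID_Match), ])
rownames(res) <- NULL
Result
> res
ID1 ID2 ID3 ID4 ID5 Score ID_Match Score_Match
1 A10 <NA> <NA> <NA> <NA> 15 A10 30
2 B73 B4 B20 <NA> <NA> 376 B20 200
3 B73 C05 C30 B40 <NA> 102 C30 98
4 D20 D100 D41 D0 D10 30 D0 23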
I'm not sure this works in every case, but maybe it still helps.
df<-data.frame(ID1=c("A10","B73","B73","D20"),
ID2=c(NA,"B4","C05","D100"),
ID3=c(NA,"B20","C30","D41"),
ID4=c(NA,NA,"B40","D0"),
ID5=c(NA,NA,NA,"D10"),
Score=c(15,376,102,30))
df_match<-data.frame(ID_Match=c("A10","B4","B20","E20","A355","D0","C30"),
Score_Match=c(30,55,200,120,113,23,98))
library(dplyr) # provides %>%, filter(), group_by() and arrange()
# create backup for the results
df2 = df
# create a dummy column as an "ID" for each row
df$rownumber = 1:NROW(df)
# convert the ID columns to long format and drop all IDs that are NA
df = reshape2::melt(df, measure.vars = grep("^ID", names(df), value = TRUE), id.vars = "rownumber", na.rm = TRUE)
# find the matching scores for all remaining IDs
df = merge(df, df_match, by.x = "value", by.y = "ID_Match", all.x = TRUE)
# remove all IDs that have no match in df_match
df = df %>% filter(!is.na(Score_Match))
# remove the substring "ID" from each ID variable, so we can use it as a numeric
df$variable = as.numeric(gsub("ID", "", as.character(df$variable)))
# now select the rightmost ID: the one with the highest ID<Number>
df = df %>% group_by(rownumber) %>% filter(variable == max(variable)) %>% arrange(rownumber)
# attach the data to the original data frame
df2$ID_Match = df$value
df2$Score_Match = df$Score_Match
df2
# > df2
# ID1 ID2 ID3 ID4 ID5 Score ID_Match Score_Match
# 1 A10 <NA> <NA> <NA> <NA> 15 A10 30
# 2 B73 B4 B20 <NA> <NA> 376 B20 200
# 3 B73 C05 C30 B40 <NA> 102 C30 98
# 4 D20 D100 D41 D0 D10 30 D0 23
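This may cause trouble if there are rows in which none of the IDs match. In that case, adding df2$rownumber = 1:NROW(df2) and matching df with df2 by rownumber, instead of attaching the columns directly, might help (I hope :))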
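The row-number join suggested above might look like the following base R sketch. The best data frame is made up for illustration and stands in for the per-row result of the pipeline; row 2 deliberately has no match.

```r
# Original data with an explicit row number as the join key
df2 <- data.frame(ID1 = c("A10", "X1", "B73"),
                  ID2 = c(NA, "X2", "C30"),
                  Score = c(15, 7, 102))
df2$rownumber <- seq_len(nrow(df2))

# Hypothetical per-row result: one line per row that had a match (row 2 had none)
best <- data.frame(rownumber = c(1L, 3L),
                   ID_Match = c("A10", "C30"),
                   Score_Match = c(30, 98))

# Left join on rownumber: rows without a match get NA instead of shifted values
out <- merge(df2, best, by = "rownumber", all.x = TRUE)
out <- out[order(out$rownumber), ]
out$rownumber <- NULL
out
```

With all.x = TRUE every row of df2 is kept, so the result has the same number of rows as the original data even when some rows have no match.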