
R: Find string matches across multiple columns, and choose the rightmost column match

I have a problem that I cannot find a solution to. Here is some sample data:

df<-data.frame(ID1=c("A10","B73","B73","D20"),
               ID2=c(NA,"B4","C05","D100"),
               ID3=c(NA,"B20","C30","D41"),
               ID4=c(NA,NA,"B40","D0"),
               ID5=c(NA,NA,NA,"D10"),
               Score=c(15,376,102,30))
>df
  ID1  ID2  ID3  ID4  ID5 Score
1 A10 <NA> <NA> <NA> <NA>    15
2 B73   B4  B20 <NA> <NA>   376
3 B73  C05  C30  B40 <NA>   102
4 D20 D100  D41   D0  D10    30

I also have data with different ID numbers. Some of them match IDs in df, and each comes with its own Score. It looks like this:

df_match<-data.frame(ID_Match=c("A10","B4","B20","E20","A355","D0","C30"),
               Score_Match=c(30,55,200,120,113,23,98))
>df_match
  ID_Match Score_Match
1      A10          30
2       B4          55
3      B20         200
4      E20         120
5     A355         113
6       D0          23
7      C30          98

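What I want is for R to search df for ID matches and, if there is a match, put the matched ID and Score in new columns. If a row contains several ID matches, the match from the rightmost column should be chosen. So it would look like this: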

> df_Final
  ID1  ID2  ID3  ID4  ID5 Score ID_Match Score_Match
1 A10 <NA> <NA> <NA> <NA>    15      A10          30
2 B73   B4  B20 <NA> <NA>   376      B20         200
3 B73  C05  C30  B40 <NA>   102      C30          98
4 D20 D100  D41   D0  D10    30       D0          23

I have found answers such as:

IDColumns <- 1:5
d <- df[,IDColumns] == "ID"

or

df$Check <- (rowSums(df[,startsWith(names(df),"ID")]=="ID") >= 1)

But most of the answers I found only search for matches of one specific string. Can anybody help me?

A match matrix is useful as a first step.

MX <- t(apply(df[, -6], 1, function(x) x %in% df_match$ID_Match))

#       [,1]  [,2]  [,3]  [,4]  [,5]
# [1,]  TRUE FALSE FALSE FALSE FALSE
# [2,] FALSE  TRUE  TRUE FALSE FALSE
# [3,] FALSE FALSE  TRUE FALSE FALSE
# [4,] FALSE FALSE FALSE  TRUE FALSE
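The same match matrix can also be built column by column, which avoids the row-wise apply() and the transpose. The following is only a minimal sketch: it assumes the ID columns are exactly those whose names start with "ID", and the names id_cols and MX_cols are introduced here for illustration.

```r
df <- data.frame(ID1 = c("A10", "B73", "B73", "D20"),
                 ID2 = c(NA, "B4", "C05", "D100"),
                 ID3 = c(NA, "B20", "C30", "D41"),
                 ID4 = c(NA, NA, "B40", "D0"),
                 ID5 = c(NA, NA, NA, "D10"),
                 Score = c(15, 376, 102, 30))
df_match <- data.frame(ID_Match = c("A10", "B4", "B20", "E20", "A355", "D0", "C30"),
                       Score_Match = c(30, 55, 200, 120, 113, 23, 98))

# assumption: every ID column name starts with "ID"
id_cols <- grep("^ID", names(df), value = TRUE)

# test each ID column against the lookup IDs; NA entries give FALSE
MX_cols <- sapply(df[id_cols], function(col) col %in% df_match$ID_Match)
MX_cols
```

Unlike MX, this matrix keeps the column names ID1 to ID5, which can make the later steps easier to read.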

Now we want the "rightmost column". We can use sum() to check how many matches each row has:

idx <- apply(MX, 1, function(x) {
  if (sum(x) > 1)
    tail(which(x == TRUE), 1)
  else if (sum(x) == 1)
    which(x == TRUE)
  else NA
})
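As a possible shortcut, base R's max.col() with ties.method = "last" picks the rightmost TRUE of each row directly. This is a sketch of an alternative, not the approach used in this answer. The matrix is typed in by hand from the output shown above, and idx2 is a name chosen here for illustration.

```r
# logical match matrix, copied from the MX output above
MX <- rbind(c(TRUE,  FALSE, FALSE, FALSE, FALSE),
            c(FALSE, TRUE,  TRUE,  FALSE, FALSE),
            c(FALSE, FALSE, TRUE,  FALSE, FALSE),
            c(FALSE, FALSE, FALSE, TRUE,  FALSE))

# max.col() returns the column holding the row maximum; with
# ties.method = "last" the rightmost TRUE wins if a row has several matches
idx2 <- max.col(MX, ties.method = "last")

# a row without any TRUE consists only of ties, so mark it as NA explicitly
idx2[rowSums(MX) == 0] <- NA
idx2
```

The explicit NA step matters: without it, a row with no match at all would silently report the last column.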

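Finally, look up each row's selected ID in df_match with match(), which keeps the rows in the order of df, and cbind() the corresponding rows.

hit <- as.matrix(df[, -6])[cbind(seq_len(nrow(df)), idx)]  # ID in the chosen column of each row
res <- cbind(df, df_match[match(hit, df_match$ID_Match), ])
rownames(res) <- NULL

Result

> res
  ID1  ID2  ID3  ID4  ID5 Score ID_Match Score_Match
1 A10 <NA> <NA> <NA> <NA>    15      A10          30
2 B73   B4  B20 <NA> <NA>   376      B20         200
3 B73  C05  C30  B40 <NA>   102      C30          98
4 D20 D100  D41   D0  D10    30       D0          23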

I'm not sure this works in every case, but maybe it helps anyway:

df<-data.frame(ID1=c("A10","B73","B73","D20"),
               ID2=c(NA,"B4","C05","D100"),
               ID3=c(NA,"B20","C30","D41"),
               ID4=c(NA,NA,"B40","D0"),
               ID5=c(NA,NA,NA,"D10"),
               Score=c(15,376,102,30))


df_match<-data.frame(ID_Match=c("A10","B4","B20","E20","A355","D0","C30"),
                     Score_Match=c(30,55,200,120,113,23,98))

library(dplyr)  # provides %>%, arrange(), filter() and group_by()

# keep an untouched copy for the results
df2 = df

# add a dummy column that identifies each row
df$rownumber = 1:NROW(df)

# convert the ID columns to long format and drop all IDs that are NA
df = reshape2::melt(df, measure.vars = grep("^ID", names(df), value = TRUE), id.vars = "rownumber", na.rm = TRUE)
df %>% arrange(rownumber)  # printed for inspection only

# find the matching scores for all remaining IDs
df = merge(df, df_match, by.x = "value", by.y = "ID_Match", all.x = TRUE)
# remove all IDs that have no match in df_match
df = df %>% filter(!is.na(Score_Match))
# strip the prefix "ID" from the column names so they can be compared as numbers
df$variable = as.numeric(sub("^ID", "", as.character(df$variable)))

# now select the rightmost ID per row: the one with the highest ID<Number>
df = df %>% group_by(rownumber) %>% filter(variable == max(variable)) %>% arrange(rownumber)

# attach the data to the original table (assumes every row has a match)
df2$ID_Match = df$value
df2$score_Match = df$Score_Match
df2

# > df2
#   ID1  ID2  ID3  ID4  ID5 Score ID_Match score_Match
# 1 A10 <NA> <NA> <NA> <NA>    15      A10          30
# 2 B73   B4  B20 <NA> <NA>   376      B20         200
# 3 B73  C05  C30  B40 <NA>   102      C30          98
# 4 D20 D100  D41   D0  D10    30       D0          23

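This can cause trouble if there are rows in which none of the IDs has a match. In that case it may help to add df2$rownumber = 1:NROW(df2) and to match df to df2 via rownumber instead of attaching the columns directly (I hope :))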
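The rownumber idea can be sketched as follows. This version uses base R only, rather than reshape2 and dplyr, so it is an illustration of the suggestion and not the code from this answer. The fifth row (ID1 "Z1", with no match in df_match) is a made-up addition to show the problem case, and the names id_cols, long, best and res are likewise chosen here for illustration.

```r
# sample data plus one hypothetical row ("Z1") that has no match at all
df <- data.frame(ID1 = c("A10", "B73", "B73", "D20", "Z1"),
                 ID2 = c(NA, "B4", "C05", "D100", NA),
                 ID3 = c(NA, "B20", "C30", "D41", NA),
                 ID4 = c(NA, NA, "B40", "D0", NA),
                 ID5 = c(NA, NA, NA, "D10", NA),
                 Score = c(15, 376, 102, 30, 5))
df_match <- data.frame(ID_Match = c("A10", "B4", "B20", "E20", "A355", "D0", "C30"),
                       Score_Match = c(30, 55, 200, 120, 113, 23, 98))

id_cols <- grep("^ID", names(df), value = TRUE)
df$rownumber <- seq_len(nrow(df))

# long format: one row per (rownumber, column position, ID value), NAs dropped
long <- data.frame(rownumber = rep(df$rownumber, times = length(id_cols)),
                   position  = rep(seq_along(id_cols), each = nrow(df)),
                   value     = unlist(df[id_cols], use.names = FALSE))
long <- long[!is.na(long$value), ]

# inner join with the lookup table keeps only IDs that have a match
long <- merge(long, df_match, by.x = "value", by.y = "ID_Match")

# per rownumber, keep the match from the rightmost column
long <- long[order(long$rownumber, -long$position), ]
best <- long[!duplicated(long$rownumber), ]
names(best)[names(best) == "value"] <- "ID_Match"

# left join by rownumber: rows without a match get NA instead of shifting results
res <- merge(df, best[c("rownumber", "ID_Match", "Score_Match")],
             by = "rownumber", all.x = TRUE)
res <- res[order(res$rownumber), ]
res$rownumber <- NULL
res
```

Because the results are joined by rownumber instead of being attached by position, the unmatched fifth row ends up with NA in ID_Match and Score_Match while the other rows keep their correct values.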

