Finding Matches Across Char Vectors in R
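Given the following two vectors, is there a way to generate the desired data frame? This represents a real-world situation in which I have two data frames: the first contains a column of database values (keys), and the second contains a column of 1000+ rows, each a file name (potentials) that I need to match. The problem is that multiple files (potentials) can match any given key. I have tried grep, merge, inner joins, etc., but could not combine them into one solution. Any advice is appreciated!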
potentials <- c("tigerINTHENIGHT",
"tigerWALKINGALONE",
"bearOHMY",
"bearWITHME",
"rat",
"imatchnothing")
keys <- c("tiger",
"bear",
"rat")
desired <- data.frame(keys, c("tigerINTHENIGHT, tigerWALKINGALONE", "bearOHMY, bearWITHME", "rat"))
names(desired) <- c("key", "matches")
Pseudocode for the solution I have in mind:
#new column which is comma separated potentials
# x being the substring length i.e. x = 4 means true if first 4 letters match
function createNewColumn(keys, potentials, x){
str result = na
foreach(key in keys){
if(substring(key, 0, x) == any(substring(potentials, 0 ,x))){ //search entire potential vector
result += potential that matched + ', '
}
}
return new column with result as the value on the current row
}
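For reference, the pseudocode above could be sketched in R roughly as follows. This is only an illustration under stated assumptions: match_by_prefix is a hypothetical name, "match" is taken to mean that the first x characters are identical, and substr is used with R's 1-based indexing instead of the 0-based indexing in the pseudocode.

```r
potentials <- c("tigerINTHENIGHT", "tigerWALKINGALONE",
                "bearOHMY", "bearWITHME", "rat", "imatchnothing")
keys <- c("tiger", "bear", "rat")

# Hypothetical helper: pair each key with every potential that shares
# its first x characters.
match_by_prefix <- function(keys, potentials, x) {
  key_prefix <- substr(keys, 1, x)        # R strings are 1-indexed
  pot_prefix <- substr(potentials, 1, x)
  matches <- vapply(key_prefix, function(p) {
    # Keep the potentials whose prefix equals this key's prefix
    paste(potentials[pot_prefix == p], collapse = ", ")
  }, FUN.VALUE = character(1), USE.NAMES = FALSE)
  data.frame(key = keys, matches = matches, stringsAsFactors = FALSE)
}

match_by_prefix(keys, potentials, x = 3)
#>     key                            matches
#> 1 tiger tigerINTHENIGHT, tigerWALKINGALONE
#> 2  bear               bearOHMY, bearWITHME
#> 3   rat                                rat
```

Note that if x is longer than a key, the comparison uses the whole key on one side and x characters on the other, so a potential such as "ratXYZ" would not match the key "rat" with x = 4. Keys with no match get an empty string.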
We can write a small function that extracts the matches and loops over the keys:
return_matches <- function(keys, potentials, fixed = TRUE) {
vapply(keys, function(k) {
paste(grep(k, potentials, value = TRUE, fixed = fixed), collapse = ", ")
}, FUN.VALUE = character(1))
}
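vapply is just a type-safe version of sapply: because of FUN.VALUE = character(1), it will only ever return a character vector. When you set fixed = TRUE, the function runs faster but no longer recognizes regular expressions. Then we can easily build the desired data.frame: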
df <- data.frame(
key = keys,
matches = return_matches(keys, potentials),
stringsAsFactors = FALSE
)
df
#> key matches
#> tiger tiger tigerINTHENIGHT, tigerWALKINGALONE
#> bear bear bearOHMY, bearWITHME
#> rat rat rat
The only reason for putting the loop in a function rather than running it directly is to make the code look cleaner.
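To illustrate what fixed = TRUE changes, here is a small sketch. The file names are made up for the example; only the grep arguments come from the answer above.

```r
files <- c("a.b_report", "aXb_report")   # hypothetical file names

# fixed = TRUE treats the pattern as a literal string, so "." must be a dot
grep("a.b", files, value = TRUE, fixed = TRUE)
#> [1] "a.b_report"

# Default regex matching: "." matches any single character
grep("a.b", files, value = TRUE)
#> [1] "a.b_report" "aXb_report"
```

If the keys may contain characters such as ".", "(" or "+", literal matching is probably the safer default.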
You can iterate using grep:
> Match <- sapply(keys, function(item) {
paste0(grep(item, potentials, value = TRUE), collapse = ", ")
} )
> data.frame(keys, Match, row.names = NULL)
keys Match
1 tiger tigerINTHENIGHT, tigerWALKINGALONE
2 bear bearOHMY, bearWITHME
3 rat rat
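One caveat worth keeping in mind: grep finds the key anywhere in the string, not only at the start. If only prefix matches are wanted, base R's startsWith (available since R 3.3.0) may be a better fit. A minimal sketch with made-up file names:

```r
candidates <- c("rat", "pirate", "ratatouille")   # hypothetical file names

# grep matches the key anywhere in the string
grep("rat", candidates, value = TRUE, fixed = TRUE)
#> [1] "rat"         "pirate"      "ratatouille"

# startsWith keeps only strings that begin with the key
candidates[startsWith(candidates, "rat")]
#> [1] "rat"         "ratatouille"
```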