Count the number of matching words in a phrase
I have two big lists of phrases. For each phrase in the first list, I need to compute the percentage of its words that also occur in each phrase of the other list, and pick the best match from that list.
A <- data.frame(name = c(
  "X-ray right leg arteries",
  "x-ray left shoulder",
  "x-ray leg arteries",
  "x-ray leg with 20km distance"
), stringsAsFactors = FALSE)
B <- data.frame(name = c(
  "X-ray left leg arteries",
  "X-ray leg",
  "xray right leg",
  "X-ray right leg arteries"
), stringsAsFactors = FALSE)
# Strip punctuation, lowercase, and split into a single flat vector of words
fuzzy_prep_words <- function(words) {
  words <- unlist(strsplit(tolower(gsub("[[:punct:]]", "", words)), "\\W+"))
  return(words)
}
fuzzy_prep_words(A$name)
fuzzy_prep_words(B$name)
I am able to extract the words from the lists, but I am not able to calculate the number and proportion of words matched in the other list.
"X-ray right leg arteries" has an exact match in B, so it should return two columns: Match = "X-ray right leg arteries" and Distance = 100%. For the second phrase, "x-ray left shoulder", it should return the match "X-ray left leg arteries" and a distance of 66.67%, since 2 of the 3 words in "x-ray left shoulder" are matched. For the third phrase, it should return either "X-ray left leg arteries" or "X-ray right leg arteries".
I have already explored string distance algorithms such as LV, COSINE, and LCS, and I don't want to use them because the phrases in my real dataset are long.
How about something like this?
m <- lapply(strsplit(tolower(gsub("[[:punct:]]", "", A$name)), " "), function(w1)
  do.call(rbind.data.frame, lapply(strsplit(tolower(gsub("[[:punct:]]", "", B$name)), " "), function(w2) {
    cbind.data.frame(
      matches_string_from_B = paste(w2, collapse = " "),
      percentage = sum(w1 %in% w2) / length(w1) * 100)
  }))
)
names(m) <- tolower(gsub("[[:punct:]]", "", A$name))
m
$`xray right leg arteries`
matches_string_from_B percentage
1 xray left leg arteries 75
2 xray leg 50
3 xray right leg 75
4 xray right leg arteries 100
$`xray left shoulder`
matches_string_from_B percentage
1 xray left leg arteries 66.66667
2 xray leg 33.33333
3 xray right leg 33.33333
4 xray right leg arteries 33.33333
$`xray leg arteries`
matches_string_from_B percentage
1 xray left leg arteries 100.00000
2 xray leg 66.66667
3 xray right leg 66.66667
4 xray right leg arteries 100.00000
$`xray leg with 20km distance`
matches_string_from_B percentage
1 xray left leg arteries 40
2 xray leg 40
3 xray right leg 40
4 xray right leg arteries 40
Explanation: Split the entries from A$name into words, calculate the percentage of those words that match the words of each split entry from B$name, and store the results in a list of data frames. Use tolower and gsub("[[:punct:]]", "", ...) to make matching case-insensitive and to ignore punctuation characters.
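The same calculation can also be sketched with each list tokenized only once, which may help when both lists are large. This is a minimal sketch under that assumption, not the code used above; the names `prep_words`, `words_A`, `words_B`, and `pct` are made up here for illustration.

```r
# Hypothetical helper: lowercase, drop punctuation, split on whitespace.
prep_words <- function(x) {
  strsplit(tolower(gsub("[[:punct:]]", "", x)), "\\s+")
}

A <- data.frame(name = c("X-ray right leg arteries", "x-ray left shoulder",
                         "x-ray leg arteries", "x-ray leg with 20km distance"),
                stringsAsFactors = FALSE)
B <- data.frame(name = c("X-ray left leg arteries", "X-ray leg",
                         "xray right leg", "X-ray right leg arteries"),
                stringsAsFactors = FALSE)

words_A <- prep_words(A$name)
words_B <- prep_words(B$name)  # tokenized once, reused for every phrase in A

# Rows are phrases from A, columns are phrases from B.
pct <- matrix(NA_real_, nrow = length(words_A), ncol = length(words_B),
              dimnames = list(A$name, B$name))
for (i in seq_along(words_A)) {
  for (j in seq_along(words_B)) {
    # Share of the words in A's phrase that also occur in B's phrase
    pct[i, j] <- sum(words_A[[i]] %in% words_B[[j]]) / length(words_A[[i]]) * 100
  }
}
round(pct, 2)
```

Each row of `pct` holds the same percentages as the corresponding data frame in `m`, so the best match for a phrase is the column with the largest value in its row.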
To get the best match (percentage-wise) for each phrase, you can do:
do.call(rbind.data.frame, lapply(m, function(x) x[which.max(x$percentage), ]))
# matches_string_from_B percentage
#xray right leg arteries xray right leg arteries 100.00000
#xray left shoulder xray left leg arteries 66.66667
#xray leg arteries xray left leg arteries 100.00000
#xray leg with 20km distance xray left leg arteries 40.00000
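Note that which.max returns only the first maximum, so when several phrases tie (as for "xray leg arteries"), only the first one is reported. If all tied matches are wanted, something along these lines might work. It is a sketch on a hand-built data frame that mimics one element of m; `scores`, `first_best`, and `all_best` are illustrative names.

```r
# Hand-built stand-in for m[["xray leg arteries"]] (values from the output above)
scores <- data.frame(
  matches_string_from_B = c("xray left leg arteries", "xray leg",
                            "xray right leg", "xray right leg arteries"),
  percentage = c(100, 200 / 3, 200 / 3, 100),
  stringsAsFactors = FALSE
)

# which.max keeps only the first row that reaches the maximum
first_best <- scores[which.max(scores$percentage), ]

# Comparing against the maximum keeps every tied row
all_best <- scores[scores$percentage == max(scores$percentage), ]
all_best
```

The same filter could be applied to every element of m with lapply if all ties should be kept for each phrase.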