How to run through a list of keyword vectors and fuzzy match them to a different file (R)
I have two files, one full of keywords (about 2,000 rows) and one full of text (about 770,000 rows). The keyword file looks like this:
Event Name             Keyword
All-day tabby fest     tabby, all-day
All-day tabby fest     tabby, fest
Maine Coon Grooming    maine coon, groom
Maine Coon Grooming    coon, groom
keywordFile <- tibble(EventName = c("All-day tabby fest", "All-day tabby fest", "Maine Coon Grooming", "Maine Coon Grooming"), Keyword = c("tabby, all-day", "tabby, fest", "maine coon, groom", "coon, groom"))
The text file looks like this:
Description
Bring your tabby to the fest on Tuesday
All cats are welcome to the fest on Tuesday
Mainecoon grooming will happen at noon Wednesday
Maine coons will be pampered at noon on Wednesday
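text <- tibble(Description = c("Bring your tabby to the fest on Tuesday", "All cats are welcome to the fest on Tuesday", "Mainecoon grooming will happen at noon Wednesday", "Maine coons will be pampered at noon on Wednesday"))
What I want is to go through the text file and look for fuzzy matches (a match must include every word in the "Keyword" column) and return a new column showing TRUE or FALSE. If it is TRUE, I then want a third column showing the event name. So it would look like: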
Description                                          Match?   Event Name
Bring your tabby to the fest on Tuesday              TRUE     All-day tabby fest
All cats are welcome to the fest on Tuesday          FALSE
Mainecoon grooming will happen at noon Wednesday     TRUE     Maine Coon Grooming
Maine coons will be pampered at noon on Wednesday    FALSE
Thanks to Molx (How can I check if multiple strings exist in another string?), I have the following:
str <- c("tabby", "all-day")
myStr <- "Bring your tabby to the fest on Tuesday"
all(sapply(str, grepl, myStr))
However, I run into trouble when I try to fuzzy match across the whole file. I tried something like this:
for (i in seq_along(text$Description)){
  for (j in seq_along(keywordFile$EventName)) {
    # below I am creating the TRUE/FALSE column
    text$TF[i] <- all(sapply(keywordFile$Keyword[j], grepl,
                             text$Description[i]))
    if (isTRUE(text$TF))
      # below I am creating the EventName column
      text$EventName <- keywordFile$EventName
  }
}
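I don't think my problem is converting the right things to vectors and strings. My keywordFile$Keyword column is a bunch of string vectors, and my text$Description column is a string. But I am struggling with how to loop through the two files correctly. The error I get is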
Error in ... replacement has 13 rows, data has 1
Has anyone done something like this before?
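I'm not entirely sure I understood your question, because I would not call grepl() fuzzy matching. Rather, it catches a keyword even when the keyword sits inside a longer word. So "cat" and "catastrophe" would count as a match, even though the two words are very different.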
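A small base R check makes the difference concrete. This is only an illustration: the `\\b` word-boundary pattern (with `perl = TRUE`) is one possible way to restrict the match to whole words and is not something the question itself uses.

```r
# grepl() looks for the pattern anywhere in the string, so a keyword
# also matches when it is only part of a longer word.
substring_hit <- grepl("cat", "catastrophe", fixed = TRUE)

# Wrapping the keyword in word boundaries (\\b) accepts whole words only.
whole_word_hit <- grepl("\\bcat\\b", "catastrophe", perl = TRUE)
whole_word_ok  <- grepl("\\bcat\\b", "the cat sat down", perl = TRUE)

print(c(substring = substring_hit,
        whole_word_in_catastrophe = whole_word_hit,
        whole_word_in_sentence = whole_word_ok))
```

Neither variant tolerates typos or spelling variants; both are exact matches on the characters of the keyword.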
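Instead, I chose to write an answer in which you can control the distance between strings that still constitutes a match:
Load the libraries: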
library(tibble)
library(dplyr)
library(fuzzyjoin)
library(tidytext)
library(tidyr)
Make the dictionary and data objects:
dict <- tibble(Event_Name = c(
  "All-day tabby fest",
  "All-day tabby fest",
  "Maine Coon Grooming",
  "Maine Coon Grooming"
), Keyword = c(
  "tabby, all-day",
  "tabby, fest",
  "maine coon, groom",
  "coon, groom"
)) %>%
  mutate(Keyword = strsplit(Keyword, ", ")) %>%
  unnest(Keyword)
string <- tibble(id = 1:4, Description = c(
  "Bring your tabby to the fest on Tuesday",
  "All cats are welcome to the fest on Tuesday",
  "Mainecoon grooming will happen at noon Wednesday",
  "Maine coons will be pampered at noon on Wednesday"
))
Apply the dictionary to the data:
string_annotated <- string %>%
  unnest_tokens(output = "word", input = Description) %>%
  stringdist_left_join(y = dict, by = c("word" = "Keyword"), max_dist = 1) %>%
  mutate(match = !is.na(Keyword))
> string_annotated
# A tibble: 34 x 5
id word Event_Name Keyword match
<int> <chr> <chr> <chr> <lgl>
1 1 bring NA NA FALSE
2 1 your NA NA FALSE
3 1 tabby All-day tabby fest tabby TRUE
4 1 tabby All-day tabby fest tabby TRUE
5 1 to NA NA FALSE
6 1 the NA NA FALSE
7 1 fest All-day tabby fest fest TRUE
8 1 on NA NA FALSE
9 1 tuesday NA NA FALSE
10 2 all NA NA FALSE
# ... with 24 more rows
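max_dist controls what still constitutes a match. In this case, a distance between strings of 1 or less finds matches in all of the texts, but I also tried it with a non-matching string.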
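To see what a distance of 1 means for this data, the edit distances can be checked by hand. The sketch below uses base R's `adist()` (generalized Levenshtein distance) rather than the stringdist package that fuzzyjoin relies on; this assumes the two agree for simple single-edit cases like these, which may not hold for every distance method.

```r
# Tokens taken from the example descriptions (lower-cased, as unnest_tokens() does)
tokens   <- c("mainecoon", "coons", "noon", "grooming")
# The dictionary keyword each token is compared with
keywords <- c("maine coon", "coon", "coon", "groom")

# Edit distance for each token/keyword pair
distances <- mapply(function(token, keyword) drop(adist(token, keyword)),
                    tokens, keywords)
names(distances) <- paste(tokens, keywords, sep = " ~ ")
print(distances)

# Which pairs would be joined with max_dist = 1
max_dist <- 1
print(distances <= max_dist)
```

Two things stand out: "noon" is within one edit of "coon", so it is matched even though it is an unrelated word, while "grooming" is three edits away from "groom" and is not matched at this threshold.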
If you want to bring this long format back into the original format:
string_annotated_col <- string_annotated %>%
  group_by(id) %>%
  summarise(Description = paste(word, collapse = " "),
            match = sum(match),
            keywords = toString(unique(na.omit(Keyword))),
            Event_Name = toString(unique(na.omit(Event_Name))))
> string_annotated_col
# A tibble: 4 x 5
id Description match keywords Event_Name
<int> <chr> <int> <chr> <chr>
1 1 bring your tabby tabby to the fest on tuesday 3 tabby, fest All-day tabby fest
2 2 all cats are welcome to the fest on tuesday 1 fest All-day tabby fest
3 3 mainecoon grooming will happen at noon wednesday 2 maine coon, coon Maine Coon Grooming
4 4 maine coons will be pampered at noon on wednesday 2 coon Maine Coon Grooming
Please feel free to ask if parts of the answer don't make sense to you. Some of it is explained here, except for the fuzzy matching part.
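Approximate matching in R can be done with the agrep() function (or agrepl(), which returns TRUE/FALSE like grepl()); grepl() itself only does exact or regular-expression matching. With the option fixed = FALSE, the pattern is treated as a regular expression. These functions are part of base R and do not require any additional libraries.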
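As a rough base R sketch of the table the question asks for, the loop below checks each keyword set against all descriptions at once. It uses plain substring matching with grepl(), as in the question; swapping in agrepl() with a max.distance argument would be one way to tolerate typos, but that is not shown or tested here. The names `keyword_sets`, `desc` and the column `Match` are illustrative choices, not from the original post.

```r
keywordFile <- data.frame(
  EventName = c("All-day tabby fest", "All-day tabby fest",
                "Maine Coon Grooming", "Maine Coon Grooming"),
  Keyword = c("tabby, all-day", "tabby, fest", "maine coon, groom", "coon, groom"),
  stringsAsFactors = FALSE
)
text <- data.frame(
  Description = c("Bring your tabby to the fest on Tuesday",
                  "All cats are welcome to the fest on Tuesday",
                  "Mainecoon grooming will happen at noon Wednesday",
                  "Maine coons will be pampered at noon on Wednesday"),
  stringsAsFactors = FALSE
)

# Split every keyword string into its individual keywords (lower-cased).
keyword_sets <- strsplit(tolower(keywordFile$Keyword), ",\\s*")
desc <- tolower(text$Description)

text$EventName <- NA_character_
for (j in seq_along(keyword_sets)) {
  # grepl() is vectorised over the descriptions, so each keyword is tested
  # against all rows at once; Reduce(`&`) requires every keyword to be present.
  hit <- Reduce(`&`, lapply(keyword_sets[[j]], grepl, x = desc, fixed = TRUE))
  # Keep the first matching event for rows that do not have one yet.
  fill <- hit & is.na(text$EventName)
  text$EventName[fill] <- keywordFile$EventName[j]
}
text$Match <- !is.na(text$EventName)
print(text)
```

Looping over the 2,000 keyword rows and letting grepl() handle the 770,000 descriptions in each pass avoids the nested row-by-row loop from the question, and assigning by logical index avoids the "replacement has ... rows" error.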