How to run through list of keyword vectors and fuzzy match them to a different file (R)
I have two files: one is all keywords (roughly 2,000 rows) and the other is all text (roughly 770,000 rows). The keyword file looks like this:
Event Name           Keyword
All-day tabby fest   tabby, all-day
All-day tabby fest   tabby, fest
Maine Coon Grooming  maine coon, groom
Maine Coon Grooming  coon, groom
library(tibble)
keywordFile <- tibble(EventName = c("All-day tabby fest", "All-day tabby fest", "Maine Coon Grooming","Maine Coon Grooming"), Keyword = c("tabby, all-day", "tabby, fest", "maine coon, groom", "coon, groom"))
The text file looks like this:
Description
Bring your tabby to the fest on Tuesday
All cats are welcome to the fest on Tuesday
Mainecoon grooming will happen at noon Wednesday
Maine coons will be pampered at noon on Wednesday
text <- tibble(Description = c("Bring your tabby to the fest on Tuesday","All cats are welcome to the fest on Tuesday","Mainecoon grooming will happen at noon Wednesday","Maine coons will be pampered at noon on Wednesday"))
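What I want is to run through the text file, look for fuzzy matches (a match must include every word in the "Keyword" column), and return a new column that shows TRUE or FALSE. If it is TRUE, I want a third column that shows the event name. So the result would look like: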
Description                                        Match?  Event Name
Bring your tabby to the fest on Tuesday            TRUE    All-day tabby fest
All cats are welcome to the fest on Tuesday        FALSE
Mainecoon grooming will happen at noon Wednesday   TRUE    Maine Coon Grooming
Maine coons will be pampered at noon on Wednesday  FALSE
Thanks to Molx (How can I check if multiple strings exist in another string?), I can test a single string like this:
str <- c("tabby", "all-day")
myStr <- "Bring your tabby to the fest on Tuesday"
all(sapply(str, grepl, myStr))
However, I get stuck when I try to do the fuzzy matching across the whole files. I tried something like this:
for (i in seq_along(text$Description)){
  for (j in seq_along(keywordFile$EventName)) {
    # below I am creating the TRUE/FALSE column
    text$TF[i] <- all(sapply(keywordFile$Keyword[j], grepl,
                             text$Description[i]))
    if (isTRUE(text$TF))
      # below I am creating the EventName column
      text$EventName <- keywordFile$EventName
  }
}
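I don't think my problem is converting the right things to vectors and strings. My keywordFile$Keyword column is a bunch of string vectors, and my text$Description column is a string. But I am struggling with how to loop through both files properly. The error I get is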
Error in ... replacement has 13 rows, data has 1
Has anyone done something like this before?
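I am not entirely sure I understand your question, because I would not call grepl() fuzzy matching. Rather, it also catches a keyword when it sits inside a longer word. So "cat" and "catastrophe" would count as a match, even though the two words are very different.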
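A quick base R illustration of that point, as a minimal sketch: plain grepl() does substring matching, so a short keyword also hits inside longer words. The word-boundary pattern is an addition for contrast only and is not part of the approach in this answer.

```r
# grepl() looks for the pattern anywhere in the string (substring match)
substring_hit <- grepl("cat", "catastrophe", fixed = TRUE)

# Assumption for illustration: \\b word boundaries restrict the match to whole words
whole_word_in_long <- grepl("\\bcat\\b", "catastrophe", perl = TRUE)
whole_word_alone   <- grepl("\\bcat\\b", "the cat sat", perl = TRUE)

print(c(substring_hit, whole_word_in_long, whole_word_alone))
# TRUE, FALSE, TRUE
```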
I chose to write an answer in which you can control the distance between strings that still counts as a match:
Load the libraries:
library(tibble)
library(dplyr)
library(fuzzyjoin)
library(tidytext)
library(tidyr)
Create the dictionary and data objects:
dict <- tibble(Event_Name = c(
"All-day tabby fest",
"All-day tabby fest",
"Maine Coon Grooming",
"Maine Coon Grooming"
), Keyword = c(
"tabby, all-day",
"tabby, fest",
"maine coon, groom",
"coon, groom"
)) %>%
mutate(Keyword = strsplit(Keyword, ", ")) %>%
unnest(Keyword)
string <- tibble(id = 1:4, Description = c(
"Bring your tabby to the fest on Tuesday",
"All cats are welcome to the fest on Tuesday",
"Mainecoon grooming will happen at noon Wednesday",
"Maine coons will be pampered at noon on Wednesday"
))
Apply the dictionary to the data:
string_annotated <- string %>%
unnest_tokens(output = "word", input = Description) %>%
stringdist_left_join(y = dict, by = c("word" = "Keyword"), max_dist = 1) %>%
mutate(match = !is.na(Keyword))
> string_annotated
# A tibble: 34 x 5
id word Event_Name Keyword match
<int> <chr> <chr> <chr> <lgl>
1 1 bring NA NA FALSE
2 1 your NA NA FALSE
3 1 tabby All-day tabby fest tabby TRUE
4 1 tabby All-day tabby fest tabby TRUE
5 1 to NA NA FALSE
6 1 the NA NA FALSE
7 1 fest All-day tabby fest fest TRUE
8 1 on NA NA FALSE
9 1 tuesday NA NA FALSE
10 2 all NA NA FALSE
# ... with 24 more rows
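max_dist controls what still counts as a match. In this case, a distance between strings of 1 or less finds matches for all of the texts, but I also tried it with strings that do not match.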
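To get a feel for what a distance of 1 allows, here is a minimal sketch using base R's adist(), which computes a generalized Levenshtein distance. This is an assumption for illustration: stringdist_left_join() relies on the stringdist package, whose default metric may differ, although for these words the results should agree. The tokens are lower-cased words taken from the descriptions.

```r
# Keywords from the dictionary and a few tokens from the descriptions
keywords <- c("tabby", "fest", "maine coon", "groom", "coon")
tokens   <- c("mainecoon", "grooming", "noon", "coons")

# Edit distance between every token (rows) and every keyword (columns)
d <- adist(tokens, keywords)
dimnames(d) <- list(tokens, keywords)

# Pairs that would count as a match with a threshold of 1
within_one <- d <= 1
print(d)
print(which(within_one, arr.ind = TRUE))
```

Under this metric, "mainecoon" is 1 edit away from "maine coon" and "coons" is 1 edit away from "coon", but "noon" is also 1 edit away from "coon", while "grooming" is 3 edits away from "groom". That would explain why a threshold of 1 can both produce false positives and miss inflected forms.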
If you want to bring this long format back into the original format:
string_annotated_col <- string_annotated %>%
group_by(id) %>%
summarise(Description = paste(word, collapse = " "),
match = sum(match),
keywords = toString(unique(na.omit(Keyword))),
Event_Name = toString(unique(na.omit(Event_Name))))
> string_annotated_col
# A tibble: 4 x 5
id Description match keywords Event_Name
<int> <chr> <int> <chr> <chr>
1 1 bring your tabby tabby to the fest on tuesday 3 tabby, fest All-day tabby fest
2 2 all cats are welcome to the fest on tuesday 1 fest All-day tabby fest
3 3 mainecoon grooming will happen at noon wednesday 2 maine coon, coon Maine Coon Grooming
4 4 maine coons will be pampered at noon on wednesday 2 coon Maine Coon Grooming
If parts of the answer do not make sense to you, feel free to ask. Some of it is explained here, except for the fuzzy matching part.
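For the exact output the question asks for (a TRUE/FALSE column plus the event name, where every keyword in a row must be present), the following is a rough base R sketch rather than a definitive solution. It assumes case-insensitive literal substring matching instead of a string distance, and it keeps only the first matching event. The helper names keyword_sets, has_all and first_event are made up for illustration.

```r
keywordFile <- data.frame(
  EventName = c("All-day tabby fest", "All-day tabby fest",
                "Maine Coon Grooming", "Maine Coon Grooming"),
  Keyword = c("tabby, all-day", "tabby, fest", "maine coon, groom", "coon, groom"),
  stringsAsFactors = FALSE
)
text <- data.frame(
  Description = c("Bring your tabby to the fest on Tuesday",
                  "All cats are welcome to the fest on Tuesday",
                  "Mainecoon grooming will happen at noon Wednesday",
                  "Maine coons will be pampered at noon on Wednesday"),
  stringsAsFactors = FALSE
)

# Split each keyword row into its individual lower-cased terms
keyword_sets <- strsplit(tolower(keywordFile$Keyword), ", ", fixed = TRUE)

# TRUE when every term of one keyword set occurs somewhere in the description
has_all <- function(description, terms) {
  all(vapply(terms, function(term) grepl(term, description, fixed = TRUE),
             logical(1)))
}

# For each description, take the event of the first keyword set that fully matches
first_event <- vapply(tolower(text$Description), function(description) {
  hit <- vapply(keyword_sets, function(terms) has_all(description, terms),
                logical(1))
  if (any(hit)) keywordFile$EventName[which(hit)[1]] else NA_character_
}, character(1), USE.NAMES = FALSE)

text$Match <- !is.na(first_event)
text$EventName <- first_event
print(text)
```

On the four sample rows this marks the first and third descriptions as matches. It compares every description with every keyword row, so on roughly 770,000 by 2,000 rows it would likely need to be vectorised or run in chunks.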
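Approximate matching can be done in R with the agrep() or grepl() functions. It works with the option fixed = FALSE. These functions do not require any additional libraries.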
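As a small sketch of the base R route: agrepl() is the logical counterpart of agrep() and allows a limited number of edits, whereas grepl() on its own only does exact or regular-expression matching. The max.distance value of 0.1 used here is agrep's documented default, read as a fraction of the pattern length; treat the chosen threshold as an assumption to tune for your data.

```r
description <- "Mainecoon grooming will happen at noon Wednesday"

# Exact substring search misses the keyword because of the missing space
exact <- grepl("maine coon", description, ignore.case = TRUE)

# Approximate search tolerates about one edit for a 10-character pattern
approx <- agrepl("maine coon", description, max.distance = 0.1,
                 ignore.case = TRUE)

# A keyword that is far from anything in the text still fails to match
no_tabby <- agrepl("tabby", "All cats are welcome to the fest on Tuesday",
                   max.distance = 0.1, ignore.case = TRUE)

print(c(exact = exact, approx = approx, no_tabby = no_tabby))
```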