How to run through a list of keyword vectors and fuzzy match them to a different file (R)

I have two files, one full of keywords (roughly 2,000 rows) and the other full of text (roughly 770,000 rows). The keyword file looks like this:

Event Name            Keyword
All-day tabby fest    tabby, all-day
All-day tabby fest    tabby, fest
Maine Coon Grooming   maine coon, groom    
Maine Coon Grooming   coon, groom

library(tibble)
keywordFile <- tibble(EventName = c("All-day tabby fest", "All-day tabby fest", "Maine Coon Grooming", "Maine Coon Grooming"), Keyword = c("tabby, all-day", "tabby, fest", "maine coon, groom", "coon, groom"))

The text file looks like this:

Bring your tabby to the fest on Tuesday
All cats are welcome to the fest on Tuesday
Mainecoon grooming will happen at noon Wednesday
Maine coons will be pampered at noon on Wednesday

text <- tibble(Description = c("Bring your tabby to the fest on Tuesday", "All cats are welcome to the fest on Tuesday", "Mainecoon grooming will happen at noon Wednesday", "Maine coons will be pampered at noon on Wednesday"))

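What I want is to run through the text file, look for fuzzy matches (a match must include every word in the Keyword column), and return a new column showing TRUE or FALSE. If it is TRUE, I then want a third column showing the event name. So it would look like this: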

Description                                          Match?   Event Name
Bring your tabby to the fest on Tuesday              TRUE     All-day tabby fest
All cats are welcome to the fest on Tuesday          FALSE
Mainecoon grooming will happen at noon Wednesday     TRUE     Maine Coon Grooming
Maine coons will be pampered at noon on Wednesday    FALSE

Thanks to Molx (How can I check if multiple strings exist in another string?), I can check a vector of keywords against a single string like this:

str <- c("tabby", "all-day")
myStr <- "Bring your tabby to the fest on Tuesday"
all(sapply(str, grepl, myStr))

However, I run into trouble when I try to fuzzy match across the whole file. I have tried something like this:

for (i in seq_along(text$Description)) {
  for (j in seq_along(keywordFile$EventName)) {
    # below I am creating the TRUE/FALSE column
    text$TF[i] <- all(sapply(keywordFile$Keyword[j], grepl,
                             text$Description[i]))
    if (isTRUE(text$TF))
      # below I am creating the EventName column
      text$EventName <- keywordFile$EventName
  }
}

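I don't think the problem is in converting the right things to vectors and strings. My keywordFile$Keyword column is a bunch of string vectors, and my text$Description column is a character string. But I am struggling with how to loop through both files properly. The error I get is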

Error in ... replacement has 13 rows, data has 1


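I'm not entirely sure I understood your question, because I would not call grepl() fuzzy matching. It simply catches a keyword whenever it occurs inside a longer word. So "cat" and "catastrophe" would count as a match, even though the two words are very different.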
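To make that distinction concrete, here is a minimal base R sketch. The variable names are only illustrative, and adist() (generalized Levenshtein distance) is used as one possible way to measure how far apart two strings are.

```r
# grepl() answers "does the pattern occur somewhere in the string?"
substring_hit <- grepl("cat", "catastrophe", fixed = TRUE)

# With word boundaries, the same keyword no longer matches the longer word.
whole_word_hit <- grepl("\\bcat\\b", "catastrophe")

# An edit distance shows how different the two words really are:
# eight characters have to be inserted to turn "cat" into "catastrophe".
edit_distance <- adist("cat", "catastrophe")[1, 1]

print(c(substring = substring_hit, whole_word = whole_word_hit))
print(edit_distance)
```

A distance-based match only accepts strings that are within a small number of edits of each other, which is closer to what "fuzzy" usually means.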





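library(tidyverse)  # tibble(), mutate(), unnest(), summarise()
library(tidytext)   # unnest_tokens()
library(fuzzyjoin)  # stringdist_left_join()

dict <- tibble(Event_Name = c(
  "All-day tabby fest",
  "All-day tabby fest",
  "Maine Coon Grooming",
  "Maine Coon Grooming"
), Keyword = c(
  "tabby, all-day",
  "tabby, fest",
  "maine coon, groom",
  "coon, groom"
)) %>% 
  mutate(Keyword = strsplit(Keyword, ", ")) %>% 
  # the pipe was cut off here; unnest() (one row per keyword) is assumed
  # from the character Keyword column in the output below
  unnest(Keyword)

string <- tibble(id = 1:4, Description = c(
  "Bring your tabby to the fest on Tuesday",
  "All cats are welcome to the fest on Tuesday",
  "Mainecoon grooming will happen at noon Wednesday",
  "Maine coons will be pampered at noon on Wednesday"
))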


string_annotated <- string %>% 
  unnest_tokens(output = "word", input = Description) %>%
  stringdist_left_join(y = dict, by = c("word" = "Keyword"), max_dist = 1) %>% 
  mutate(match = !is.na(Keyword))

> string_annotated
# A tibble: 34 x 5
      id word    Event_Name         Keyword match
   <int> <chr>   <chr>              <chr>   <lgl>
 1     1 bring   NA                 NA      FALSE
 2     1 your    NA                 NA      FALSE
 3     1 tabby   All-day tabby fest tabby   TRUE 
 4     1 tabby   All-day tabby fest tabby   TRUE 
 5     1 to      NA                 NA      FALSE
 6     1 the     NA                 NA      FALSE
 7     1 fest    All-day tabby fest fest    TRUE 
 8     1 on      NA                 NA      FALSE
 9     1 tuesday NA                 NA      FALSE
10     2 all     NA                 NA      FALSE
# ... with 24 more rows

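max_dist controls what still counts as a match. In this case, a string distance of 1 or less finds matches for all of the texts, but I also tried it with strings that do not match.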
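To see why a threshold of 1 behaves this way, it can help to look at the distances directly. The sketch below uses base R's adist() (Levenshtein distance) as a stand-in. This is an assumption: stringdist_left_join() computes its distances with the stringdist package, and its default method can give different values when characters are transposed. The words and keywords are taken from the sample data.

```r
words    <- c("mainecoon", "coons", "noon", "grooming")
keywords <- c("maine coon", "coon", "groom")

# Pairwise edit distances: rows are words from the text, columns are keywords.
d <- adist(words, keywords)
dimnames(d) <- list(words, keywords)

# Which word/keyword pairs would pass a threshold of 1?
within_one <- d <= 1

print(d)
print(within_one)
```

Note that "noon" is only one substitution away from "coon", so it passes the threshold as well, while "grooming" is three edits away from "groom" and does not. Picking max_dist is therefore a trade-off between catching misspellings and accepting unrelated words.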


string_annotated_col <- string_annotated %>% 
  group_by(id) %>% 
  summarise(Description = paste(word, collapse = " "),
            match = sum(match),
            keywords = toString(unique(na.omit(Keyword))),
            Event_Name = toString(unique(na.omit(Event_Name))))

> string_annotated_col
# A tibble: 4 x 5
     id Description                                       match keywords         Event_Name         
  <int> <chr>                                             <int> <chr>            <chr>              
1     1 bring your tabby tabby to the fest on tuesday         3 tabby, fest      All-day tabby fest 
2     2 all cats are welcome to the fest on tuesday           1 fest             All-day tabby fest 
3     3 mainecoon grooming will happen at noon wednesday      2 maine coon, coon Maine Coon Grooming
4     4 maine coons will be pampered at noon on wednesday     2 coon             Maine Coon Grooming

If parts of the answer don't make sense to you, feel free to ask. Some of it is explained here, except for the fuzzy matching part.

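Approximate matching can be done in R with the agrep() and agrepl() functions. It works with the option fixed = FALSE. These functions do not require any additional libraries.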
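As a rough sketch of how agrepl() could produce the layout asked for in the question, the following uses only base R. The helper all_keywords_match() and the names keyword_sets, first_hit and max_distance are hypothetical, and the threshold of one edit is an assumption that would need tuning on real data.

```r
keywordFile <- data.frame(
  EventName = c("All-day tabby fest", "All-day tabby fest",
                "Maine Coon Grooming", "Maine Coon Grooming"),
  Keyword = c("tabby, all-day", "tabby, fest", "maine coon, groom", "coon, groom"),
  stringsAsFactors = FALSE
)
text <- data.frame(
  Description = c("Bring your tabby to the fest on Tuesday",
                  "All cats are welcome to the fest on Tuesday",
                  "Mainecoon grooming will happen at noon Wednesday",
                  "Maine coons will be pampered at noon on Wednesday"),
  stringsAsFactors = FALSE
)

# Split each keyword row into its individual keywords.
keyword_sets <- strsplit(keywordFile$Keyword, ",\\s*")

# TRUE if every keyword approximately occurs somewhere in the description.
all_keywords_match <- function(keys, description, max_distance = 1) {
  all(vapply(keys, function(k) {
    agrepl(k, description, max.distance = max_distance, ignore.case = TRUE)
  }, logical(1)))
}

# For each description, the index of the first keyword row that fully
# matches, or NA if no row does.
first_hit <- vapply(text$Description, function(d) {
  hits <- which(vapply(keyword_sets, all_keywords_match, logical(1),
                       description = d))
  if (length(hits) > 0) hits[1] else NA_integer_
}, integer(1), USE.NAMES = FALSE)

text$Match     <- !is.na(first_hit)
text$EventName <- keywordFile$EventName[first_hit]
print(text)
```

Like grepl(), agrepl() looks for the pattern anywhere inside the string, so a short keyword can still match inside a longer word. With roughly 770,000 descriptions and 2,000 keyword rows this nested loop may also be slow, so it is better suited to checking the logic on a sample than to a full run.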


