繁体   English   中英

如何遍历关键字向量列表并将它们模糊匹配到不同的文件 (R)

[英]How to run through list of keyword vectors and fuzzy match them to a different file (R)

我有两个文件,一个全是关键字(大约 2,000 行),另一个全是文本(大约 770,000 行)。 关键字文件如下所示:

Event Name            Keyword
All-day tabby fest    tabby, all-day
All-day tabby fest    tabby, fest
Maine Coon Grooming   maine coon, groom    
Maine Coon Grooming   coon, groom

keywordFile <- tibble(EventName = c("All-day tabby fest", "All-day tabby fest", "Maine Coon Grooming","Maine Coon Grooming"), Keyword = c("tabby, all-day", "tabby, fest", "maine coon, groom", "coon, groom")

文本文件如下所示:

Description
Bring your tabby to the fest on Tuesday
All cats are welcome to the fest on Tuesday
Mainecoon grooming will happen at noon Wednesday
Maine coons will be pampered at noon on Wednesday

text <- tibble(Description = c("Bring your tabby to the fest on Tuesday","All cats are welcome to the fest on Tuesday","Mainecoon grooming will happen at noon Wednesday","Maine coons will be pampered at noon on Wednesday")

我想要的是遍历文本文件并查找模糊匹配(必须包括“关键字”列中的每个单词)并返回显示 TRUE 或 False 的新列。 如果那是真的,那么我想要第三列来显示事件名称。 所以看起来像:

Description                                          Match?   Event Name
Bring your tabby to the fest on Tuesday              TRUE     All-day tabby fest
All cats are welcome to the fest on Tuesday          FALSE
Mainecoon grooming will happen at noon Wednesday     TRUE     Maine Coon Grooming
Maine coons will be pampered at noon on Wednesday    FALSE

多亏了 Molx( How can I check if multiple strings exist in another string? ):

str <- c("tabby", "all-day")
myStr <- "Bring your tabby to the fest on Tuesday"
all(sapply(str, grepl, myStr))

但是,当我尝试对整个文件进行模糊匹配时,我遇到了困难。 我试过这样的事情:

for (i in seq_along(text$Description)){
  for (j in seq_along(keywordFile$EventName)) {
    # below I am creating the TRUE/FALSE column
    text$TF[i] <- all(sapply(keywordFile$Keyword[j], grepl, 
                                                     text$Description[i]))
    if (isTRUE(text$TF))
      # below I am creating the EventName column
      text$EventName <- keywordFile$EventName
    }
}

我不认为我在将正确的东西转换为向量和字符串时遇到问题。 我的 keywordFile$Keyword 列是一堆字符串向量,我的 text$Description 列是一个字符串。 但是我正在为如何正确地遍历这两个文件而苦苦挣扎。 我得到的错误是

Error in ... replacement has 13 rows, data has 1

以前有人做过这样的事吗?

我不完全确定我明白了你的问题,因为我不会调用grepl()模糊匹配。 如果关键字位于更长的单词中,它会更愿意捕获关键字。 所以“cat”和“catastrophe”将是一个匹配事件,认为这两个词非常不同。

我选择写一个答案,如果你可以控制仍然构成匹配的字符串之间的距离:

加载库:

library(tibble)
library(dplyr)
library(fuzzyjoin)
library(tidytext)
library(tidyr)

制作字典和数据对象:

dict <- tibble(Event_Name = c(
  "All-day tabby fest",
  "All-day tabby fest",
  "Maine Coon Grooming",
  "Maine Coon Grooming"
), Keyword = c(
  "tabby, all-day",
  "tabby, fest",
  "maine coon, groom",
  "coon, groom"
)) %>% 
  mutate(Keyword = strsplit(Keyword, ", ")) %>% 
  unnest(Keyword)

string <- tibble(id = 1:4, Description = c(
  "Bring your tabby to the fest on Tuesday",
  "All cats are welcome to the fest on Tuesday",
  "Mainecoon grooming will happen at noon Wednesday",
  "Maine coons will be pampered at noon on Wednesday"
))

将字典应用于数据:

string_annotated <- string %>% 
  unnest_tokens(output = "word", input = Description) %>%
  stringdist_left_join(y = dict, by = c("word" = "Keyword"), max_dist = 1) %>% 
  mutate(match = !is.na(Keyword))

> string_annotated
# A tibble: 34 x 5
      id word    Event_Name         Keyword match
   <int> <chr>   <chr>              <chr>   <lgl>
 1     1 bring   NA                 NA      FALSE
 2     1 your    NA                 NA      FALSE
 3     1 tabby   All-day tabby fest tabby   TRUE 
 4     1 tabby   All-day tabby fest tabby   TRUE 
 5     1 to      NA                 NA      FALSE
 6     1 the     NA                 NA      FALSE
 7     1 fest    All-day tabby fest fest    TRUE 
 8     1 on      NA                 NA      FALSE
 9     1 tuesday NA                 NA      FALSE
10     2 all     NA                 NA      FALSE
# ... with 24 more rows

max_dist控制仍然构成匹配的内容。 在这种情况下,字符串之间的距离为1或更小可以找到所有文本的匹配项,但我也尝试使用不匹配的字符串。

如果您想将此长格式恢复为原始格式:

string_annotated_col <- string_annotated %>% 
  group_by(id) %>% 
  summarise(Description = paste(word, collapse = " "),
            match = sum(match),
            keywords = toString(unique(na.omit(Keyword))),
            Event_Name = toString(unique(na.omit(Event_Name))))

> string_annotated_col
# A tibble: 4 x 5
     id Description                                       match keywords         Event_Name         
  <int> <chr>                                             <int> <chr>            <chr>              
1     1 bring your tabby tabby to the fest on tuesday         3 tabby, fest      All-day tabby fest 
2     2 all cats are welcome to the fest on tuesday           1 fest             All-day tabby fest 
3     3 mainecoon grooming will happen at noon wednesday      2 maine coon, coon Maine Coon Grooming
4     4 maine coons will be pampered at noon on wednesday     2 coon             Maine Coon Grooming

如果部分答案对您没有意义,请随时提问。 其中一些解释在这里 除了模糊匹配部分。

可以使用agrep()grepl()函数在 R 中进行近似匹配。 它适用于选项fixed=False 这些函数不需要任何额外的库。

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM