How to run through list of keyword vectors and fuzzy match them to a different file (R)

I have two files: one is full of keywords (roughly 2,000 rows) and the other is full of text (roughly 770,000 rows). The keyword file looks like:

Event Name            Keyword
All-day tabby fest    tabby, all-day
All-day tabby fest    tabby, fest
Maine Coon Grooming   maine coon, groom    
Maine Coon Grooming   coon, groom

library(tibble)

keywordFile <- tibble(
  EventName = c("All-day tabby fest", "All-day tabby fest", "Maine Coon Grooming", "Maine Coon Grooming"),
  Keyword = c("tabby, all-day", "tabby, fest", "maine coon, groom", "coon, groom")
)

The text file looks like:

Description
Bring your tabby to the fest on Tuesday
All cats are welcome to the fest on Tuesday
Mainecoon grooming will happen at noon Wednesday
Maine coons will be pampered at noon on Wednesday

text <- tibble(Description = c(
  "Bring your tabby to the fest on Tuesday",
  "All cats are welcome to the fest on Tuesday",
  "Mainecoon grooming will happen at noon Wednesday",
  "Maine coons will be pampered at noon on Wednesday"
))

What I want is to iterate through the text file, look for fuzzy matches (a match must include every word in the "Keyword" column), and return a new column that displays TRUE or FALSE. If it is TRUE, then I want a third column to display the event name. So something that looks like:

Description                                          Match?   Event Name
Bring your tabby to the fest on Tuesday              TRUE     All-day tabby fest
All cats are welcome to the fest on Tuesday          FALSE
Mainecoon grooming will happen at noon Wednesday     TRUE     Maine Coon Grooming
Maine coons will be pampered at noon on Wednesday    FALSE

I am able to successfully do my fuzzy matches (after converting everything to lowercase) with stuff like this, thanks to Molx (How can I check if multiple strings exist in another string?):

str <- c("tabby", "all-day")
myStr <- "Bring your tabby to the fest on Tuesday"
all(sapply(str, grepl, myStr))

However, I am getting stuck when I try to fuzzy match the whole files. I tried something like this:

for (i in seq_along(text$Description)){
  for (j in seq_along(keywordFile$EventName)) {
    # below I am creating the TRUE/FALSE column
    text$TF[i] <- all(sapply(keywordFile$Keyword[j], grepl, 
                                                     text$Description[i]))
    if (isTRUE(text$TF))
      # below I am creating the EventName column
      text$EventName <- keywordFile$EventName
    }
}

I don't think I'm having trouble converting the right things to vectors and strings. My keywordFile$Keyword column is a bunch of string vectors and my text$Description column is a character string. But I'm struggling with how to iterate properly through both files. The error I'm getting is

Error in ... replacement has 13 rows, data has 1

Has anyone done anything like this before?

I'm not completely sure I understand your question, as I wouldn't call grepl() fuzzy matching. Rather, it catches the keyword even when it is inside a longer word. So "cat" and "catastrophe" would be a match even though these words are very different.
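To make the substring point concrete, here is a minimal base R sketch. The variable names and the "my cat sat" test string are made up for illustration; the `\\b` word-boundary pattern is one possible way to restrict grepl() to whole words, not something the question uses.

```r
# grepl() does regular-expression (substring) matching, not fuzzy matching.
substring_hit   <- grepl("cat", "catastrophe")        # TRUE: "cat" occurs inside the longer word
whole_word_miss <- grepl("\\bcat\\b", "catastrophe")  # FALSE: \\b only allows whole-word matches
whole_word_hit  <- grepl("\\bcat\\b", "my cat sat")   # TRUE: "cat" stands alone here

print(c(substring_hit, whole_word_miss, whole_word_hit))
```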

I chose instead to write an answer where you can control the distance between strings that still constitutes a match:

Load libraries:

library(tibble)
library(dplyr)
library(fuzzyjoin)
library(tidytext)
library(tidyr)

Make dictionary and data object:

dict <- tibble(Event_Name = c(
  "All-day tabby fest",
  "All-day tabby fest",
  "Maine Coon Grooming",
  "Maine Coon Grooming"
), Keyword = c(
  "tabby, all-day",
  "tabby, fest",
  "maine coon, groom",
  "coon, groom"
)) %>% 
  mutate(Keyword = strsplit(Keyword, ", ")) %>% 
  unnest(Keyword)

string <- tibble(id = 1:4, Description = c(
  "Bring your tabby to the fest on Tuesday",
  "All cats are welcome to the fest on Tuesday",
  "Mainecoon grooming will happen at noon Wednesday",
  "Maine coons will be pampered at noon on Wednesday"
))

Apply dictionary to data:

string_annotated <- string %>% 
  unnest_tokens(output = "word", input = Description) %>%
  stringdist_left_join(y = dict, by = c("word" = "Keyword"), max_dist = 1) %>% 
  mutate(match = !is.na(Keyword))

> string_annotated
# A tibble: 34 x 5
      id word    Event_Name         Keyword match
   <int> <chr>   <chr>              <chr>   <lgl>
 1     1 bring   NA                 NA      FALSE
 2     1 your    NA                 NA      FALSE
 3     1 tabby   All-day tabby fest tabby   TRUE 
 4     1 tabby   All-day tabby fest tabby   TRUE 
 5     1 to      NA                 NA      FALSE
 6     1 the     NA                 NA      FALSE
 7     1 fest    All-day tabby fest fest    TRUE 
 8     1 on      NA                 NA      FALSE
 9     1 tuesday NA                 NA      FALSE
10     2 all     NA                 NA      FALSE
# ... with 24 more rows

max_dist controls what still counts as a match. In this case, a string distance of 1 or less finds a match for all of the texts, but I also tried it with a string that has no match.
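To get a feel for what a distance of 1 means, the edit distances for a few word/keyword pairs from the example can be checked with base R's adist(). This is only a sketch: adist() computes a generalized Levenshtein distance, and it is an assumption here that the metric used by the join gives the same numbers for these simple pairs. The `pairs` data frame is a hypothetical helper.

```r
# Word/keyword pairs taken from the example data above.
pairs <- data.frame(
  word    = c("coons", "mainecoon",  "grooming", "noon"),
  keyword = c("coon",  "maine coon", "groom",    "coon"),
  stringsAsFactors = FALSE
)

# adist() returns a distance matrix; each call here compares one pair,
# so the single cell [1, 1] is the distance for that pair.
pairs$distance <- mapply(
  function(w, k) adist(w, k)[1, 1],
  pairs$word, pairs$keyword,
  USE.NAMES = FALSE
)

# A pair counts as a match when its distance is at most max_dist (1 here).
pairs$within_max_dist <- pairs$distance <= 1
print(pairs)
```

Under these assumptions "grooming" is 3 edits away from "groom" and falls outside max_dist = 1, while "noon" is only 1 edit away from "coon" and therefore counts as a match, so it is worth inspecting which word triggered each hit.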

If you want to get this long format back into the original:

string_annotated_col <- string_annotated %>% 
  group_by(id) %>% 
  summarise(Description = paste(word, collapse = " "),
            match = sum(match),
            keywords = toString(unique(na.omit(Keyword))),
            Event_Name = toString(unique(na.omit(Event_Name))))

> string_annotated_col
# A tibble: 4 x 5
     id Description                                       match keywords         Event_Name         
  <int> <chr>                                             <int> <chr>            <chr>              
1     1 bring your tabby tabby to the fest on tuesday         3 tabby, fest      All-day tabby fest 
2     2 all cats are welcome to the fest on tuesday           1 fest             All-day tabby fest 
3     3 mainecoon grooming will happen at noon wednesday      2 maine coon, coon Maine Coon Grooming
4     4 maine coons will be pampered at noon on wednesday     2 coon             Maine Coon Grooming

Feel free to ask questions if a part of the answer doesn't make sense to you. Some of it is explained here, except the fuzzy matching part.

Approximate matching can be done in base R using agrep(), or agrepl(), which returns a logical vector the way grepl() does. Both accept the option fixed = FALSE, which treats the pattern as a regular expression instead of a literal string. You do not need any additional libraries for these functions.
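As a rough sketch of how agrepl() could produce the table asked for in the question, the code below requires every term of a keyword row to approximately occur in a description. The helper `matches_all`, the `max_dist` argument, and the choice to report only the first matching keyword row are assumptions made for illustration. Plain data frames are used so that no packages are needed.

```r
keywordFile <- data.frame(
  EventName = c("All-day tabby fest", "All-day tabby fest",
                "Maine Coon Grooming", "Maine Coon Grooming"),
  Keyword   = c("tabby, all-day", "tabby, fest", "maine coon, groom", "coon, groom"),
  stringsAsFactors = FALSE
)
text <- data.frame(
  Description = c("Bring your tabby to the fest on Tuesday",
                  "All cats are welcome to the fest on Tuesday",
                  "Mainecoon grooming will happen at noon Wednesday",
                  "Maine coons will be pampered at noon on Wednesday"),
  stringsAsFactors = FALSE
)

# Split each keyword cell into a vector of lowercase terms.
keyword_sets <- strsplit(tolower(keywordFile$Keyword), ",\\s*")

# Hypothetical helper: TRUE only if every term approximately occurs in the
# description, allowing at most `max_dist` edits per term.
matches_all <- function(terms, description, max_dist = 1L) {
  all(vapply(
    terms,
    function(term) agrepl(term, description, max.distance = max_dist),
    logical(1)
  ))
}

# For each description, the index of the first keyword row that fully matches.
first_hit <- vapply(tolower(text$Description), function(description) {
  hits <- vapply(keyword_sets, matches_all, logical(1), description = description)
  if (any(hits)) which(hits)[1] else NA_integer_
}, integer(1), USE.NAMES = FALSE)

text$Match     <- !is.na(first_hit)
text$EventName <- keywordFile$EventName[first_hit]  # NA where nothing matched
print(text)
```

This compares every description with every keyword row, so on roughly 770,000 descriptions and 2,000 keyword rows it is likely to be slow; it is meant to show the matching logic, not to be a tuned solution.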
