
Find all possible phrase matches between string and lookup table

I have a data frame with a bunch of text strings. In a second data frame I have a list of phrases that I'm using as a lookup table. I want to search the text strings for all possible phrase matches in the lookup table.

My problem is that some of the phrases have overlapping words. For example: "eggs" and "green eggs".

library(udpipe)
library(dplyr)
library(stringr)  # provides str_count(), used below

# Download and load the English model
ud_model <- udpipe_download_model(language = "english")
ud_model <- udpipe_load_model(ud_model$file_model)

# Create example data
sample <- data.frame(doc_id = 1, text = "the cat in the hat ate green eggs and ham")
phrases <- data.frame(phrase = c("cat", "hat", "eggs", "green eggs", "ham", "the cat"))

# Tokenize text
x <- udpipe_annotate(ud_model, x = sample$text, doc_id = sample$doc_id)
x <- as.data.frame(x)
x$token <- tolower(x$token)

test_results <- x %>% select(doc_id, token)
test_results$term <- txt_recode_ngram(test_results$token, 
                                 compound = phrases$phrase, 
                                 ngram = str_count(phrases$phrase, '\\w+'), 
                                 sep = " ")

# Remove any tokens that don't match a phrase in the lookup table
test_results <- filter(test_results, term %in% phrases$phrase)

In the results you can see that "the cat" is returned but not "cat", and "green eggs" but not "eggs".

> test_results$term
[1] "the cat"    "hat"        "green eggs" "ham" 

How can I find all possible phrase matches between a text string and a lookup table?

I should add that I'm not wedded to any particular package. I'm just using udpipe here because I'm most familiar with it.

I think you can simply use grepl to check whether one string occurs inside another, and then apply it over every phrase in the lookup table.

# Create example data
sample <- data.frame(doc_id = 1, text = "the cat in the hat ate green eggs and ham")
phrases <- data.frame(phrase = c("cat", "hat", "eggs", "green eggs", "ham", "the cat"))

# One TRUE/FALSE per phrase: is the phrase found anywhere in the text?
apply(phrases, 1, grepl, sample$text)

And if you want the matching phrases themselves, you can subset with the result:

phrases[apply(phrases, 1, grepl, sample$text), ]

That said, a data frame may not be the most suitable type for phrases.
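
One caveat with plain grepl is that it matches substrings, so "hat" would also be found inside "that". The following is a minimal base R sketch of one way around this, not a definitive solution: it keeps the phrases in a character vector, wraps each one in \b word-boundary anchors, and runs over several documents at once. The second document, the `docs`, `patterns`, `hits` and `matches` names are all made up for illustration, and the sketch assumes the phrases contain no regex metacharacters.

```r
# Hypothetical example data: two documents and a phrase lookup vector
docs <- data.frame(
  doc_id = c(1, 2),
  text = c("the cat in the hat ate green eggs and ham",
           "that category has no eggs"),
  stringsAsFactors = FALSE
)
phrases <- c("cat", "hat", "eggs", "green eggs", "ham", "the cat")

# Wrap each phrase in \b anchors so "hat" does not match inside "that".
# Assumption: the phrases contain no regex metacharacters.
patterns <- paste0("\\b", phrases, "\\b")

# One grepl call per phrase, each returning one logical per document
hits <- vapply(patterns, grepl, logical(nrow(docs)), x = docs$text, perl = TRUE)

# Force a documents-by-phrases matrix, even when there is only one document
hits <- matrix(hits, nrow = nrow(docs))

# Long-format result: one row per (document, phrase) match
idx <- which(hits, arr.ind = TRUE)
matches <- data.frame(
  doc_id = docs$doc_id[idx[, "row"]],
  phrase = phrases[idx[, "col"]],
  stringsAsFactors = FALSE
)
matches <- matches[order(matches$doc_id), ]
matches
```

Here every phrase is reported for the first document, including the overlapping pairs "cat" / "the cat" and "eggs" / "green eggs", while the second document only matches "eggs". If the phrases might contain characters such as "." or "(", they would need escaping before being pasted into a pattern.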
