Getting distance between two words in R
Say I have a line in a file:
string <- "thanks so much for your help all along. i'll let you know when...."
I want to return a value indicating whether the word "know" is within 6 words of "help".
This is essentially a very crude implementation of Crayon's answer as a basic function:
withinRange <- function(string, term1, term2, threshold = 6) {
  x <- strsplit(string, " ")[[1]]
  abs(grep(term1, x) - grep(term2, x)) <= threshold
}
withinRange(string, "help", "know")
# [1] TRUE
withinRange(string, "thanks", "know")
# [1] FALSE
I would suggest getting a basic idea of the text tools available to you, and using them to write such a function. Note Tyler's comment: as implemented, this can match multiple terms ("you" would match both "you" and "your"), leading to funny results. You'll need to decide how you want to deal with these cases to have a more useful function.
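One possible way to deal with the partial-match problem is sketched below. It compares whole words with == instead of grep(), strips leading and trailing punctuation, and uses the smallest distance over all occurrences. The function name within_range_exact and the choice to return NA when a term is missing are assumptions made for illustration, not part of the answer above.

```r
string <- "thanks so much for your help all along. i'll let you know when...."

# Hypothetical helper: exact-word variant of withinRange()
within_range_exact <- function(string, term1, term2, threshold = 6) {
  # Split on runs of whitespace
  words <- strsplit(string, "\\s+")[[1]]
  # Drop punctuation at the start or end of each word ("along." -> "along")
  words <- gsub("^[[:punct:]]+|[[:punct:]]+$", "", words)
  # Positions of exact matches only, so "you" does not match "your"
  pos1 <- which(words == term1)
  pos2 <- which(words == term2)
  # Assumption: return NA when either term is absent
  if (length(pos1) == 0 || length(pos2) == 0) return(NA)
  # Smallest distance over every pair of occurrences
  min(abs(outer(pos1, pos2, "-"))) <= threshold
}

within_range_exact(string, "help", "know")
within_range_exact(string, "you", "know")
```

Because the result is reduced with min(), the function always returns a single logical value, even when a term appears several times.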
You won't be able to get this from regex alone. I suggest splitting on spaces, then looping or using a built-in function to search the resulting array for your two terms, and taking the difference of their indexes (array positions).
Edit: Okay, I thought about it for a second, and perhaps this will work for you as a regex pattern:
\\bhelp(\\s+[^\\s]+){0,5}\\s+know\\b
This uses the same "space is the delimiter" concept. It first matches "help", then greedily matches up to 5 " word" groups, then looks for " know" (since "know" would be the 6th).
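A minimal sketch of how a pattern along these lines could be applied in R with grepl(). It assumes perl = TRUE and uses a plain {0,n} quantifier so that adjacent words also count and the engine is free to backtrack. The function name within_words_regex and its threshold argument are illustrative. The terms are pasted into the pattern unescaped, so they are treated as regex, and only the order "first ... second" is checked.

```r
string <- "thanks so much for your help all along. i'll let you know when...."

# Hypothetical helper: TRUE if `second` follows `first` within `threshold` words
within_words_regex <- function(string, first, second, threshold = 6) {
  # Allow at most threshold - 1 whitespace-delimited words in between
  pattern <- sprintf("\\b%s(\\s+\\S+){0,%d}\\s+%s\\b",
                     first, threshold - 1, second)
  grepl(pattern, string, perl = TRUE)
}

within_words_regex(string, "help", "know")
within_words_regex(string, "thanks", "know")
```

To test both orders, one option is to call the function twice with the terms swapped and combine the results with ||.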
Split your string:
> words <- strsplit(string, '\\s')[[1]]
Build an index vector:
> indices <- seq_along(words)
Name the indices:
> names(indices) <- words
Compute the distance between the words:
> abs(indices["help"] - indices["know"]) < 6
FALSE
EDIT: In a function
distance <- function(string, term1, term2) {
  words <- strsplit(string, "\\s")[[1]]
  indices <- seq_along(words)
  names(indices) <- words
  abs(indices[term1] - indices[term2])
}
distance(string, "help", "know") < 6
EDIT: Plus
There is a great advantage in indexing words: once it's done, you can compute a lot of statistics on a text.
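As a small illustration of that point, the sketch below reuses the same word index to get per-word positions and counts. The variable names positions and counts are chosen here for illustration, and lengths() assumes R 3.2.0 or later.

```r
string <- "thanks so much for your help all along. i'll let you know when...."
words <- strsplit(string, "\\s+")[[1]]

# Every position at which each distinct word occurs
positions <- split(seq_along(words), words)

# Number of occurrences of each distinct word
counts <- lengths(positions)

positions[["help"]]
counts[["know"]]
```

Unlike the named-vector lookup, split() keeps every occurrence of a repeated word rather than only the first.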