Get the distance between two words in R

Question

Say I have a line in a file:

string <- "thanks so much for your help all along. i'll let you know when...."

I want to return a value indicating whether the word "know" is within 6 words of "help".

Answer 1

This is essentially a very crude implementation of Crayon's answer as a basic function:

withinRange <- function(string, term1, term2, threshold = 6) {
  x <- strsplit(string, " ")[[1]]
  abs(grep(term1, x) - grep(term2, x)) <= threshold
}

withinRange(string, "help", "know")
# [1] TRUE

withinRange(string, "thanks", "know")
# [1] FALSE

I would suggest getting a basic idea of the text tools available to you and using them to write such a function. Note Tyler's comment: as implemented, this can match multiple terms ("you" would match both "you" and "your"), leading to funny results. You'll need to decide how you want to handle these cases to get a more useful function.
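One possible way to handle the partial-match case from Tyler's comment is sketched below. It is not part of the original answer: the helper name `within_range_exact` is made up for illustration, and it assumes that stripping leading and trailing punctuation and then comparing whole words exactly (case-sensitively) is the behavior you want. When a term occurs more than once, it checks every pair of positions.

```r
# Hypothetical helper: exact, case-sensitive word matching (an assumption,
# not the original answer's behavior).
within_range_exact <- function(string, term1, term2, threshold = 6) {
  # Split on runs of whitespace, then strip leading/trailing punctuation
  words <- strsplit(string, "\\s+")[[1]]
  words <- gsub("^[[:punct:]]+|[[:punct:]]+$", "", words)

  pos1 <- which(words == term1)
  pos2 <- which(words == term2)

  # If either term is absent there is no distance to test
  if (length(pos1) == 0 || length(pos2) == 0) return(FALSE)

  # Compare every occurrence of term1 with every occurrence of term2
  any(abs(outer(pos1, pos2, "-")) <= threshold)
}

string <- "thanks so much for your help all along. i'll let you know when...."
within_range_exact(string, "help", "know")
within_range_exact(string, "thanks", "know")
```

With exact comparison, "you" no longer matches "your", and a term that is missing from the text yields FALSE instead of a zero-length result.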

Answer 2

You won't be able to get this from regex alone. I suggest splitting on spaces, then looping (or using a built-in function) to search the resulting array for your two terms, and taking the difference of their indexes (array positions).

Edit: Okay, I thought about it for a second, and perhaps this will work for you as a regex pattern:

\\bhelp(\\s+[^\\s]+){1,5}+\\s+know\\b

This uses the same "space is the delimiter" concept. It first matches "help", then greedily matches up to 5 words, then looks for " know" (since "know" would be the 6th).
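A minimal sketch of how a pattern like this might be applied in R follows. It is a variant rather than the exact pattern above: it uses `{0,5}` so that adjacent words also count, drops the trailing `+` after the quantifier (which PCRE would treat as possessive and which could block backtracking to a closer match), and passes `perl = TRUE`. The helper name `follows_within` is hypothetical, and the terms are assumed to be plain words with no regex metacharacters. Note that it only detects the second term appearing after the first.

```r
string <- "thanks so much for your help all along. i'll let you know when...."

# Hypothetical helper: TRUE if term2 follows term1 with at most `gap`
# whitespace-delimited words in between. Terms are pasted into the pattern
# unescaped, so they are assumed to be plain words.
follows_within <- function(string, term1, term2, gap = 5) {
  pattern <- sprintf("\\b%s(\\s+\\S+){0,%d}\\s+%s\\b", term1, gap, term2)
  grepl(pattern, string, perl = TRUE)
}

follows_within(string, "help", "know")
follows_within(string, "thanks", "know")
```

To test both orders, one option is to call the helper twice with the terms swapped and combine the results with `||`.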

Answer 3

Split your string:

> words <- strsplit(string, '\\s')[[1]]

Build an index vector:

> indices <- 1:length(words)

Name the indices:

> names(indices) <- words

Compute the distance between words:

> abs(indices["help"] - indices["know"]) < 6
FALSE

EDIT: In a function

 distance <- function(string, term1, term2) {
    words <- strsplit(string, "\\s")[[1]]
    indices <- 1:length(words)
    names(indices) <- words
    abs(indices[term1] - indices[term2])
 }

 distance(string, "help", "know") < 6

EDIT: Plus

There is a great advantage in indexing words: once it's done, you can compute a lot of statistics on a text.
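As a rough illustration of that idea, the sketch below builds a word-to-positions index with `split()`. It is an assumption about what "indexing" might look like rather than the answer's own code, and the variable name `positions` is made up. Unlike a named vector, where lookup by name returns only the first match, this keeps every position of a repeated word.

```r
string <- "thanks so much for your help all along. i'll let you know when...."
words <- strsplit(string, "\\s+")[[1]]

# Map each distinct word to every position where it occurs
positions <- split(seq_along(words), words)

positions[["help"]]                           # where "help" occurs
positions[["know"]] - positions[["help"]]     # distance between the two words
lengths(positions)                            # occurrence count per word
```

From such an index, word frequencies or distances between any pair of words can be computed without splitting the text again.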
