Getting distance between two words in R

Say I have a line in a file:

string <- "thanks so much for your help all along. i'll let you know when...."

I want to return a value indicating whether the word "know" is within 6 words of "help".

This is essentially a very crude implementation of Crayon's answer as a basic function:

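withinRange <- function(string, term1, term2, threshold = 6) {
  # Split the string into words on single spaces
  x <- strsplit(string, " ")[[1]]
  # grep() returns the positions of the words that match each term
  abs(grep(term1, x) - grep(term2, x)) <= threshold
}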

withinRange(string, "help", "know")
# [1] TRUE

withinRange(string, "thanks", "know")
# [1] FALSE

I would suggest getting a basic idea of the text tools available to you, and using them to write such a function.

Note Tyler's comment: as implemented, grep() does partial matching, so a term can match several words ("you" matches both "you" and "your"), leading to funny results. You'll need to decide how you want to deal with these cases to have a more useful function.
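One way to deal with the partial-match problem is to compare whole tokens instead of calling grep(). The following is only a sketch: the name withinRangeExact, and the choice to lowercase the text and strip surrounding punctuation, are assumptions rather than part of the original answer. It returns TRUE if any occurrence of term1 lies within threshold words of any occurrence of term2.

```r
# Hypothetical variant of withinRange() that matches whole words only.
withinRangeExact <- function(string, term1, term2, threshold = 6) {
  # Split on runs of whitespace, lowercase, and strip leading/trailing punctuation.
  x <- strsplit(tolower(string), "[[:space:]]+")[[1]]
  x <- gsub("^[[:punct:]]+|[[:punct:]]+$", "", x)
  pos1 <- which(x == tolower(term1))
  pos2 <- which(x == tolower(term2))
  # If either term is absent there is nothing to compare.
  if (length(pos1) == 0 || length(pos2) == 0) return(FALSE)
  # Compare every occurrence of term1 with every occurrence of term2.
  any(abs(outer(pos1, pos2, "-")) <= threshold)
}

string <- "thanks so much for your help all along. i'll let you know when...."
withinRangeExact(string, "help", "know")    # TRUE
withinRangeExact(string, "you", "thanks")   # FALSE
```

Here "you" matches only the 11th word, not "your", so the function always returns a single logical value.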

You won't be able to get this from regex alone. I suggest splitting on spaces, then looping or using a built-in function to search the resulting array for your two terms, and taking the difference of their indexes (array positions).

Edit: Okay, I thought about it for a second, and perhaps this will work for you as a regex pattern:

\\bhelp(\\s+\\S+){0,5}\\s+know\\b

This uses the same "space is the delimiter" concept. It first matches "help", then greedily matches up to 5 occurrences of " word" (whitespace followed by a word), then looks for " know" (since "know" would be the 6th).
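To try this idea in R, the pattern can be built from the two terms and passed to grepl(). This is a rough sketch, not the answerer's code: the helper name withinWordsRegex and its parameter n are made up for illustration, it assumes the terms contain no regex metacharacters, it uses perl = TRUE so that \\s and \\S behave as expected, and it allows zero intervening words so that adjacent terms also count. Because the pattern is order-sensitive, both orders are checked.

```r
# Hypothetical helper: is term2 at most `n` words away from term1, in either order?
# Assumes both terms are plain words with no regex metacharacters.
withinWordsRegex <- function(string, term1, term2, n = 6) {
  # Allow 0 to n - 1 intervening whitespace-delimited tokens.
  gap <- sprintf("(\\s+\\S+){0,%d}\\s+", n - 1)
  forward  <- paste0("\\b", term1, gap, term2, "\\b")
  backward <- paste0("\\b", term2, gap, term1, "\\b")
  grepl(forward, string, perl = TRUE) || grepl(backward, string, perl = TRUE)
}

string <- "thanks so much for your help all along. i'll let you know when...."
withinWordsRegex(string, "help", "know")    # TRUE
withinWordsRegex(string, "thanks", "know")  # FALSE
```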

Split your string:

> words <- strsplit(string, '\\s')[[1]]

Build an index vector:

> indices <- seq_along(words)

Name the indices:

> names(indices) <- words

Compute the distance between the words:

> abs(indices["help"] - indices["know"]) < 6
FALSE

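EDIT: In a function

distance <- function(string, term1, term2) {
    # Split on whitespace and label each position with its word
    words <- strsplit(string, "\\s")[[1]]
    indices <- seq_along(words)
    names(indices) <- words
    # Name lookup returns the first occurrence of each term
    abs(indices[term1] - indices[term2])
}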

 distance(string, "help", "know") < 6

EDIT: Plus

There is a great advantage in indexing words: once it's done, you can compute a lot of statistics on a text.
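As an illustration of what such an index makes possible, here is a small sketch. The names positions, freq and minDistance are hypothetical, and minDistance assumes both words actually occur in the text. Keeping every position of each word, rather than only the first, also handles repeated words.

```r
string <- "thanks so much for your help all along. i'll let you know when...."
words <- strsplit(string, "\\s+")[[1]]

# Map each distinct word to every position where it occurs.
positions <- split(seq_along(words), words)
positions[["help"]]   # 6
positions[["know"]]   # 12

# Word frequencies, most frequent first.
freq <- sort(table(words), decreasing = TRUE)
head(freq, 3)

# Smallest distance between any occurrence of two words.
# Assumes both terms are present in `positions`.
minDistance <- function(positions, term1, term2) {
  min(abs(outer(positions[[term1]], positions[[term2]], "-")))
}
minDistance(positions, "help", "know")   # 6
```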
