Extracting 5 words before and after a specific word

How can I extract the words/sentence next to a specific word? Example:

"On June 28, Jane went to the cinema and ate popcorn"

I would like to choose 'Jane' and get the window [-2, 2], that is, two words on either side:

"June 28, Jane went to"

Here's an example extended to handle multiple occurrences. Basically, split on whitespace, find the word, expand the indices, then make a list of results.

s <- "On June 28, Jane went to the cinema and ate popcorn. The next day, Jane hiked on a trail."
words <- strsplit(s, '\\s+')[[1]]
inds <- grep('Jane', words)
lapply(inds, FUN = function(i) {
  paste(words[max(1, i-2):min(length(words), i+2)], collapse = ' ')
})
#> [[1]]
#> [1] "June 28, Jane went to"
#> 
#> [[2]]
#> [1] "next day, Jane hiked on"

Created on 2019-09-17 by the reprex package (v0.3.0)

We could write a function to help out. This makes it a little more flexible.

library(tidyverse)

txt <- "On June 28, Jane went to the cinema and ate popcorn"

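grab_text <- function(text, target, before, after){
  # Split once and find the first word that contains the target
  words <- str_split(text, "\\s+")[[1]]
  pos <- which(str_detect(words, target))[1]
  if (is.na(pos)) return(NA_character_)

  # Clamp the window so it never runs past either end of the sentence
  start <- max(1, pos - before)
  end <- min(length(words), pos + after)

  paste(words[start:end], collapse = " ")
}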

grab_text(text = txt, target = "Jane", before = 2, after  = 2)
#> [1] "June 28, Jane went to"

First we split the sentence, then we find the position of the target, then we grab the requested number of words before and after it, and finally we collapse the words back into a single string.
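The same four steps can also return one window per occurrence, with the window clamped at the sentence boundaries. The following is a minimal base R sketch under those assumptions; the name `grab_all` is hypothetical and not part of the answer, and the target is matched as a literal substring of each word.

```r
# Hypothetical helper: one context window per occurrence of `target`.
grab_all <- function(text, target, before, after) {
  # 1. Split the sentence on runs of whitespace
  words <- strsplit(text, "\\s+")[[1]]
  # 2. Positions of the words that contain the target (literal match)
  hits <- grep(target, words, fixed = TRUE)
  vapply(hits, function(i) {
    # 3. Expand the indices, clamped to the ends of the sentence
    lo <- max(1, i - before)
    hi <- min(length(words), i + after)
    # 4. Collapse the selected words back into one string
    paste(words[lo:hi], collapse = " ")
  }, character(1))
}

s <- "Jane went to the cinema. The next day, Jane hiked on a trail."
res <- grab_all(s, "Jane", before = 2, after = 2)
print(res)
```

Because the first "Jane" has no words before it, its window simply starts at the beginning of the sentence instead of failing.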

I have a shorter version using str_extract from stringr:

library(stringr)
txt <- "On June 28, Jane went to the cinema and ate popcorn"
str_extract(txt,"([^\\s]+\\s+){2}Jane(\\s+[^\\s]+){2}")

[1] "June 28, Jane went to"

The function str_extract extracts the pattern from the string. The regex \\s matches whitespace, and [^\\s] is its negation, so anything but whitespace. The whole pattern is therefore Jane preceded by two groups of non-whitespace followed by whitespace, and followed by two groups of whitespace followed by non-whitespace.

The advantage is that str_extract is already vectorized, and if you have a vector of text you can use str_extract_all to get every match in each element:

s <- c("On June 28, Jane went to the cinema and ate popcorn. 
          The next day, Jane hiked on a trail.",
       "an indeed Jane loved it a lot")

str_extract_all(s,"([^\\s]+\\s+){2}Jane(\\s+[^\\s]+){2}")

[[1]]
[1] "June 28, Jane went to"   "next day, Jane hiked on"

[[2]]
[1] "an indeed Jane loved it"
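The pattern can also be built for any word and window size rather than typed by hand. The sketch below uses base R (`sprintf`, `regexpr`, `regmatches` with `perl = TRUE`) so it runs without stringr; `context_pattern` is a hypothetical helper name, and the word is inserted into the regex as-is, so a word containing regex metacharacters would need escaping first.

```r
# Hypothetical helper: build "<before> words, the word, <after> words".
# \\S+ is equivalent to [^\\s]+ (one or more non-whitespace characters).
context_pattern <- function(word, before, after) {
  sprintf("(\\S+\\s+){%d}%s(\\s+\\S+){%d}", before, word, after)
}

txt <- "On June 28, Jane went to the cinema and ate popcorn"
pat <- context_pattern("Jane", 2, 2)

# regexpr finds the first match; regmatches pulls out the matched text
hit <- regmatches(txt, regexpr(pat, txt, perl = TRUE))
print(hit)
```

The generated pattern string can be passed to str_extract or str_extract_all in the same way as the hand-written one.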

This should work:

stringr::str_extract(text, "(?:[^\\s]+\\s){5}Jane(?:\\s[^\\s]+){5}")
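One caveat: `{5}` demands exactly five words on each side, so a sentence with fewer words around the target produces no match at all. A possible workaround, sketched here in base R as an assumption rather than as part of the answer, is the bounded quantifier `{0,5}`, which takes up to five words on each side.

```r
txt <- "On June 28, Jane went to the cinema and ate popcorn"

# Exactly five words on each side: only three words precede "Jane" here
strict  <- "(?:\\S+\\s+){5}Jane(?:\\s+\\S+){5}"
# Up to five words on each side
relaxed <- "(?:\\S+\\s+){0,5}Jane(?:\\s+\\S+){0,5}"

strict_hit  <- regmatches(txt, regexpr(strict, txt, perl = TRUE))
relaxed_hit <- regmatches(txt, regexpr(relaxed, txt, perl = TRUE))

print(length(strict_hit))  # no match for the strict pattern
print(relaxed_hit)
```

With stringr the same two patterns can be passed to str_extract, which returns NA where the strict pattern fails to match.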
