Is there a way in R to find a combination of words (or sentences) in a string within a certain range?

Question

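I am trying to find all strings that contain a combination of words/sentences, where other words may separate them but only up to a fixed limit.

Example: I want the combination of "bought" and "watch", with at most 2 words separating them.

I bought a beautiful and shiny watch -> not ok, because there are 4 words between "bought" and "watch" ("a beautiful and shiny")
I bought a shiny watch -> ok, because there are 2 words between "bought" and "watch" ("a shiny")

I can't find anything in R that comes close to what I want.

To find simple words/sentences in a string, I use str_extract_all from stringr, like this: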

my_analysis <- str_c("\\b(", str_c(my_list_of_words_and_sentences, collapse="|"), ")\\b")
df$words_and_sentences_found <- str_extract_all(df$my_strings, my_analysis)

Answer 1

One way to think about it:

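library(stringr)

my_list2 <- list("I bought a beautiful and shiny watch", "I bought a shiny watch", 
    "It was not bought but watch")
# Split every string on spaces and flatten the result into one vector of words
as_words <- unlist(str_split(my_list2, ' '))
t1 <- which(as_words == 'bought')  # positions of "bought"
t2 <- which(as_words == 'watch')   # positions of "watch"
t1
#> [1]  2  9 16
t2
#> [1]  7 12 18
# Position differences; the number of words in between is each value minus 1
t2-t1
#> [1] 5 3 2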
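
Note that the flattened vector above only lines up when every string contains each word exactly once. A minimal per-string sketch of the same idea, in base R, might look like the following. The helper name words_between is hypothetical (it is not part of the answer), and it assumes words are separated by spaces and matched exactly, including case.

```r
# Hypothetical helper (not from the original answer): for each string, return
# the smallest number of words found between `first` and a later `second`,
# or NA when the two words do not occur in that order.
words_between <- function(strings, first, second) {
  vapply(strings, function(s) {
    words <- strsplit(s, " +")[[1]]      # split on one or more spaces
    i <- which(words == first)           # positions of `first`
    j <- which(words == second)          # positions of `second`
    if (length(i) == 0 || length(j) == 0) return(NA_integer_)
    d <- outer(j, i, "-")                # position differences for every pair
    d <- d[d > 0]                        # keep pairs where `second` follows `first`
    if (length(d) == 0) return(NA_integer_)
    as.integer(min(d) - 1L)              # words in between = difference - 1
  }, integer(1), USE.NAMES = FALSE)
}

my_list2 <- c("I bought a beautiful and shiny watch",
              "I bought a shiny watch",
              "It was not bought but watch")

gaps <- words_between(my_list2, "bought", "watch")
within_limit <- !is.na(gaps) & gaps <= 2   # at most 2 words in between
```

Here gaps holds the count of intervening words per string, and within_limit flags the strings that satisfy the "at most 2 words" rule from the question.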

Answer 2

You can use skip-grams for this:

library(tidyverse)
library(tidytext)

df <- tibble(id = 1:3,
             txt = c("I bought a beautiful and shiny watch", 
                     "I bought a shiny watch", 
                     "The watch is very shiny"))

tidy_ngrams <- df %>%
  ## use k for the skip, and n for what degree of n-gram:
  unnest_tokens(ngram, txt, token = "skip_ngrams", n_min = 2, n = 2, k = 2) 

tidy_ngrams
#> # A tibble: 33 × 2
#>       id ngram           
#>    <int> <chr>           
#>  1     1 i bought        
#>  2     1 i a             
#>  3     1 i beautiful     
#>  4     1 bought a        
#>  5     1 bought beautiful
#>  6     1 bought and      
#>  7     1 a beautiful     
#>  8     1 a and           
#>  9     1 a shiny         
#> 10     1 beautiful and   
#> # … with 23 more rows

tidy_ngrams %>%
  filter(ngram == "bought watch")
#> # A tibble: 1 × 2
#>      id ngram       
#>   <int> <chr>       
#> 1     2 bought watch

Created on 2022-06-03 by the reprex package (v2.0.1)
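
If only a yes/no answer per string is needed, the same "at most 2 words in between" rule could also be written as a regular expression. This is a different technique from the skip-gram approach, shown here only as a rough sketch in base R. The helper near_pattern is hypothetical, and it assumes both words are plain text with no regex metacharacters.

```r
# Hypothetical helper (not from the original answers): build a regex that
# matches `first`, then at most `max_gap` intervening words, then `second`.
# Assumes `first` and `second` contain no regex metacharacters.
near_pattern <- function(first, second, max_gap = 2) {
  paste0("\\b", first, "\\W+(?:\\w+\\W+){0,", max_gap, "}", second, "\\b")
}

txt <- c("I bought a beautiful and shiny watch",
         "I bought a shiny watch",
         "The watch is very shiny")

# perl = TRUE enables the non-capturing group syntax (?: ... )
hits <- grepl(near_pattern("bought", "watch", 2), txt, perl = TRUE)
```

With these inputs only the second string matches. The pattern is case-sensitive and order-sensitive ("bought" must come before "watch"); the same pattern string should also be usable with stringr functions such as str_detect.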

2 answers

Answer 1
1 2022-06-01 16:49:08

Answer 2
0 Accepted 2022-06-03 21:38:28
