Regex in R: match one optional word, followed by a specific phrase, followed by at most 3 words

Question

I'm looking for a specific regex pattern that I can't seem to get right.

In pseudo-notation:

pattern <- "[1 word|no word][this is][1-3 words max]"

text <- c("this guy cannot get a mortgage, this is a fake application", "this is a new application", "hi this is a specific question", "this is real", "this is not what you are looking for")

str_match(text, pattern)

The output I want is:

[1]FALSE  #cause too many words in front
[2]TRUE   
[3]TRUE
[4]TRUE
[5]FALSE  #cause too many words behind it

This should be doable, but I'm struggling with how to express "words" and a maximum count in a regex. Can someone help me with this?

Answer 1

grepl("^(\\S+\\s*)?this is\\s*\\S+\\s*\\S*\\s*\\S*$", text, perl = TRUE)
# [1] FALSE  TRUE  TRUE  TRUE FALSE

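This seems a bit brute-force, but it allows for

^(\\S+\\s*)? zero or one word before,
the literal this is (followed by zero or more spaces), then
at least one word, \\S+ (one or more non-whitespace characters), then
possibly spaces and another word, \\s*\\S*, twice, which allows up to three words in total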
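As a minimal sketch of the breakdown above, the same pattern can be assembled from its pieces. The variable names (before, phrase, first_word, extra_word) are illustrative and not part of the original answer; the resulting string is identical to the pattern passed to grepl.

```r
# Assemble the pattern from named pieces (names are illustrative only).
before     <- "^(\\S+\\s*)?"   # zero or one leading word
phrase     <- "this is\\s*"    # the literal phrase, then optional spaces
first_word <- "\\S+"           # one mandatory word
extra_word <- "\\s*\\S*"       # one optional further word

# Two optional words after the mandatory one: three words at most.
pattern <- paste0(before, phrase, first_word, extra_word, extra_word, "$")

text <- c("this guy cannot get a mortgage, this is a fake application",
          "this is a new application",
          "hi this is a specific question",
          "this is real",
          "this is not what you are looking for")

result <- grepl(pattern, text, perl = TRUE)
print(result)
```

Because \\s* also matches zero spaces, a string such as "hithis is real" would pass as well; if that matters, \\s+ after the leading word is one possible tightening.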

Depending on how you intend to use it, you can use strcapture (still base R) to extract the words into one or more columns:

strcapture("^(\\S+\\s*)?this is\\s*(\\S+\\s*\\S*\\s*\\S*)$", text, 
           proto = list(ign="",w1=""), perl = TRUE)[,-1,drop=FALSE]
#                    w1
# 1                <NA>
# 2   a new application
# 3 a specific question
# 4                real
# 5                <NA>

strcapture("^(\\S+\\s*)?this is\\s*(\\S+)\\s*(\\S*)\\s*(\\S*)$", text, 
           proto = list(ign="",w1="",w2="",w3=""), perl = TRUE)[,-1,drop=FALSE]
#     w1       w2          w3
# 1 <NA>     <NA>        <NA>
# 2    a      new application
# 3    a specific    question
# 4 real                     
# 5 <NA>     <NA>        <NA>

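The [,-1,drop=FALSE] is there because we need the (..) group around the word before "this is" so that it can be optional, but we don't need to keep what it captures, so I drop that column immediately. (The drop=FALSE is needed because a base R data.frame reduces a single-column result to a vector by default.)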
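A small sketch of the drop=FALSE behavior, using a hand-built data.frame (the values here are made up for illustration and do not come from strcapture):

```r
# A two-column frame shaped like the strcapture result (illustrative values).
df <- data.frame(ign = c("", "hi "),
                 w1  = c("real", "a specific question"),
                 stringsAsFactors = FALSE)

as_vector <- df[, -1]                # single remaining column collapses to a vector
as_frame  <- df[, -1, drop = FALSE]  # stays a one-column data.frame

print(is.data.frame(as_vector))
print(is.data.frame(as_frame))
```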

A slight improvement (less brute force) that lets you set the number of accepted words programmatically:

text2 <- c("this is one", "this is one two", "this is one two three", "this is one two three four", "this is one two three four five", "this not is", "hi this is")
grepl("^(\\S+\\s*)?this is\\s*(\\S+\\s*){1,4}$", text2, perl = TRUE)
# [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
grepl("^(\\S+\\s*)?this is\\s*(\\S+\\s*){1,2}$", text2, perl = TRUE)
# [1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
grepl("^(\\S+\\s*)?this is\\s*(\\S+\\s*){1,99}$", text2, perl = TRUE)
# [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
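To make the "programmatic" part concrete, here is one possible sketch that builds the pattern from a number. make_pattern is a hypothetical helper name, not something from base R or from the answer itself.

```r
# Hypothetical helper: build the pattern for a given maximum word count.
make_pattern <- function(max_words) {
  stopifnot(is.numeric(max_words), length(max_words) == 1, max_words >= 1)
  sprintf("^(\\S+\\s*)?this is\\s*(\\S+\\s*){1,%d}$", as.integer(max_words))
}

text2 <- c("this is one", "this is one two", "this is one two three",
           "this is one two three four", "this is one two three four five",
           "this not is", "hi this is")

# Accept at most three words after "this is".
up_to_three <- grepl(make_pattern(3), text2, perl = TRUE)
print(up_to_three)
```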

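This doesn't necessarily work with strcapture, because the pattern no longer has a predefined number of groups. A repeated group keeps only its last repetition, so only the last word is captured: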

strcapture("^(\\S+\\s*)?this is\\s*(\\S+\\s*){1,3}$", text2, 
           proto = list(ign="",w1=""), perl = TRUE)
#    ign    w1
# 1        one
# 2        two
# 3      three
# 4 <NA>  <NA>
# 5 <NA>  <NA>
# 6 <NA>  <NA>
# 7 <NA>  <NA>
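One possible workaround, sketched here as an assumption rather than as part of the original answer: wrap the repeated part in a single outer capture group (with a non-capturing (?:..) group inside it) so the whole tail is captured at once, then split it on whitespace. The column name rest is illustrative.

```r
text2 <- c("this is one", "this is one two", "this is one two three",
           "this is one two three four", "this is one two three four five",
           "this not is", "hi this is")

# The outer group captures all repetitions; the inner (?:..) does not capture.
pat <- "^(\\S+\\s*)?this is\\s*((?:\\S+\\s*){1,3})$"
caught <- strcapture(pat, text2, proto = list(ign = "", rest = ""), perl = TRUE)

# Split the captured tail into words; non-matching rows stay NA.
words <- strsplit(trimws(as.character(caught$rest)), "\\s+")
print(words[1:3])
```

Each list element then holds one to three words for a matching string and NA for a non-matching one.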
