

Specifying a word followed by a specific word followed by max of 3 words in regex in R

I'm looking for a specific regex pattern that I can't seem to get right:

Cryptically, something like:

pattern <- "[1 word|no word][this is][1-3 words max]"

text <- c("this guy cannot get a mortgage, this is a fake application", "this is a new application", "hi this is a specific question", "this is real", "this is not what you are looking for")

library(stringr)
str_match(text, pattern)  # stringr takes the string first, then the pattern

The output I'd like to have is:

[1]FALSE  # because too many words in front
[2]TRUE
[3]TRUE
[4]TRUE
[5]FALSE  # because too many words behind it

It should be doable, but I'm struggling with expressing words, and a maximum number of them, in regex. Can anyone help me with this one?

grepl("^(\\S+\\s*)?this is\\s*\\S+\\s*\\S*\\s*\\S*$", text, perl = TRUE)
# [1] FALSE  TRUE  TRUE  TRUE FALSE

This seems a little brute-force, but it allows:

  • ^(\\S+\\s*)? zero or one word before
  • the literal this is (followed by zero or more whitespace characters), then
  • at a minimum, \\S+ one word (at least one non-whitespace character), then
  • possibly space-and-a-word \\s*\\S*, twice, allowing up to three words
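
The pieces listed above can be assembled one at a time, which may make the pattern easier to read. This is only a sketch: the variable names (before, literal, first, more) are made up for illustration and are not part of the original answer.

```r
text <- c("this guy cannot get a mortgage, this is a fake application",
          "this is a new application",
          "hi this is a specific question",
          "this is real",
          "this is not what you are looking for")

# Hypothetical names for each piece of the pattern
before  <- "^(\\S+\\s*)?"       # zero or one word before
literal <- "this is\\s*"        # the literal, then optional whitespace
first   <- "\\S+"               # at least one word after the literal
more    <- "\\s*\\S*\\s*\\S*"   # up to two more optional words

pattern <- paste0(before, literal, first, more, "$")
result  <- grepl(pattern, text, perl = TRUE)
print(result)
```

Pasting the pieces together yields the same string as the one-line pattern, so the result should agree with the grepl call shown earlier.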

Depending on how you intend to use this, you can extract the words into a single column or multiple columns using strcapture (still base R):

strcapture("^(\\S+\\s*)?this is\\s*(\\S+\\s*\\S*\\s*\\S*)$", text, 
           proto = list(ign="",w1=""), perl = TRUE)[,-1,drop=FALSE]
#                    w1
# 1                <NA>
# 2   a new application
# 3 a specific question
# 4                real
# 5                <NA>

strcapture("^(\\S+\\s*)?this is\\s*(\\S+)\\s*(\\S*)\\s*(\\S*)$", text, 
           proto = list(ign="",w1="",w2="",w3=""), perl = TRUE)[,-1,drop=FALSE]
#     w1       w2          w3
# 1 <NA>     <NA>        <NA>
# 2    a      new application
# 3    a specific    question
# 4 real                     
# 5 <NA>     <NA>        <NA>

The [,-1,drop=FALSE] is there because we need a (..) capture group around the word before "this is" so that it can be optional, but we don't need to keep what it captures, so I drop that column right away. (The drop=FALSE is needed because base R's data.frame subsetting defaults to reducing a single-column result to a vector.)
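
To see the effect of drop=FALSE in isolation, here is a small sketch with a hand-built two-column data.frame. The data and the names df, as_vector, and as_frame are invented for illustration.

```r
# Hypothetical stand-in for a two-column strcapture result
df <- data.frame(ign = c("", "hi "), w1 = c("real", "a"),
                 stringsAsFactors = FALSE)

as_vector <- df[, -1]                # the one remaining column collapses to a vector
as_frame  <- df[, -1, drop = FALSE]  # stays a one-column data.frame

print(is.data.frame(as_vector))
print(is.data.frame(as_frame))
```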


A slight improvement (less brute-force) that allows for programmatically determining the number of words to accept:

text2 <- c("this is one", "this is one two", "this is one two three", "this is one two three four", "this is one two three four five", "this not is", "hi this is")
grepl("^(\\S+\\s*)?this is\\s*(\\S+\\s*){1,4}$", text2, perl = TRUE)
# [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
grepl("^(\\S+\\s*)?this is\\s*(\\S+\\s*){1,2}$", text2, perl = TRUE)
# [1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
grepl("^(\\S+\\s*)?this is\\s*(\\S+\\s*){1,99}$", text2, perl = TRUE)
# [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
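
Since only the upper bound of the quantifier changes between these calls, the pattern can be generated from a number. The helper below, max_words_pattern, is a hypothetical name and not part of the original answer; it is a minimal sketch that assumes max_words is a whole number of at least 1.

```r
text2 <- c("this is one", "this is one two", "this is one two three",
           "this is one two three four", "this is one two three four five",
           "this not is", "hi this is")

# Hypothetical helper: build the pattern for a given maximum word count
max_words_pattern <- function(max_words) {
  stopifnot(max_words >= 1)
  sprintf("^(\\S+\\s*)?this is\\s*(\\S+\\s*){1,%d}$", max_words)
}

up_to_two  <- grepl(max_words_pattern(2), text2, perl = TRUE)
up_to_four <- grepl(max_words_pattern(4), text2, perl = TRUE)
print(up_to_two)
print(up_to_four)
```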

This doesn't necessarily work with strcapture, since the pattern does not have a pre-defined number of groups. Namely, a repeated group keeps only its last repetition, so it will only capture the last of the words:

strcapture("^(\\S+\\s*)?this is\\s*(\\S+\\s*){1,3}$", text2, 
           proto = list(ign="",w1=""), perl = TRUE)
#    ign    w1
# 1        one
# 2        two
# 3      three
# 4 <NA>  <NA>
# 5 <NA>  <NA>
# 6 <NA>  <NA>
# 7 <NA>  <NA>
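
One possible workaround, not from the original answer, is to make the repeated group non-capturing and wrap the whole repetition in a single outer group, then split the captured string on whitespace afterwards. The names caught, words, and word_list are illustrative only.

```r
text2 <- c("this is one", "this is one two", "this is one two three",
           "this is one two three four", "this is one two three four five",
           "this not is", "hi this is")

# (?: ... ) does not capture; the outer ( ... ) captures all repetitions at once
caught <- strcapture("^(\\S+\\s*)?this is\\s*((?:\\S+\\s*){1,3})$", text2,
                     proto = list(ign = "", words = ""), perl = TRUE)

# Split each captured run into individual words; non-matches stay NA
word_list <- strsplit(caught$words, "\\s+")
print(caught$words)
```

With this approach the number of words per row can vary, so the result is a list rather than a fixed set of columns.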


 