Specifying a word followed by a specific word followed by max of 3 words in regex in R
I'm looking for a specific regex pattern which I can't seem to get right.
Cryptically, in pseudo-notation:
pattern <- "[1 word|no word][this is][1-3 words max]"
text <- c("this guy cannot get a mortgage, this is a fake application", "this is a new application", "hi this is a specific question", "this is real", "this is not what you are looking for")
str_match(text, pattern)  # stringr::str_match() takes the string first, then the pattern
The output I'd like to have is:
[1]FALSE #cause too many words in front
[2]TRUE
[3]TRUE
[4]TRUE
[5]FALSE #cause too many words behind it
It should be doable, but I'm struggling with expressing words and a maximum number of them in regex. Can anyone help me with this one?
grepl("^(\\S+\\s*)?this is\\s*\\S+\\s*\\S*\\s*\\S*$", text, perl = TRUE)
# [1] FALSE TRUE TRUE TRUE FALSE
This seems a little brute-force, but it allows, in order:
  ^(\\S+\\s*)?   zero or one word before "this is", followed by zero or more whitespace characters
  this is\\s*    the literal phrase, followed by zero or more whitespace characters
  \\S+           one word (at least one non-whitespace character)
  \\s*\\S*       twice, so up to three words in total
Depending on how you intend to use this, you can extract the words into a single column or multiple columns using strcapture (still base R):
strcapture("^(\\S+\\s*)?this is\\s*(\\S+\\s*\\S*\\s*\\S*)$", text,
proto = list(ign="",w1=""), perl = TRUE)[,-1,drop=FALSE]
# w1
# 1 <NA>
# 2 a new application
# 3 a specific question
# 4 real
# 5 <NA>
strcapture("^(\\S+\\s*)?this is\\s*(\\S+)\\s*(\\S*)\\s*(\\S*)$", text,
proto = list(ign="",w1="",w2="",w3=""), perl = TRUE)[,-1,drop=FALSE]
# w1 w2 w3
# 1 <NA> <NA> <NA>
# 2 a new application
# 3 a specific question
# 4 real
# 5 <NA> <NA> <NA>
The [,-1,drop=FALSE] is there because we need the (..) group around the words before "this is" so that they can be optional, but we don't need to keep what it captures, so I drop that column right away. (The drop=FALSE is needed because base R data.frame subsetting defaults to reducing a single-column result to a vector.)
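As a small illustration of the drop=FALSE point, here is a sketch on a toy data frame. The data frame df and its values are made up for this example; only the column names ign and w1 mirror the strcapture calls above.

```r
# Toy stand-in for a strcapture() result (values are invented for illustration).
df <- data.frame(ign = c("", "hi "),
                 w1  = c("a new application", "a specific question"),
                 stringsAsFactors = FALSE)

# Dropping the first column leaves a single column, so `[` simplifies
# the result to a plain character vector by default.
as_vector <- df[, -1]

# With drop = FALSE the same selection stays a one-column data.frame.
as_frame <- df[, -1, drop = FALSE]

is.data.frame(as_vector)  # FALSE
is.data.frame(as_frame)   # TRUE
```

Keeping the data.frame shape matters mostly when later code relies on column names or on row-wise operations.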
Slight improvement (less brute-force) that allows for programmatically determining the number of words to accept:
text2 <- c("this is one", "this is one two", "this is one two three", "this is one two three four", "this is one two three four five", "this not is", "hi this is")
grepl("^(\\S+\\s*)?this is\\s*(\\S+\\s*){1,4}$", text2, perl = TRUE)
# [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE
grepl("^(\\S+\\s*)?this is\\s*(\\S+\\s*){1,2}$", text2, perl = TRUE)
# [1] TRUE TRUE FALSE FALSE FALSE FALSE FALSE
grepl("^(\\S+\\s*)?this is\\s*(\\S+\\s*){1,99}$", text2, perl = TRUE)
# [1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE
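One way to make the word limit a real parameter is to build the pattern with sprintf. This is only a sketch: make_pattern and its arguments are hypothetical names, not part of the original answer, and it assumes the phrase contains no regex metacharacters.

```r
# Hypothetical helper: builds the same pattern as above for a given phrase
# and word limits. Assumes `phrase` has no regex metacharacters.
make_pattern <- function(phrase = "this is", min_words = 1L, max_words = 3L) {
  sprintf("^(\\S+\\s*)?%s\\s*(\\S+\\s*){%d,%d}$",
          phrase, as.integer(min_words), as.integer(max_words))
}

text2 <- c("this is one", "this is one two", "this is one two three",
           "this is one two three four", "this is one two three four five",
           "this not is", "hi this is")

# Equivalent to the hand-written {1,4} call above.
res4 <- grepl(make_pattern(max_words = 4L), text2, perl = TRUE)
res4
```

Since the helper only pastes numbers into the {min,max} quantifier, it gives the same results as the literal patterns.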
This quantified-group pattern doesn't necessarily work with strcapture, since it does not have a pre-defined number of groups. Namely, a repeated group only keeps its last repetition, so it will only capture the last of the words:
strcapture("^(\\S+\\s*)?this is\\s*(\\S+\\s*){1,3}$", text2,
proto = list(ign="",w1=""), perl = TRUE)
# ign w1
# 1 one
# 2 two
# 3 three
# 4 <NA> <NA>
# 5 <NA> <NA>
# 6 <NA> <NA>
# 7 <NA> <NA>
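If all of the trailing words are needed, one possible workaround (an assumption on my part, not something from the original answer) is to wrap the whole repeated part in a single capture group, make the inner repeat non-capturing, and split the captured tail afterwards. The names pat, tail_words and words are made up for this sketch.

```r
text2 <- c("this is one", "this is one two", "this is one two three",
           "this is one two three four", "hi this is")

# Group 1: optional leading word. Group 2: the whole 1-3 word tail.
# (?:...) keeps the inner repeat from adding a capture group of its own.
pat <- "^(\\S+\\s*)?this is\\s*((?:\\S+\\s*){1,3})$"

tail_words <- strcapture(pat, text2,
                         proto = list(ign = "", words = ""),
                         perl = TRUE)$words

# Split each captured tail on whitespace; inputs that did not match stay NA.
words <- strsplit(trimws(tail_words), "\\s+")
words
```

The result is a list with one character vector per input, which avoids having to know the number of words in advance.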