
regex for 3 consecutive words if there are any

I am looking for a regex that extracts up to 3 consecutive words, if there are any. For example, if I have 2 strings:

"1. Stack is great and awesome"
"2. Stack"

The result should be:

"Stack is great"
"Stack" 

The answer to "regex: matching 3 consecutive words" doesn't work for me.

My effort:

(?:[A-ZŠČĆŽa-zščćž]+ )(?:[A-ZŠČĆŽa-zščćž]+ )(?:[A-ZŠČĆŽa-zščćž]+ )

You may use

> x <- c("1. Stack is great and awesome", "2. Stack")
> regmatches(x, regexpr("[A-Za-z]+(?:\\s+[A-Za-z]+){0,2}", x))
[1] "Stack is great" "Stack"
## Or to support all Unicode letters
> y <- c("1. Stąck is great and awesome", "2. Stack")
> regmatches(y, regexpr("\\p{L}+(?:\\s+\\p{L}+){0,2}", y, perl=TRUE))
[1] "Stąck is great" "Stack"
## In some R environments, it makes sense to use the default TRE regex engine with POSIX character classes instead:
> regmatches(y, regexpr("[[:alpha:]]+(?:[[:space:]]+[[:alpha:]]+){0,2}", y))
[1] "Stąck is great" "Stack"

See the regex demo, the online R demo, and an alternative regex demo.

Note that the regex extracts the first run of 1, 2 or 3 consecutive words made up of letters from each string. If you need at least 2 words, replace the {0,2} limiting quantifier with {1,2}.
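
As a minimal sketch of that change, reusing the same two example strings: with {1,2} a string that holds only one word produces no match at all, and regmatches drops it from the result, so the output can be shorter than the input.

```r
x <- c("1. Stack is great and awesome", "2. Stack")

# {1,2} demands one or two extra words, i.e. at least two words in total
two_or_three <- regmatches(x, regexpr("[A-Za-z]+(?:\\s+[A-Za-z]+){1,2}", x))
print(two_or_three)
```

Here only the first string contributes a match; "2. Stack" is skipped because it has no second word.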

To extract multiple matches, use gregexpr rather than regexpr.
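
A short sketch of the gregexpr variant on the same example strings. Matches do not overlap, so after the first chunk of three words the search continues with whatever words remain, and the last chunk may be shorter.

```r
x <- c("1. Stack is great and awesome", "2. Stack")

# gregexpr finds every non-overlapping match; regmatches then returns a list
# with one character vector per input string
all_chunks <- regmatches(x, gregexpr("[A-Za-z]+(?:\\s+[A-Za-z]+){0,2}", x))
print(all_chunks)
```

For the first string this gives two chunks ("Stack is great" and "and awesome"), for the second just "Stack".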

Pattern details

  • \\p{L}+ / [A-Za-z]+ - 1 or more Unicode letters (or ASCII letters if [A-Za-z] is used)
  • (?:\\s+\\p{L}+){0,2} / (?:\\s+[A-Za-z]+){0,2} - 0, 1 or 2 consecutive occurrences of:
    • \\s+ - 1 or more whitespace characters
    • \\p{L}+ / [A-Za-z]+ - 1 or more Unicode letters (or ASCII letters if [A-Za-z] is used)

Remember to pass perl=TRUE with any regex that uses the \\p{L} construct. If it still does not work, try adding the (*UCP) PCRE verb at the very beginning of the pattern; it makes all the generic, Unicode and shorthand character classes truly Unicode-aware.
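
A minimal sketch of the (*UCP) variant. Whether the verb is actually needed depends on the R build and locale, so treat this as something to try rather than a required step; the letter ą is written as the escape \u0105 here only so that the snippet does not depend on the encoding of the source file.

```r
# "\u0105" is the letter ą
y <- c("1. St\u0105ck is great and awesome", "2. Stack")

# (*UCP) has to be the very first thing in the pattern, and perl = TRUE is
# required because it is a PCRE-only verb
ucp <- regmatches(y, regexpr("(*UCP)\\p{L}+(?:\\s+\\p{L}+){0,2}", y, perl = TRUE))
print(ucp)
```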

Note that all these regexes also work with stringr::str_extract and stringr::str_extract_all:

> str_extract(x, "\\p{L}+(?:\\s+\\p{L}+){0,2}")
[1] "Stack is great" "Stack"         
> str_extract(x, "[a-zA-Z]+(?:\\s+[a-zA-Z]+){0,2}")
[1] "Stack is great" "Stack"         
> str_extract(x, "[[:alpha:]]+(?:\\s+[[:alpha:]]+){0,2}")
[1] "Stack is great" "Stack" 

(*UCP) is not supported here because stringr functions are powered by the ICU regex library, not PCRE. Unicode test:

> str_extract(y, "\\p{L}+(?:\\s+\\p{L}+){0,2}")
[1] "Stąck is great" "Stack"
> str_extract(y, "[[:alpha:]]+(?:\\s+[[:alpha:]]+){0,2}")
[1] "Stąck is great" "Stack"
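
For completeness, str_extract_all is the stringr counterpart of gregexpr and returns every match per string. A sketch, assuming the stringr package is installed; the requireNamespace guard is only there so the snippet does nothing instead of failing when it is not.

```r
x <- c("1. Stack is great and awesome", "2. Stack")
pattern <- "[[:alpha:]]+(?:\\s+[[:alpha:]]+){0,2}"

if (requireNamespace("stringr", quietly = TRUE)) {
  # Returns a list with one character vector of matches per input string
  all_matches <- stringr::str_extract_all(x, pattern)
  print(all_matches)
}
```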

