
grab n letter words don't count apostrophes regex

I'm trying to learn regex in R more deeply. I gave myself what I thought was an easy task that I can't figure out. I want to extract all 4 letter words. In these four letter words I want to ignore (don't count) apostrophes. I can do this without regex but want a regex solution. Here's a MWE and what I've tried:

text.var <- "This Jon's dogs' 'bout there in Mike's re'y word."
pattern <- "\\b[A-Za-z]{4}\\b(?!')"
pattern <- "\\b[A-Za-z]{4}\\b|\\b[A-Za-z']{5}\\b"

regmatches(text.var, gregexpr(pattern, text.var, perl = TRUE)) 

Desired output:

[[1]]
[1] "This"  "Jon's"  "dogs'"  "'bout"  "word"

I thought the second pattern would work but it grabs words containing 5 characters as well.
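
A quick check of why the second pattern over-matches, as a sketch using the question's own `text.var` (the names `pattern2` and `res` are only for illustration). The second alternative accepts any five characters from `[A-Za-z']`, so a plain five-letter word such as "there" qualifies. In addition, `\b` treats an apostrophe as a non-word character, so the first alternative can stop right before an apostrophe inside a longer word:

```r
text.var <- "This Jon's dogs' 'bout there in Mike's re'y word."

# Second pattern from the question, unchanged.
pattern2 <- "\\b[A-Za-z]{4}\\b|\\b[A-Za-z']{5}\\b"
res <- regmatches(text.var, gregexpr(pattern2, text.var, perl = TRUE))[[1]]

"there" %in% res  # a five-letter word matched by the second alternative
"Mike" %in% res   # the first four letters of "Mike's", cut off at the apostrophe
```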

This is a good challenging question and here is a tricky answer.

> x  <- "This Jon's dogs' 'bout there in Mike's re'y word."
> re <- "(?i)('?[a-z]){5,}(*SKIP)(?!)|('?[a-z]){4}'?"
> regmatches(x, gregexpr(re, x, perl=T))[[1]]
## [1] "This"  "Jon's" "dogs'" "'bout" "word" 

Explanation :

The idea is to skip any word patterns that consist of 5 or more letter characters and an optional apostrophe.

On the left side of the alternation operator we match the subpattern we do not want, then make it fail and use a backtracking control verb to stop the regular expression engine from retrying inside that substring. As explained below:

(*SKIP) # advances to the position in the string where (*SKIP) was 
        # encountered signifying that what was matched leading up 
        # to cannot be part of the match

(?!)    # equivalent to (*FAIL), causes matching failure, 
        # forcing backtracking to occur

The right side of the alternation operator matches what we want...
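
To illustrate what `(*SKIP)(?!)` buys here, the following sketch compares the pattern above with a hypothetical variant (`no_skip`, not part of the original answer) that keeps only the right-hand side. Without the skip, the engine is free to match the first four letters of a longer word:

```r
x <- "This Jon's dogs' 'bout there in Mike's re'y word."

# The answer's pattern: long words are matched, then skipped and failed.
with_skip <- "(?i)('?[a-z]){5,}(*SKIP)(?!)|('?[a-z]){4}'?"
# Hypothetical comparison: the right-hand alternative on its own.
no_skip <- "(?i)('?[a-z]){4}'?"

res_skip  <- regmatches(x, gregexpr(with_skip, x, perl = TRUE))[[1]]
res_plain <- regmatches(x, gregexpr(no_skip, x, perl = TRUE))[[1]]

res_skip           # the five desired words
"ther" %in% res_plain  # TRUE: a fragment of "there" slips through
```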

Additional Explanation:

  • Essentially, in simple terms, you are using the discard technique.

     (?:'?[a-z]){5,}|((?:'?[a-z]){4}'?)

    You use the alternation operator in context, placing what you want to exclude on the left (saying "throw this away, it's garbage") and what you want to match in a capturing group on the right.
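
With the discard technique, the words of interest live in capture group 1 rather than in the overall match, so `regmatches()` alone is not enough. A minimal sketch of one way to pull out the group in base R, assuming the `capture.start` and `capture.length` attributes that `gregexpr()` returns with `perl = TRUE` (the variable names are illustrative):

```r
x  <- "This Jon's dogs' 'bout there in Mike's re'y word."
re <- "(?i)(?:'?[a-z]){5,}|((?:'?[a-z]){4}'?)"

m   <- gregexpr(re, x, perl = TRUE)[[1]]
st  <- attr(m, "capture.start")[, 1]   # start of group 1 for each match
len <- attr(m, "capture.length")[, 1]  # length of group 1 for each match

# Matches from the discarded left-hand side leave group 1 empty; drop those.
words <- substring(x, st, st + len - 1)[len > 0]
words
```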

You can use this pattern:

(?i)(?<![a-z'])(?:'?[a-z]){4}'?(?![a-z'])
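
Since the title asks for n-letter words, here is a sketch that parameterizes the pattern above on the letter count. The helper name `n_letter_words` is hypothetical and not from the answer; it simply builds the same pattern with `sprintf()`:

```r
# Hypothetical helper: extract words with exactly n letters,
# ignoring apostrophes when counting.
n_letter_words <- function(x, n) {
  pattern <- sprintf("(?i)(?<![a-z'])(?:'?[a-z]){%d}'?(?![a-z'])", as.integer(n))
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

x <- "This Jon's dogs' 'bout there in Mike's re'y word."
four <- n_letter_words(x, 4)
five <- n_letter_words(x, 5)
four
five
```

The lookbehind and lookahead do the job of a word boundary that also treats an apostrophe as part of the word, so a shorter run of letters inside a longer word is never picked up.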

You can use the discard technique and use a regex like this:

\b\w{0,2}\b(?:'\w)?|\b\w{3}(?!')\b|\b\w{5,}\b|('?\b\w+\b'?\w?)

Working demo


MATCH 1
1.  [0-4]   `This`
MATCH 2
1.  [5-10]  `Jon's`
MATCH 3
1.  [11-16] `dogs'`
MATCH 4
1.  [17-22] `'bout`
MATCH 5
1.  [32-36] `word`

In R, the backslashes must be escaped (doubled) inside the string literal.

As you can see in the regex pattern, you put whatever you don't want on the left side of the pattern and leave what you really want inside the capturing group at the far right. The idea behind the discard technique is:

discard this|don't want this|still don't care this|(Oh yeah! I grab this)

THANKS to EdConttrell and johnwait for helping me to improve the answer.

EDITED twice (thanks hex494D49):

(?i)(?<=\\W|^)(?<!')'*(?:\\w{4}|\\w'*\\w{3}|\\w{2}'*\\w{2}|\\w{3}'*\\w|\\w{2}'*\\w'*\\w|\\w'*\\w{2}'*\\w|\\w'*\\w'*\\w{2}|\\w'*\\w'*\\w'*\\w)'*(?!')(?=\\W|$)

Better to cover every possible case...

But the title of the question states:

grab n letter words don't count apostrophes regex

So I would not recommend my solution.

Another solution that I think may be slightly clearer / more concise:

Regex

(?<![\w'])(?:'?\w'?){4}(?![\w'])

Explanation

(?<![\w'])

This is a negative lookbehind assertion: it checks that the match is not preceded by the ' char or a word char ( \\w is the same as [A-Za-z0-9_], so it also accepts digits and the underscore ).

(?:'?\w'?){4}

This matches any word char, optionally preceded and/or followed by a ' , exactly four times. The (?: ... ) makes the group non-capturing.

(?![\w'])

This is a negative lookahead assertion, ensuring that the group is not followed by another apostrophe or word char.


The purpose of the first and last terms is to ensure that the 4 characters matched by the middle group are not surrounded by more characters, i.e. the word has exactly 4 letters.

They are more or less equivalent to a \\b word boundary detection, except that they count an apostrophe as part of a word which \\b does not.
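
For use in R, the backslashes need doubling inside the string. A minimal sketch applying the regex to the question's sentence (the second string, `x2`, is an invented example showing that `\w` also accepts digits and underscores, which may or may not be what you want):

```r
re <- "(?<![\\w'])(?:'?\\w'?){4}(?![\\w'])"

x <- "This Jon's dogs' 'bout there in Mike's re'y word."
res <- regmatches(x, gregexpr(re, x, perl = TRUE))[[1]]
res

# Invented input: \w counts digits and "_" as word characters too.
x2 <- "ab12 a_b3"
res2 <- regmatches(x2, gregexpr(re, x2, perl = TRUE))[[1]]
res2
```

If only letters should count, swapping `\\w` for `[A-Za-z]` in all three places is one option.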

Issues

The regex won't match strings that start or end with double apostrophes ( '' ). I don't think this is a huge loss.

Example

See this link on regex101.com.
