
Regex to match sentences in which either "this" or "that" occurs at least twice

I want to create a regex in PHP that finds the sentences in a text that contain "this" at least twice or "that" at least twice (a single "this" plus a single "that" should not count).

We got stuck at:

([^.?!]*(\bthis|that\b){2,}[^.?!]*[.|!|?]+)
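One reason this attempt stalls: the quantifier {2,} repeats the group back to back, so the two words would have to be directly adjacent, and the alternation reads as "\bthis" or "that\b" rather than as one whole word. A minimal sketch of that behaviour, using Python's re module as a stand-in for PHP's preg functions (an assumption; the pattern is copied unchanged and the sample strings are invented for illustration):

```python
import re

# The pattern from the question, copied unchanged.
attempt = re.compile(r"([^.?!]*(\bthis|that\b){2,}[^.?!]*[.|!|?]+)")

# Two occurrences separated by other words: no match, because {2,}
# only counts repetitions that directly follow one another.
separated = attempt.search("that is what that will be.")

# Two alternatives glued together: this is what {2,} actually accepts.
adjacent = attempt.search("thisthat.")

print(separated)          # None
print(adjacent.group(0))  # thisthat.
```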

Use this pattern: (\b(?:this|that)\b).*?\1

(               # Capturing Group (1)
  \b            # <word boundary>
  (?:           # Non Capturing Group
    this        # "this"
    |           # OR
    that        # "that"
  )             # End of Non Capturing Group
  \b            # <word boundary>
)               # End of Capturing Group (1)
.               # Any character except line break
*?              # (zero or more)(lazy)
\1              # Back reference to group (1)
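A quick way to see what this pattern returns is to run it against a few sentences. The sketch below uses Python's re module as a stand-in for PHP's preg_match (an assumption; the sample strings are invented for illustration). Note that the back reference \1 carries no word boundary of its own and the match is case-sensitive, so a later word that merely contains "this" also completes a match.

```python
import re

repeat = re.compile(r"(\b(?:this|that)\b).*?\1")

# Same whole word twice: the match runs from the first occurrence to the second.
m1 = repeat.search("I said that because that is true.")
print(m1.group(0))  # that because that

# One "this" plus one "that": \1 must repeat the captured word, so no match.
m2 = repeat.search("I like this and that.")
print(m2)  # None

# Caveat: \1 also matches inside a longer word such as "lathis".
m3 = repeat.search("He came with this lathis.")
print(m3.group(0))  # this lathis
```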

This is mostly Wiktor's pattern, with a deviation to isolate the sentences and omit the leading white-space characters from the full-string matches.

Pattern: /\b[^.?!]*\b(th(?:is|at))\b[^.?!]*(\b\1\b)[^.?!]*\b[.!?]/i

The sample text below demonstrates how the other answers fail to disqualify unwanted matches for "word boundary" or "case-insensitive" reasons (in the demo, a capture group is applied to \b\1\b to show which substrings qualify each sentence for matching):

This is nothing.
That is what that will be.
The Indian policeman hit the thief with his lathis before pushing him into the thistles.
This Indian policeman hit the thief with this lathis before pushing him into the thistles.  This is that and that.
The Indian policeman hit the thief with this lathis before pushing him into the thistles.

To see the official breakdown of the pattern, refer to the demo link.

In plain terms:

/                  #start of pattern
\b                 #match start of a sentence on a "word character"
[^.?!]*            #match zero or more characters not a dot, question mark, or exclamation
\b(th(?:is|at))\b  #match whole word "this" or "that"  (not thistle)
[^.?!]*            #match zero or more characters not a dot, question mark, or exclamation
\b\1\b             #match the earlier captured whole word "this" or "that"
[^.?!]*            #match zero or more characters not a dot, question mark, or exclamation
\b                 #require a "word character" right before the sentence-ending punctuation
[.!?]              #match the end of a sentence: dot, question mark, exclamation
/                  #end of pattern
i                  #make pattern case-insensitive

The pattern matches three of the six sentences (spread over five lines) in the sample text above:

That is what that will be.
This Indian policeman hit the thief with this lathis before pushing him into the thistles.
This is that and that.
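To check which sentences of the sample text qualify, the pattern can be run in a short script. This is a sketch that uses Python's re module as a stand-in for PHP's preg_match_all (an assumption, as are the variable names); the pattern is the one from the plain-terms breakdown, without the extra demo-only capture group, and the text is the five sample lines shown earlier.

```python
import re

# Pattern from the breakdown; re.IGNORECASE plays the role of the /i flag.
pattern = re.compile(
    r"\b[^.?!]*\b(th(?:is|at))\b[^.?!]*\b\1\b[^.?!]*\b[.!?]",
    re.IGNORECASE,
)

text = (
    "This is nothing.\n"
    "That is what that will be.\n"
    "The Indian policeman hit the thief with his lathis before pushing him into the thistles.\n"
    "This Indian policeman hit the thief with this lathis before pushing him into the thistles."
    "  This is that and that.\n"
    "The Indian policeman hit the thief with this lathis before pushing him into the thistles."
)

# group(0) is the full sentence; group(1) is the repeated word.
sentences = [m.group(0) for m in pattern.finditer(text)]
for sentence in sentences:
    print(sentence)
```

Because the pattern starts on a word boundary, each full match begins at the first word of its sentence rather than at the white space that follows the previous one.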

*Note: I previously used \s*\K at the start of my pattern to omit the leading white-space characters. I have since switched to additional word-boundary metacharacters for improved efficiency. If this doesn't work with your project text, it may be better to revert to my original pattern.

Use this:

.*(this|that).*(this|that).*

http://regexr.com/3ggq5
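For comparison, here is a sketch of how this looser pattern behaves, again with Python's re module standing in for PHP (an assumption; the sample sentences are invented for illustration). Because the two groups are independent and carry no word boundaries, the pattern also accepts one "this" plus one "that", as well as words that merely contain "this".

```python
import re

loose = re.compile(r".*(this|that).*(this|that).*")

# The same word twice: accepted, as intended.
same_word = bool(loose.search("say that because that is true."))

# One "this" and one "that": also accepted, although neither word repeats.
mixed = bool(loose.search("I like this and that."))

# No whole-word "this" at all ("thistles", "lathis"): still accepted.
substrings = bool(loose.search("He fell into the thistles with his lathis."))

print(same_word, mixed, substrings)  # True True True
```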

UPDATE:

Here is another way, based on your regex:

.*(this\s?|that\s?){2,}.*[\.\n]*

http://regexr.com/3ggq8
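The {2,} quantifier in this variant repeats the group directly, so the two words have to follow one another with at most one white-space character between them. A sketch of that behaviour, with Python's re module standing in for PHP (an assumption; the sample sentences are invented for illustration):

```python
import re

adjacent_only = re.compile(r".*(this\s?|that\s?){2,}.*[\.\n]*")

# "this that" back to back satisfies {2,}.
back_to_back = adjacent_only.search("do it this that way.")

# Two occurrences of "that" separated by other words do not.
spread_out = adjacent_only.search("that is what that will be.")

print(back_to_back is not None)  # True
print(spread_out)                # None
```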
