
make regex not capture the OR capture group

I am struggling to capture which language codes a string contains.

The language codes appear inside parentheses and are any combination of: En,Fr,De,Es,It

Example:

File (En,Fr,De,Es,It).doc    <== should match all 5 languages
File (En,Fr) (Required).doc  <== should match `En` and `Fr`
File (Enfoo,Fr).doc          <== should match only `Fr`
File (E,Fr).doc              <== should match only `Fr`

My current regex:

((\(|,)En(\)|,))|((\(|,)Fr(\)|,))|((\(|,)De(\)|,))|((\(|,)Es(\)|,))|((\(|,)It(\)|,))

What it means:

((\(|,)  <== either starts with `open parenthesis` or `comma`  (1)
En       <== the language                                      (2)
(\)|,))  <== either ends with `close parenthesis` or `comma`   (3)

Then I join the per-language patterns with the regex OR operator (|).

The problem, as you can see at regexr.com/3ev6p, is that a second language code such as Fr does not satisfy part (1) of the regex. The match for the first code, En, has already consumed the comma that follows it, so Fr no longer has an open parenthesis or comma in front of it and is not matched.


Do you know how to capture all of the language codes? I am planning to use PHP's preg_match_all() to collect them. Hope somebody can help. Thank you!

The regex you have consumes the commas around the language codes. That means that after a match is found, the regex engine resumes searching after the trailing comma. The next language code then has no comma or parenthesis left in front of it, so it cannot match and is skipped.
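The skipping is easy to reproduce. The sketch below uses Python's `re` module instead of PHP (an assumption made for illustration; `re.findall` plays roughly the role of `preg_match_all` here). `CONSUMING` and `find_consuming` are hypothetical names, and the pattern is a condensed form of the question's regex in which the delimiters are still part of each match:

```python
import re

# Condensed form of the question's regex: the delimiter on each side
# is matched (consumed), not merely checked.
CONSUMING = re.compile(r"[(,](En|Fr|De|Es|It)[),]")

def find_consuming(name):
    """Return the language codes found by the delimiter-consuming pattern."""
    return CONSUMING.findall(name)

# Every second code is lost, because the comma it needs in front of it
# was already consumed by the previous match.
print(find_consuming("File (En,Fr,De,Es,It).doc"))    # ['En', 'De', 'It']
print(find_consuming("File (En,Fr) (Required).doc"))  # ['En']
```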

To find matches whose delimiters would otherwise overlap like this, use lookarounds:

(?<=[(,])(En|Fr|De|Es|It)(?=[,)])
^^^^^^^^^                ^^^^^^^^

See this regex demo.

The (?<=[(,]) is a positive lookbehind that requires a , or ( immediately before the language code, and (?=[,)]) is a positive lookahead that requires a , or ) immediately after it. Neither delimiter is consumed, so the comma after one code is still available to serve as the leading delimiter of the next match.
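As a quick check against the four file names from the question, here is a minimal sketch in Python (used here only for illustration; the pattern itself is unchanged, and `LANGUAGES` and `find_languages` are hypothetical names):

```python
import re

# Lookbehind and lookahead check the delimiters without consuming them.
LANGUAGES = re.compile(r"(?<=[(,])(En|Fr|De|Es|It)(?=[,)])")

def find_languages(name):
    """Return every language code that is delimited by '(' or ',' and ',' or ')'."""
    return LANGUAGES.findall(name)

print(find_languages("File (En,Fr,De,Es,It).doc"))    # ['En', 'Fr', 'De', 'Es', 'It']
print(find_languages("File (En,Fr) (Required).doc"))  # ['En', 'Fr']
print(find_languages("File (Enfoo,Fr).doc"))          # ['Fr']
print(find_languages("File (E,Fr).doc"))              # ['Fr']
```

In PHP the same pattern would be passed to preg_match_all() inside regex delimiters.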

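Another possible solution here is to use word boundaries (as already described in the comments). Word boundaries match whole words only, so Enfoo is rejected. Note that this version does not check for the surrounding parentheses or commas, so it also matches a language code that appears as a whole word anywhere else in the string.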

\b(En|Fr|De|Es|It)\b

See the regex demo
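To illustrate how the word-boundary version behaves, here is a small Python sketch (Python is an assumption for illustration, and `find_whole_words` is a hypothetical name). It handles the question's examples, but it also accepts codes that are not delimited by parentheses or commas:

```python
import re

# \b only requires a word/non-word transition on each side of the code.
WHOLE_WORDS = re.compile(r"\b(En|Fr|De|Es|It)\b")

def find_whole_words(name):
    """Return every language code that appears as a whole word."""
    return WHOLE_WORDS.findall(name)

print(find_whole_words("File (Enfoo,Fr).doc"))  # ['Fr']
print(find_whole_words("File (En-US,Fr).doc"))  # ['En', 'Fr'] ('-' counts as a boundary)
print(find_whole_words("It (En).doc"))          # ['It', 'En'] ('It' is outside the parentheses)
```

Whether the last two results are acceptable depends on the file names you expect; the lookaround pattern is the stricter of the two.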

This should match all:

 (?<=,|\()(\w\w)(?=,|\))

Used together with preg_match_all, this should do the job.

Explained:

  • A lookbehind assertion (the match must be preceded by "," or "(")
  • Two word characters (so you don't have to specify in advance which languages you are targeting; note that \w also matches digits and underscores).
  • A lookahead assertion (the match must be followed by "," or ")")
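The three parts above can be tried out with a short Python sketch (Python is an assumption for illustration, and `find_two_char_codes` is a hypothetical name). It shows both the flexibility of the generic pattern and its trade-off, since any two word characters between the delimiters are accepted:

```python
import re

# Any two word characters between '(' or ',' and ',' or ')'.
TWO_CHARS = re.compile(r"(?<=,|\()(\w\w)(?=,|\))")

def find_two_char_codes(name):
    """Return every two-character token delimited by '(' or ',' and ',' or ')'."""
    return TWO_CHARS.findall(name)

print(find_two_char_codes("File (En,Fr) (Required).doc"))  # ['En', 'Fr']
print(find_two_char_codes("File (Enfoo,Fr).doc"))          # ['Fr']
print(find_two_char_codes("File (En,Xx,12).doc"))          # ['En', 'Xx', '12']
```

If only the five known codes should be accepted, the results can be filtered against that list afterwards, or the explicit alternation can be used instead of \w\w.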

And that's it. :)

Working version.

Regards.


 