Why is POSIX collating-related bracketed symbol higher-precedence than backslash?

POSIX, aka "The Open Group Base Specifications Issue 7, 2018 edition", has this to say about regular expression operator precedence:

9.4.8 ERE Precedence

The order of precedence shall be as shown in the following table:

ERE Precedence (from high to low)

Collation-related bracket symbols    [==] [::] [..]
Escaped characters                   \<special character>
Bracket expression                   []
Grouping                             ()
Single-character-ERE duplication     * + ? {m,n}
Concatenation                        ab
Anchoring                            ^ $
Alternation                          |

I am curious as to the reason for the first two levels being in that order. Being a Unix user from way back, I am accustomed to being able to "throw a backslash in front of it" to escape virtually anything. But it appears that with Collation-Related Bracket Symbols (CRBS), I can't do that. If I want to match a literal [.ch.] I can't just type \[.ch.] and rely on "dot matches dot" to handle things for me. I now have to write something like [[].ch.] (or possibly worse?).

I'm trying, and failing, to imagine what the scenario was when whoever-thought-this-up decided this should be the order. Is there a concrete scenario where having CRBS ranked higher than backslash makes sense, or was this a case of "we don't understand CRBS yet so let's make it higher priority" or... what, exactly?

At least for GNU grep, it looks like lib/dfa.c treats the CRBS as one lexical token, as per the function parse_bracket_exp().
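One way to see why the tokens inside a bracket expression might bind tighter than a backslash: within brackets, a character class such as [:digit:] is consumed as a unit, while the backslash is just an ordinary character. The following is a small sketch, assuming GNU grep with -E; the function names class_demo and backslash_demo are made up for illustration and are not part of grep or dfa.c.

```shell
# Inside a bracket expression, [:digit:] is consumed as a single unit.
class_demo() {
  printf '%s\n' 'a1' 'a:' 'ad' | grep -E 'a[[:digit:]]'
}

# Inside a bracket expression, a backslash is an ordinary character,
# so [\.] matches either a backslash or a dot.
backslash_demo() {
  printf '%s\n' 'a\b' 'a.b' 'axb' | grep -E 'a[\.]b'
}

class_demo       # prints a1
backslash_demo   # prints a\b and a.b
```

Since the backslash has no escaping power once the bracket expression has started, it cannot "reach inside" and defuse a [. .], [= =] or [: :] token there.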

For the example given, escaping the special characters (square brackets and dots) seems to give the results you are looking for. You can also match literal dots with [.] which might be easier to see in a regular expression.

$ (echo c;echo '[.ch.]';echo .ch.;echo xchx)|grep '\[\.ch\.\]'
[.ch.]
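For comparison, here is a sketch of two other spellings, again assuming GNU grep with -E. The sample function and the extra [xchx] input line are additions for illustration only. The first pattern is the one from the question, with a backslash on the opening bracket alone; the second uses one-character bracket expressions for the bracket and the dots.

```shell
# Sample input; the [xchx] line is added to show what unescaped dots let through.
sample() {
  printf '%s\n' 'c' '[.ch.]' '.ch.' 'xchx' '[xchx]'
}

# Backslash on the opening bracket only: the bracket is literal,
# but each dot still matches any character.
sample | grep -E '\[.ch.]'
# prints [.ch.] and [xchx]

# One-character bracket expressions for '[' and '.' make everything literal.
sample | grep -E '[[][.]ch[.]]'
# prints [.ch.]
```

So the backslash does escape the opening bracket here; the looser result of the first pattern comes from the unescaped dots, not from a collating symbol being recognized.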
