perl6 regex: match all punctuations except . and "

I read some threads on matching "X except Y", but none specific to perl6. I am trying to match and replace all punctuation except . and "

> my $a = ';# -+$12,678,93.45 "foo" *&';
;# -+$12,678,93.45 "foo" *&

> my $b = $a.subst(/<punct - [\.\"]>/, " ", :g);
===SORRY!===
Unrecognized regex metacharacter - (must be quoted to match literally)
------> my $b = $a.subst(/<punct⏏ - [\.\"]>/, " ", :g);
Unrecognized regex metacharacter   (must be quoted to match literally)
------> my $b = $a.subst(/<punct -⏏ [\.\"]>/, " ", :g);
Unable to parse expression in metachar:sym<assert>; couldn't find final '>' (corresponding starter was at line 1)
------> my $b = $a.subst(/<punct - ⏏[\.\"]>/, " ", :g);

> my $b = $a.subst(/<punct-[\.\"]>/, " ", :g);
===SORRY!=== Error while compiling:
Unable to parse expression in metachar:sym<assert>; couldn't find final '>' (corresponding starter was at line 1)
------> my $b = $a.subst(/<punct⏏-[\.\"]>/, " ", :g);
    expecting any of:
        argument list
        term

> my $b = $a.subst(/<punct>-<[\.\"]>/, " ", :g);
===SORRY!===
Unrecognized regex metacharacter - (must be quoted to match literally)
------> my $b = $a.subst(/<punct>⏏-<[\.\"]>/, " ", :g);
Unable to parse regex; couldn't find final '/'
------> my $b = $a.subst(/<punct>-⏏<[\.\"]>/, " ", :g);

> my $b = $a.subst(/<- [\.\"] + punct>/, " ", :g); # $b is blank space, not what I want
                       
> my $b = $a.subst(/<[\W] - [\.\"]>/, " ", :g);
      12 678 93.45 "foo"   
# This works, but it is clumsy. I want to say it elegantly:
# punctuation except \. and \", using the predefined class <punct>.

What is the best approach?

I think the most natural solution is to use a "character class arithmetic expression". This entails using + and - prefixes on any number of either Unicode properties or [...] character classes:

                            #;# -+$12,678,93.45 "foo" *&
<+:punct -[."]>             #    +$12 678 93.45 "foo"

This can be read as "the class of characters that have the Unicode property punct minus the . and " characters".


Your input string includes + and $ . These are not considered "punctuation" characters. You could explicitly add them to the set of characters being replaced by spaces:

<:punct +[+$] -[."] >       #      12 678 93.45 "foo"   

(I've dropped the initial + before :punct . If you don't write a + or - for the first item in a character class arithmetic expression then + is assumed.)

There's a Unicode property that covers all "symbols" including + and $ so you could use that instead:

<:punct +:symbol -[."] >    #      12 678 93.45 "foo"
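
For anyone who wants to run the fragments above, here is a minimal sketch that wraps two of them in complete subst calls. The variable names ($input, $punct-only, $punct-and-symbols) are placeholders chosen for this sketch and are not from the question.

```raku
# Same sample string as in the question.
my $input = ';# -+$12,678,93.45 "foo" *&';

# Punctuation only, minus . and " (so + and $ are left in place).
my $punct-only = $input.subst(/<:punct -[."]>/, ' ', :g);

# Punctuation plus symbols, minus . and " (so + and $ are replaced too).
my $punct-and-symbols = $input.subst(/<:punct +:symbol -[."]>/, ' ', :g);

# Brackets make the leading and trailing spaces visible.
say '[' ~ $punct-only ~ ']';
say '[' ~ $punct-and-symbols ~ ']';
```

The first call should keep +$ in front of 12 and the second should blank them out, matching the comments shown next to the regex fragments.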

To recap, you can combine any number of:

  • Unicode properties like :punct that start with a : and correspond to some character property specified by Unicode; or

  • [...] character classes that enumerate specific characters, backslash character classes (like \d ), or character ranges (eg a..z ).


If an overall <...> assertion is to be a character class arithmetic expression then the first character after the opening < must be one of four characters:

  • : introducing a Unicode property (eg <:punct ...> );

  • [ introducing a [...] character class (eg <[abc] ...> );

  • + or - , optionally followed by spaces, and then either a Unicode property ( :foo ) or a [...] character class (eg <+ :punct ...> ).

Thereafter each additional property or character class in the same overall character class arithmetic expression must be preceded by a + or - with or without additional spaces (eg <:punct - [."] ...> ).
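
The spelling rules above can be checked with a short sketch. It assumes, as those rules state, that the leading + is optional and that spaces after + or - are insignificant, so all four spellings should produce the same substitution. The names @spellings and @results are placeholders for this sketch.

```raku
my $input = ';# -+$12,678,93.45 "foo" *&';

# Four spellings of "punctuation minus . and quote" that the rules above allow.
my @spellings =
    /<:punct -[."]>/,       # implicit leading +
    /<+:punct -[."]>/,      # explicit leading +
    /<+ :punct - [."]>/,    # spaces after each sign
    /<:punct - [."]>/;      # space after the - only

# Apply each regex to the same input and collect the results.
my @results = @spellings.map({ $input.subst($_, ' ', :g) });

say @results.unique.elems == 1 ?? 'all spellings agree' !! 'spellings differ';
```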


You can group sub-expressions in parentheses.


I'm not sure what the precise semantics of + and - are. I note this surprising result:

say $a.subst(/<-[."] +:punct>/, " ", :g); # substitutes ALL characters!?! 
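
One reading that is consistent with this result, offered as an assumption rather than as documented semantics: the terms are applied left to right as set operations. Under that reading, -[."] starts from every character except . and " , and the later +:punct then adds all punctuation back, including . and " , so every character matches. The sketch below contrasts the two orders; the variable names are placeholders.

```raku
my $input = ';# -+$12,678,93.45 "foo" *&';

# Subtract first, then add: the later +:punct brings . and " back in.
my $subtract-first = $input.subst(/<-[."] +:punct>/, ' ', :g);

# Add first, then subtract: . and " stay excluded.
my $add-first = $input.subst(/<:punct -[."]>/, ' ', :g);

# Was every character replaced by a space?
say $subtract-first eq ' ' x $input.chars;

# Did the decimal point survive?
say $add-first.contains('.');
```

If the left-to-right reading holds, putting the subtraction last is what keeps . and " out of the final set.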

Built-ins of the form <...> are not accepted in character class arithmetic expressions.

This is true even if they're called "character classes" in the doc. That includes ones that are nothing like a character class (eg <ident> is called a character class in the doc even though it matches a multi-character string that fits a particular pattern) as well as ones that look like genuine character classes, such as <punct> or <digit> . (Many of the latter correspond directly to Unicode properties, so you can just use those instead.)


To use a backslash "character class" like \d in a character class arithmetic expression, you must list it within a [...] character class.
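
As an illustration of that rule, this sketch wraps \d in [...] so it can take part in the arithmetic. Replacing digits is not something the question asks for; it is used here only to show the syntax, and $no-digits is a placeholder name.

```raku
my $input = ';# -+$12,678,93.45 "foo" *&';

# \d is wrapped in [...] so it can be added to :punct;
# . and " are then subtracted as before.
my $no-digits = $input.subst(/<:punct +[\d] -[."]>/, ' ', :g);

# Brackets make the leading and trailing spaces visible.
say '[' ~ $no-digits ~ ']';
```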

Combining assertions

While <punct> can't be combined with other assertions using character class arithmetic, it can be combined with other regex constructs using the & regex conjunction operator:

<punct> & <-[."]>           #    +$12 678 93.45 "foo"

Depending on the state of compiler optimization (and as of 2019 there's been almost no effort applied to the regex engine), this will be slower in general than using real character classes.
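
A small sketch comparing the two approaches on the question's sample string. It only checks that they agree on this one input and says nothing about speed; the variable names are placeholders.

```raku
my $input = ';# -+$12,678,93.45 "foo" *&';

# Conjunction: the character must match <punct> AND must not be . or "
my $via-conjunction = $input.subst(/<punct> & <-[."]>/, ' ', :g);

# Character class arithmetic on the Unicode property.
my $via-arithmetic = $input.subst(/<:punct -[."]>/, ' ', :g);

say $via-conjunction eq $via-arithmetic
    ?? 'both approaches agree on this input'
    !! 'the approaches differ on this input';
```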
