DotAll and multiline RegEx

I'm having a little trouble using regex in PowerShell. It seems like there is an implementation error or something.

The text I want to work with is an HTML file, which looks like this (Example1):

<span>[Mobile: %mobile% |] Phone: %telephone% [| Fax: %faxNumber%]</span>
<Span>

The problem is that, because of HTML editors, I may also get something like this (Example2):

<span>[Mobile: 

%mobile% |] Phone: %telephone% [| Fax: &nbsp;&nbsp;%faxNumber%]</span>

So as you can see, we get line breaks and HTML-escaped non-breaking spaces (&nbsp;).

My PowerShell regex looks like this:

$x = $x -ireplace '(?ms)\[(.?){7}Fax(.*?)\]', 'MyReplacement1'

and this

$x = $x -ireplace '(?ms)\[(.?){7}Mobile(.*?)\]', 'MyReplacement2'

Basically, [ marks the beginning of a variable and ] the end of it. Two problems arise from this:

  1. Since we have two variables, mobile and fax, I'm using (.?){7} to allow SOME characters (here at most 7) and to avoid matching the whole part between the first [ near Mobile and the last ] near Fax (which would happen if I used (.*?) instead of (.?){7}). I'm not sure whether there is an alternative that allows ANY number of characters (not just 7) between the opening [ and the variable keyword, "Fax" for example. This would be useful to avoid mismatches when something like &nbsp;&nbsp; gets added (where 7 characters would not be enough and, as I said, (.*?) would fail). I hope I was able to explain it (it's kind of hard); if not, please feel free to ask!
  2. PowerShell's -replace operator doesn't offer a way to set regex options, so I have to use (?ms) to set DotAll and multiline modes. As you can see, I'm using it within my regex pattern. However, when a newline is added, as in Example2 between the words Mobile: and %mobile%, the regex fails and nothing gets replaced!

I'm grateful for any help, and even for regex recommendations from the pros to avoid any further problems I'm not thinking about right now...

EDIT: (Example3):

<span>[Mobile: 

%mobile% |] Phone: %telephone% [| Fax: 
%faxNumber%]</span>

The trick around DotAll mode is to use [\s\S] instead of . (the dot). This character class matches any character, because it matches both whitespace and non-whitespace characters. ([\w\W] or [\d\D] work just as well, but [\s\S] seems to be something of a convention.)
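As a minimal sketch of the difference, the snippet below compares a plain dot, [\s\S], and a dot under the inline (?s) option. The sample string and the variable names are made up for illustration and are not taken from the question's file.

```powershell
# Hypothetical sample text with a line break inside the brackets.
$text = "[Mobile:`n%mobile% |]"

# Without DotAll, '.' does not match the newline, so the pattern fails.
$dotOnly = $text -match '\[.*?\]'

# [\s\S] matches any character, including the newline.
$anyChar = $text -match '\[[\s\S]*?\]'

# (?s) switches on DotAll (RegexOptions.Singleline in .NET), so '.' matches the newline too.
$dotAll = $text -match '(?s)\[.*?\]'

"$dotOnly $anyChar $dotAll"   # False True True
```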

To get around the fixed count of 7, you can simply disallow a closing ] before the one you actually want to match. That, by the way, also makes DotAll unnecessary, because a negated character class such as [^\]] matches line breaks as well. So something like this should work fine for you:

\[([^\]:]*)Fax([^\]]*)\]

It looks a bit ugly, but it simply means this:

\[        # literal [
(         # capturing group 1
  [^\]:]* # match as many non-:, non-] characters as possible
)         # end of group 1
Fax       # literal Fax
(         # capturing group 2
  [^\]]*  # match as many non-] characters as possible
)         # end of group 2
\]        # literal ]
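The commented form can be run almost as written if the pattern is put into free-spacing mode with the inline (?x) option. The following is only a sketch: the here-string, the $pattern and $result names, and the replacement text are illustrative choices, and the input is the single-line sample from the question.

```powershell
# Sample input (Example1 with the extra &nbsp; entities).
$x = "<span>[Mobile: %mobile% |] Phone: %telephone% [| Fax: &nbsp;&nbsp;%faxNumber%]</span>"

# Single-quoted here-string: backslashes reach the regex engine unchanged.
$pattern = @'
(?x)          # free-spacing mode: whitespace and comments in the pattern are ignored
\[            # literal [
( [^\]:]* )   # group 1: as many non-:, non-] characters as possible
Fax           # literal Fax
( [^\]]* )    # group 2: as many non-] characters as possible
\]            # literal ]
'@

$result = $x -ireplace $pattern, 'MyReplacement2'
$result
```

Because the first class also excludes the colon, the match cannot start at the [ in front of Mobile and run on to Fax, so only the Fax block is replaced.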

Further reading on character classes.

Note that none of these patterns need multiline mode m (neither yours nor mine), because all it does is make ^ and $ match line beginnings and endings, respectively. But none of the patterns contain these meta-characters. So the modifier does not do anything.
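A small sketch of what the m modifier does and does not change; the two-line sample string and the variable names are hypothetical.

```powershell
# Hypothetical two-line sample.
$text = "first`nsecond"

# Without (?m), ^ matches only at the very start of the string.
$plainCount = [regex]::Matches($text, '^\w+').Count      # 1

# With (?m), ^ also matches right after each line break.
$multiCount = [regex]::Matches($text, '(?m)^\w+').Count  # 2

# A pattern without ^ or $ gives the same result either way.
$withoutM = $text -replace 'second', 'X'
$withM    = $text -replace '(?m)second', 'X'
$withoutM -eq $withM                                      # True
```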

My console output:

PS> $x = "<span>[Mobile: %mobile% |] Phone: %telephone% [| Fax: &nbsp;&nbsp;%faxNumber%]</span>"
PS> $x -ireplace '\[([^\]:]*)Mobile([^\]]*)\]', 'MyReplacement1'
<span>MyReplacement1 Phone: %telephone% [| Fax: &nbsp;&nbsp;%faxNumber%]</span>
PS> $x -ireplace '\[([^\]:]*)Fax([^\]]*)\]', 'MyReplacement2'
<span>[Mobile: %mobile% |] Phone: %telephone% MyReplacement2</span>
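The same two replacements can be tried on input with line breaks, as in Example3. This sketch assumes that $x holds the whole text as one string (for a file, for instance, one read with Get-Content -Raw). If $x were an array of lines, -ireplace would be applied to each line separately and no match could span a line break.

```powershell
# Example3 as a single string; `n is PowerShell's escape for a newline.
$x = "<span>[Mobile: `n`n%mobile% |] Phone: %telephone% [| Fax: `n%faxNumber%]</span>"

# The negated character classes match newlines too, so no (?s) is needed.
$x = $x -ireplace '\[([^\]:]*)Mobile([^\]]*)\]', 'MyReplacement1'
$x = $x -ireplace '\[([^\]:]*)Fax([^\]]*)\]', 'MyReplacement2'

$x   # <span>MyReplacement1 Phone: %telephone% MyReplacement2</span>
```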
