
Perl multiline regex on Windows

I'm stuck with this scenario, I have this regex

*Input added here for clarity:

181221533;MG;3;1476729;<vars>  <vint>    <name>mtest</name> <storedPrecedure>f_sc_mtest</SP>    <base>M_data</base>    <dataType>I</dataType>    <timeMS>17</timeMS>    <ttidr>abc</ttidr>  <base>S</base>    <valor>0</valor>  </vint>  </vars>;889;6;85;112;01/01/2019;29/05/2019 17:17:48
182652972;MG;6314429;740484;<vars>  <vint>    <name>mtest</name>    <sP>f_sc_mtest</sP> <base>sscy</base>    <dataType>I</dataType>    <timeMS>16</timeMS>    <ttidr>abc</Idtype>    <base>S</base>    <valor>4</valor>  </vint></vars>;-1;8;57217;57228;01/01/2019;06/06/2019 22:20:48
182652984;ModeloSP;6314429;740484;<vars>  <vint>     <name>tc_p_act</name>    <sP>rndom_name</sP>    <base>sscyo</base>    <dataType>I</dataType>    <timeMS>0</timeMS>    <Idtype>XYZ</Idtype>    <base>O</base>  </vint>
</vars>;0;;0;41;01/01/2019;06/06/2019 22:31:22

182652988;ModeloSP;6314429;740484;<vars>  <vint>     <name>tc_p_act</name>    <sP>rndom_name</sP>    <base>sscyo</base>    <dataType>I</dataType>    <timeProcess>1</timeProcess>    <Idtype>XYZ</Idtype>    <base>O</base>  </vint>
</vars>;0;;0;85;01/01/2019;06/06/2019 22:37:36

I want to implement this regex in Perl with multiline support because, as you can see in the sample, some records contain line breaks. The regex searches for 'incomplete' lines (and the extra blank line) and fixes them: each record should occupy exactly one line and end with a datetime.

This is what I'm attempting with Perl:

perl.exe -0777 -i -pe "s/(?m)^(.*)(>)([\n]+)(<)(.*)([\n]+)(\s*)$/$1$2    $4$5/igs" "sample.txt"

It doesn't seem to work: I keep getting the same text file back. I'm using the Perl (v5.34.0) that ships inside a portable Git installation.

Is there something I'm missing?

Edit: this is how the output should look:

181221533;MG;3;1476729;<vars>  <vint>    <name>mtest</name> <storedPrecedure>f_sc_mtest</SP>    <base>M_data</base>    <dataType>I</dataType>    <timeMS>17</timeMS>    <ttidr>abc</ttidr>  <base>S</base>    <valor>0</valor>  </vint>  </vars>;889;6;85;112;01/01/2019;29/05/2019 17:17:48
182652972;MG;6314429;740484;<vars>  <vint>    <name>mtest</name>    <sP>f_sc_mtest</sP> <base>sscy</base>    <dataType>I</dataType>    <timeMS>16</timeMS>    <ttidr>abc</Idtype>    <base>S</base>    <valor>4</valor>  </vint></vars>;-1;8;57217;57228;01/01/2019;06/06/2019 22:20:48
182652984;ModeloSP;6314429;740484;<vars>  <vint>     <name>tc_p_act</name>    <sP>rndom_name</sP>    <base>sscyo</base>    <dataType>I</dataType>    <timeMS>0</timeMS>    <Idtype>XYZ</Idtype>    <base>O</base>  </vint>    </vars>;0;;0;41;01/01/2019;06/06/2019 22:31:22
182652988;ModeloSP;6314429;740484;<vars>  <vint>     <name>tc_p_act</name>    <sP>rndom_name</sP>    <base>sscyo</base>    <dataType>I</dataType>    <timeProcess>1</timeProcess>    <Idtype>XYZ</Idtype>    <base>O</base>  </vint>    </vars>;0;;0;85;01/01/2019;06/06/2019 22:37:36

This seems to produce the wanted output:

perl.exe -0777 -pe "s: *\n(?=</):    :g;s/\n+/\n/g"
  • The first substitution replaces any trailing spaces plus a newline, when that newline is followed by </ , with four spaces.
  • The second substitution replaces each run of newlines with a single one. You can also use a transliteration instead: tr/\n//s , where /s "squeezes" the repeated newlines.
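
As a rough sketch of the two steps above, the following standalone script applies them to a shortened, made-up sample and checks that tr/\n//s gives the same result as the second substitution. The sample string and the fix_records name are assumptions for illustration only; they do not come from the question.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the two substitutions from the one-liner.
sub fix_records {
    my ($text) = @_;
    $text =~ s: *\n(?=</):    :g;   # spaces + newline before "</" -> four spaces
    $text =~ s/\n+/\n/g;            # collapse each run of newlines into one
    return $text;
}

# Assumed sample: one record broken before </vars>, plus an extra blank line.
my $sample = "1;<vars><vint>a</vint>\n</vars>;0;22:31:22\n\n"
           . "2;<vars></vars>;0;22:37:36\n";

my $fixed = fix_records($sample);
print $fixed;

# Same first step, then tr///s instead of the second substitution.
(my $squeezed = $sample) =~ s: *\n(?=</):    :g;
$squeezed =~ tr/\n//s;
print "tr///s gives the same result\n" if $squeezed eq $fixed;
```

Both versions assume the file uses plain \n line endings; input with \r\n endings would need the pattern adjusted.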

Capture each whole record and replace all newlines in it with a space, using another regex inside the replacement part (courtesy of the /e modifier). Then run a second regex to replace each run of multiple newlines with a single one:

perl.exe -0777 -wpe'
    s{ (?:^|\R)\K (\d{9}; .*? \s+\d\d:\d\d:\d\d) }{$1 =~ s/\n+/ /r}segx; s{\n+}{\n}g
' file.txt

I take a "record" to be: nine digits and a semicolon ( [0-9]{9}; ) at the beginning of a line or of the file, then everything up to and including a timestamp (hh:mm:ss) preceded by whitespace. Pinning down the beginning and end of a record this precisely should protect against accidentally matching unexpected patterns inside those tags.

This is cumbersome, but I hope it captures each record correctly even if some details change.
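
The record definition above can be sketched as a commented /x pattern in a small script. This is only an illustration of the same idea, run on a shortened, made-up sample; the join_records name and the sample data are assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: join each record onto one line, then squeeze blank lines.
sub join_records {
    my ($text) = @_;
    $text =~ s{
        (?:^|\R)\K              # start of file, or just after a line break (kept)
        (                       # capture group 1: one whole record
            \d{9};              # nine digits and a semicolon
            .*?                 # anything, newlines included (/s), as little as possible
            \s+\d\d:\d\d:\d\d   # whitespace, then the hh:mm:ss that closes the record
        )
    }{ $1 =~ s/\n+/ /r }segx;   # /e runs the replacement as code; /r returns the copy
    $text =~ s/\n+/\n/g;        # collapse leftover runs of newlines
    return $text;
}

# Assumed sample: first record broken before </vars>, followed by a blank line.
my $sample = "182652984;ModeloSP;<vars><vint>x</vint>\n</vars>;0;06/06/2019 22:31:22\n\n"
           . "182652988;ModeloSP;<vars></vars>;0;06/06/2019 22:37:36\n";

print join_records($sample);
```

Note that this variant joins the pieces with a single space, as the one-liner does, rather than the four spaces shown in the question's expected output. The /r modifier needs Perl 5.14 or later.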


As it stands, the above apparently fails on Windows, while it is confirmed to work on Linux (the only system I can try it on right now).

The most likely culprit is the line endings, so try replacing \n in the patterns with \R or, more specifically, with \r\n , in particular in the regex embedded in the replacement part. Or, to be safe and perhaps portable, use (\r?\n) in place of \n , which makes the carriage return optional.
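
To make that suggestion concrete, here is a minimal sketch that accepts both \n and \r\n line endings by using \R inside the record and \r?\n when squeezing blank lines. The helper name and the \r\n sample are assumptions, and whether the Perl bundled with Git really leaves \r in the data was not verified here.

```perl
use strict;
use warnings;

# Hypothetical helper: same approach, but tolerant of CRLF line endings.
sub join_records_any_eol {
    my ($text) = @_;
    # \R matches "\r\n" as one unit as well as a bare "\n".
    $text =~ s{ (?:^|\R)\K (\d{9}; .*? \s+\d\d:\d\d:\d\d) }{ $1 =~ s/\R+/ /r }segx;
    # Keep the first line break of a run (in its original style), drop the repeats.
    $text =~ s/(\r?\n)(?:\r?\n)+/$1/g;
    return $text;
}

# Assumed sample with Windows line endings.
my $crlf = "182652984;<vint>x</vint>\r\n</vars>;0;06/06/2019 22:31:22\r\n\r\n"
         . "182652988;<vars></vars>;0;06/06/2019 22:37:36\r\n";

# Print with the line endings made visible instead of emitting raw \r.
(my $visible = join_records_any_eol($crlf)) =~ s/\r/\\r/g;
$visible =~ s/\n/\\n\n/g;
print $visible;
```

The same function leaves files with plain \n endings working as before, so it can be tried on either system.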

If the problem is just newlines in the wrong place (several in a row, or one directly before a < ), you may get away with something as simple as this:

use strict;
use warnings;

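# Slurp the whole DATA section into one string.
my $str = do { local $/; <DATA> };

# Remove every newline that is followed by "<" or by another newline.
$str =~ s/\n(?=[<\n])//g;
print $str;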

__DATA__
181221533;<valor>0</valor></vars>;889;6;85;112;01/01/2019;29/05/2019 17:17:48
182652972;</vars>;-1;8;57217;57228;01/01/2019;06/06/2019 22:20:48
182652984;</vint>
</vars>;0;;0;41;01/01/2019;06/06/2019 22:31:22

182652988; </vint>
</vars>;0;;0;85;01/01/2019;06/06/2019 22:37:36

(I shortened the input to make it readable)

Output:

181221533;<valor>0</valor></vars>;889;6;85;112;01/01/2019;29/05/2019 17:17:48
182652972;</vars>;-1;8;57217;57228;01/01/2019;06/06/2019 22:20:48
182652984;</vint></vars>;0;;0;41;01/01/2019;06/06/2019 22:31:22
182652988; </vint></vars>;0;;0;85;01/01/2019;06/06/2019 22:37:36
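
If the file turns out to have \r\n line endings, the same lookahead idea can be adapted. The following is a sketch under that assumption; the drop_stray_breaks name and the sample string are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: drop a line break (LF or CRLF) that is followed by
# "<" or by the start of another line break.
sub drop_stray_breaks {
    my ($text) = @_;
    $text =~ s/\r?\n(?=[<\r\n])//g;
    return $text;
}

# Assumed sample with Windows line endings.
my $crlf = "182652984;</vint>\r\n</vars>;0;22:31:22\r\n\r\n182652988;</vars>;0;22:37:36\r\n";

# Show the result with line endings made visible.
(my $visible = drop_stray_breaks($crlf)) =~ s/\r/\\r/g;
$visible =~ s/\n/\\n\n/g;
print $visible;
```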

