Perl command line: multiline regex substitution

I'm trying to substitute a multiline block using the Perl command line. The text is the following:

@LNCaP.2622 GAPC:1:1:4519:1350 length=76
TTTCCATTGCAGGTTTTAAAGTGGAGATTCTGAAGGGGAAAATAGGCACTGTCAGAACAAAGCTACCTGGAAACAG
+LNCaP.2622 GAPC:1:1:4519:1350 length=76
DD@:BBBBDDD@D:B::=:6:(6//;589444004':839>>2;;:':>>:7B:><B<B#################
@LNCaP.2623 GAPC:1:1:4767:1343 length=76

+LNCaP.2623 GAPC:1:1:4767:1343 length=76

@LNCaP.2624 GAPC:1:1:4794:1349 length=76

and I tried to run the following regex:

perl -pe "s/^@.*\n\s*\n+//mg" test.txt

hoping to get the following output:

@LNCaP.2622 GAPC:1:1:4519:1350 length=76
TTTCCATTGCAGGTTTTAAAGTGGAGATTCTGAAGGGGAAAATAGGCACTGTCAGAACAAAGCTACCTGGAAACAG
+LNCaP.2622 GAPC:1:1:4519:1350 length=76
DD@:BBBBDDD@D:B::=:6:(6//;589444004':839>>2;;:':>>:7B:><B<B#################
@LNCaP.2624 GAPC:1:1:4794:1349 length=76

The regex ^@.*\n\s*\n\+.*\n\s*\n matches the 4 lines I want to delete on regex101.com using the text above. However, when I run the command from my shell, the output is unchanged :(

I can't use line numbers, since this is an extract from a much bigger file. The substitution has to apply to every 4-line block that matches this pattern.

Any idea what I am doing wrong?

Thanks

perl -pe processes the input line by line, so a regex that spans several lines will never match by default.

You can undefine the input record separator $/, though, so that perl slurps the entire file and applies the regex to it as a single string:

perl -pe "BEGIN { undef $/ } s/^@.*\n\s*\n+//mg" test.txt

The regex you suggested above doesn't provide the output you want though. To do that, you'd need the following expression:

perl -pe "BEGIN {undef $/} s/^@.*\n\s*\n(?:(?!\@).*\n)*//mg" test.txt

Outputs:

@LNCaP.2622 GAPC:1:1:4519:1350 length=76
TTTCCATTGCAGGTTTTAAAGTGGAGATTCTGAAGGGGAAAATAGGCACTGTCAGAACAAAGCTACCTGGAAACAG
+LNCaP.2622 GAPC:1:1:4519:1350 length=76
DD@:BBBBDDD@D:B::=:6:(6//;589444004':839>>2;;:':>>:7B:><B<B#################
@LNCaP.2624 GAPC:1:1:4794:1349 length=76

Miller is right in his answer: you have to read the whole content of the file into a variable and apply the regular expression to it. Try the following code, where I read the content in slurp mode and use a negated character class [^\n]* to match each line and \n{2,} to match its line ending together with the blank line(s) that follow:

#!/usr/bin/env perl

use strict;
use warnings;

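# Slurp all of DATA; local restores $/ when the do block ends.
my $text = do { local $/; <DATA> };
# Remove a header line and a "+" line that are each followed by blank lines.
$text =~ s/^@(?:[^\n]*\n{2,}){2}//mg;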
print $text;


__DATA__
@LNCaP.2622 GAPC:1:1:4519:1350 length=76
TTTCCATTGCAGGTTTTAAAGTGGAGATTCTGAAGGGGAAAATAGGCACTGTCAGAACAAAGCTACCTGGAAACAG
+LNCaP.2622 GAPC:1:1:4519:1350 length=76
DD@:BBBBDDD@D:B::=:6:(6//;589444004':839>>2;;:':>>:7B:><B<B#################
@LNCaP.2623 GAPC:1:1:4767:1343 length=76

+LNCaP.2623 GAPC:1:1:4767:1343 length=76

@LNCaP.2624 GAPC:1:1:4794:1349 length=76

Run it like:

perl script.pl

That yields:

@LNCaP.2622 GAPC:1:1:4519:1350 length=76
TTTCCATTGCAGGTTTTAAAGTGGAGATTCTGAAGGGGAAAATAGGCACTGTCAGAACAAAGCTACCTGGAAACAG
+LNCaP.2622 GAPC:1:1:4519:1350 length=76
DD@:BBBBDDD@D:B::=:6:(6//;589444004':839>>2;;:':>>:7B:><B<B#################
@LNCaP.2624 GAPC:1:1:4794:1349 length=76
