Using sed/awk, I need to remove all lines in a file from the first occurrence of pattern1 up-to (but not including) the last occurrence of pattern2

Consider the following text:

    <entity name="good">
    </entity>
    <entity name="bad">
    stuff to delete
    </entity>
    <entity name="bad">
    stuff to remove
    </entity>
    <entity name="bad2">
    </entity>
    <entity name="deleteMe2">
    </entity>
    <entity name="bad2">
    </entity>
    <entity name="good">
    </entity>

I would like the following outcome:

    <entity name="good">
    </entity>
    <entity name="bad2">
    </entity>
    <entity name="good">
    </entity>

I know how to do a range in sed, but I can't figure out how to match the last occurrence of 'bad2' without including it in the delete. The command below will of course not work, because it ends the range at the first 'bad2' and does not remove 'deleteMe2' or the second occurrence of 'bad2'.

sed -i '/<entity name="bad"/,/<entity name="bad2"/d' file.xml

There can be hundreds of 'bad'/'deleteMe2'/'bad2' lines in the file I am dealing with, so a simple line count won't work. Multiple commands are fine (it does not have to be a single one), but the more efficient the better, because the file being modified can be quite large. The -i is there because I want to delete the lines in place.

NOTE: I am more familiar with SED than I am with AWK, but I am open to all the help I can get:)

This looks like XML to me, so I would strongly suggest that regex isn't the tool for the job. Use a parser instead:

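#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;

# Parse the whole document; this requires well-formed XML with a single root element.
my $twig = XML::Twig->new->parsefile('your_file.xml');
# Remove every <entity name="bad"> element together with its contents.
$_->delete for $twig->findnodes('//entity[@name="bad"]');
# Print the remaining document to STDOUT, indented.
$twig->set_pretty_print('indented_a');
$twig->print;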

Or perhaps more comprehensively:

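# Walk every <entity> and delete the ones whose name is on the unwanted list.
for my $entity ( $twig->findnodes('//entity') ) {
    my $name = $entity->att('name') // '';    # tolerate entities without a name
    if ( $name eq 'bad' or $name eq 'deleteMe2' ) {
        $entity->delete;
    }
}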

To delete only the first instance of 'bad2' you can just call findnodes once, and delete the first 'hit'.

$ cat tst.awk
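# First pass (NR==FNR): remember the first "bad" line and the last "bad2" line.
NR==FNR {
    if (/"bad"/ && !begFnr) {
        begFnr = FNR
    }
    if (/"bad2"/) {
        endFnr = FNR    # overwritten on every match, so it ends up as the last one
    }
    next
}
# Second pass: print lines before the range, or from the last "bad2" onward.
(FNR < begFnr) || (FNR >= endFnr)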

$ awk -f tst.awk file file
<entity name="good">
</entity>
<entity name="bad2">
</entity>
<entity name="good">
</entity>
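
The question asks for an in-place edit, and awk has no portable in-place option. The following is a minimal sketch of one way to wrap the two-pass script above, assuming a POSIX shell with mktemp available. The function name strip_range, the shorter variable names, and the guard that leaves the content untouched when no complete range is found are additions for illustration, not part of the original answer.

```shell
#!/bin/sh
# strip_range FILE   (hypothetical helper name)
# Deletes the lines from the first "bad" entity up to, but not including,
# the last "bad2" entity, then writes the result back to FILE.
strip_range() {
    file=$1
    tmp=$(mktemp) || return 1

    # Pass 1 (NR == FNR): record the first "bad" line and the last "bad2" line.
    # Pass 2: print only the lines outside [first, last).
    # Assumed guard: with no "bad" line, or no "bad2" after it, print everything.
    awk '
        NR == FNR {
            if (/"bad"/ && !first) first = FNR
            if (/"bad2"/) last = FNR
            next
        }
        !(first && last > first) || FNR < first || FNR >= last
    ' "$file" "$file" > "$tmp" || { rm -f "$tmp"; return 1; }

    # Copy back over the original so its inode and permissions are kept.
    cat "$tmp" > "$file" && rm -f "$tmp"
}
```

It would be called as `strip_range file.xml`. Reading the file twice keeps memory use constant however large the file is, at the cost of a second read and a temporary copy.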

awk to the rescue!

$ awk 'NR==FNR && /"bad"/ && !s {s=NR; next}
       NR==FNR && /"bad2"/      {e=NR; next}
       NR!=FNR && (FNR<s || FNR>=e)' xml{,}

    <entity name="good">
    </entity>
    <entity name="bad2">
    </entity>
    <entity name="good">
    </entity>

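I guess this can be simplified further. It is a two-pass script: the first pass records the line numbers, and the second pass prints the lines outside that range. The xml{,} argument is bash brace expansion for xml xml, so the same file is read twice.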

This might work for you (GNU sed):

 sed '/bad/,$!b;/bad2/h;//!H;$!d;g;/bad2/!d' file

Lines before the first line matching bad are printed as normal. From that line to the end of the file, each line is stored in the hold space: a line matching bad2 overwrites what is stored there, and any other line is appended to it. Every line except the last is deleted. On the last line the pattern space is replaced with the contents of the hold space, which is then deleted unless it contains bad2.
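
The same command can be written out one step per line, which makes the hold-space logic easier to follow. This is only a sketch, assuming GNU sed (for -i and for comments inside an indented script). The function name strip_range_sed is hypothetical, and the patterns are tightened to "bad" and "bad2" including the quotes, an adjustment to the original so that unrelated lines containing the text bad do not start the range.

```shell
#!/bin/sh
# strip_range_sed FILE   (hypothetical helper name; assumes GNU sed)
strip_range_sed() {
    sed -i '
        # Lines before the first "bad" entity: branch to the end and print as-is.
        /"bad"/,$!b
        # A "bad2" line overwrites the hold space, discarding what was held.
        /"bad2"/h
        # Any other line (the empty regex reuses "bad2") is appended to the hold space.
        //!H
        # Suppress output for every line except the last.
        $!d
        # On the last line, replace the pattern space with the held tail.
        g
        # Drop the tail if no "bad2" line was ever seen; otherwise it is printed.
        /"bad2"/!d
    ' "$1"
}
```

It would be called as `strip_range_sed file.xml`. This reads the file once, but the hold space accumulates every line since the most recent "bad2" (or since the first "bad"), so memory use grows with that distance. If a "bad" line exists but no "bad2" line follows it, everything from the first "bad" onward is deleted.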
