How to remove repeated lines from file with awk/sed?

I'm postprocessing a very large file which contains many frames. Occasionally there is an empty frame. I would like to remove these. For example,

file.txt

TIMESTEP
101
NUMBER OF ATOMS
3
ATOMS x y z
O 1 2 3
H 2 1 3
C 1 1 2
TIMESTEP
102
NUMBER OF ATOMS
3
ATOMS x y z
TIMESTEP
103
NUMBER OF ATOMS
3
ATOMS x y z
O -1 2 3
H  1 2 3
C  0 1 1
...

I would like to obtain

file.txt

TIMESTEP
101
NUMBER OF ATOMS
3
ATOMS x y z
O 1 2 3
H 2 1 3
C 1 1 2
TIMESTEP
103
NUMBER OF ATOMS
3
ATOMS x y z
O -1 2 3
H  1 2 3
C  0 1 1
...

I've tried

sed '/3.*/{:a;N;N;N;N;/.*NUMBER OF ATOMS$/d;ba}' file.txt

but that also removes valid frames, which is not what I want. Any pointers and advice are highly appreciated!

This might work for you (GNU sed):

sed -n '/TIMESTEP/!{H;$!d};x;s/\n/&/5p' file

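Gather each frame (record) in the hold space and print it only if it is 6 or more lines long. The s/\n/&/5p substitution succeeds, and therefore prints, only when the gathered frame contains a fifth newline.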
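The same gather-then-decide idea can be sketched in portable awk, which may be easier to adapt than the hold-space version. This is only a sketch: the function name drop_short_frames is made up for illustration, and it assumes every frame starts with a line beginning with TIMESTEP and has a 5-line header, as in the sample.

```shell
# Hypothetical helper (name chosen for illustration): keep only frames
# that have more than the 5 header lines.
drop_short_frames() {
    awk '
        # Print the gathered frame if it has atom lines, then reset the buffer.
        function emit() {
            if (n > 5) printf "%s", frame
            frame = ""
            n = 0
        }
        /^TIMESTEP/ { emit() }            # new frame starts: decide on the previous one
        { frame = frame $0 "\n"; n++ }    # gather the current line
        END { emit() }                    # decide on the last frame
    ' "$@"
}
```

Usage would be something like: drop_short_frames file.txt > cleaned.txt. Only one frame is held in memory at a time, so the size of the whole file should not matter.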

This GNU awk command may do:

awk -v RS=TIMESTEP  'NF>15 {print RS$0}' file
TIMESTEP
101
NUMBER OF ATOMS
3
ATOMS x y z
O 1 2 3
H 2 1 3
C 1 1 2

TIMESTEP
103
NUMBER OF ATOMS
3
ATOMS x y z
O -1 2 3
H  1 2 3
C  0 1 1
...

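Setting the record separator (RS) to TIMESTEP makes awk work in block mode, where each record is the text that follows one TIMESTEP. The condition then counts the whitespace-separated fields in the block, and the block is printed if the count is above the threshold. The threshold may need adjusting: an empty frame has exactly 9 fields, so NF>9 is the lowest value that works, while NF>15 as used here would also drop a frame with a single atom (13 fields). Each record ends with a newline and print adds another, which is why a blank line appears between the frames in the output. A multi-character RS is an extension available in GNU awk; POSIX leaves its behavior unspecified.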
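To pick the threshold, it can help to look at the field count of each block first. The following is a small sketch, not part of the answer above: the function name frame_field_counts is hypothetical, and it relies on the same multi-character RS extension (gawk or mawk).

```shell
# Hypothetical helper (name chosen for illustration): print
# "<timestep> <field count>" for every block.
frame_field_counts() {
    # NF is non-zero for every real block; this skips the empty record
    # that precedes the first TIMESTEP.
    awk -v RS=TIMESTEP 'NF { print $1, NF }' "$@"
}
```

For the three sample frames in the question (leaving out the trailing "..." line), this should report 21 fields for timesteps 101 and 103 and 9 fields for the empty timestep 102.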

With GNU sed that would be just:

sed -z 's/TIMESTEP\n[0-9]*\nNUMBER OF ATOMS\n[0-9]*\nATOMS x y z\nTIMESTEP/TIMESTEP/g' file.txt

Without sed's -z option (which here makes sed read the whole file into memory as a single line), the following seems to work:

sed -n '
  # buffer 6 lines (not 5, so one more than a frame header) in the pattern space
    N;N;N;N;N

    : again

    # if pattern space matches empty frame
        /^TIMESTEP\n[0-9]*\nNUMBER OF ATOMS\n[0-9]*\nATOMS x y z\nTIMESTEP$/{
            # print just the next TIMESTEP
            s/.*/TIMESTEP/
            p
            # start from the top
            d
        }

        # if this is the last line
        ${
            # if last line is an empty frame
            /^[^\n]*\nTIMESTEP\n[0-9]*\nNUMBER OF ATOMS\n[0-9]*\nATOMS x y z$/{
                # print the one extra line buffered before the empty frame
                P
                # and end it
                d
            }

            # print everything still buffered
            p
            d
        }

    # just print and delete one line
        P
        s/^[^\n]*\n//
        # read next line
        N

    b again

' file.txt

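With awk (nothing here is specific to GNU awk; note that it reads the whole file into an array before printing, which may matter for a very large file):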

awk '{a[i++]=$0}END{ for(i=0;i<NR;)if(a[i]=="TIMESTEP" && a[i+5]=="TIMESTEP") {i=i+5;} else {print a[i]; i=i+1;} }' file
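The lookahead test (is the line 5 positions after a TIMESTEP another TIMESTEP?) can also be done while streaming, holding only the 5 header lines at a time. This is a sketch under the same assumption of a 5-line header; the function name drop_empty_frames_stream is made up for illustration. Unlike the array version, it also drops consecutive empty frames and an empty frame at the very end of the file.

```shell
# Hypothetical helper (name chosen for illustration): hold back each 5-line
# header until the next line shows whether the frame has any atom lines.
drop_empty_frames_stream() {
    awk '
        # Print and clear the buffered header lines.
        function release(   k) {
            for (k = 1; k <= n; k++) print buf[k]
            n = 0
        }
        /^TIMESTEP/ {
            # n == 5 means the previous frame was header-only: discard it.
            if (n != 5) release()
            n = 0
            buf[++n] = $0
            next
        }
        n > 0 && n < 5 { buf[++n] = $0; next }   # still collecting the header
        n == 5 { release() }                     # a sixth line arrived: frame has data
        { print }
        END { if (n != 5) release() }            # drop a header-only frame at end of file
    ' "$@"
}
```

Usage would be something like: drop_empty_frames_stream file.txt > cleaned.txt.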
