How to remove repeated lines from file with awk/sed?

Question

I'm postprocessing a very large file which contains many frames. Occasionally there is an empty frame. I would like to remove these. For example,

file.txt

TIMESTEP
101
NUMBER OF ATOMS
3
ATOMS x y z
O 1 2 3
H 2 1 3
C 1 1 2
TIMESTEP
102
NUMBER OF ATOMS
3
ATOMS x y z
TIMESTEP
103
NUMBER OF ATOMS
3
ATOMS x y z
O -1 2 3
H  1 2 3
C  0 1 1
...

I would like to obtain

file.txt

TIMESTEP
101
NUMBER OF ATOMS
3
ATOMS x y z
O 1 2 3
H 2 1 3
C 1 1 2
TIMESTEP
103
NUMBER OF ATOMS
3
ATOMS x y z
O -1 2 3
H  1 2 3
C  0 1 1
...

I've tried

sed '/3.*/{:a;N;N;N;N;/.*NUMBER OF ATOMS$/d;ba}' file.txt

but that also removes valid frames, which is not what I want. Any pointers or advice are highly appreciated!

Answer 1

This might work for you (GNU sed):

sed -n '/TIMESTEP/!{H;$!d};x;s/\n/&/5p' file

Gather each frame (record) in the hold space and print it only if it is six or more lines long, that is, if at least one line follows the five-line header.
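
To make the mechanics concrete, here is a small annotated run of the same command. It is only a sketch: the sample data is copied from the question, the temporary file and the variable names (sample, filtered) are made up for the illustration, and GNU sed is assumed.

```shell
# Hypothetical sample file holding the three frames from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
TIMESTEP
101
NUMBER OF ATOMS
3
ATOMS x y z
O 1 2 3
H 2 1 3
C 1 1 2
TIMESTEP
102
NUMBER OF ATOMS
3
ATOMS x y z
TIMESTEP
103
NUMBER OF ATOMS
3
ATOMS x y z
O -1 2 3
H  1 2 3
C  0 1 1
EOF

# /TIMESTEP/!{H;$!d}  lines that are not TIMESTEP are appended to the hold
#                     space; unless this is the last line, the cycle ends here.
# x                   at a TIMESTEP line (or the last line) the finished frame
#                     is swapped into the pattern space.
# s/\n/&/5p           the frame is printed only if it has a fifth newline,
#                     i.e. six or more lines.
filtered=$(sed -n '/TIMESTEP/!{H;$!d};x;s/\n/&/5p' "$sample")
printf '%s\n' "$filtered"
rm -f "$sample"
```

Frame 102 contains only four newlines, so the substitution fails and nothing is printed for it.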

Answer 2

This GNU awk command may do:

awk -v RS=TIMESTEP  'NF>15 {print RS$0}' file
TIMESTEP
101
NUMBER OF ATOMS
3
ATOMS x y z
O 1 2 3
H 2 1 3
C 1 1 2

TIMESTEP
103
NUMBER OF ATOMS
3
ATOMS x y z
O -1 2 3
H  1 2 3
C  0 1 1
...

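By setting the record separator (RS) to TIMESTEP, awk reads the file block by block, each block being the text between two TIMESTEP markers. It then counts the number of fields in the block (the threshold may need adjusting). If there are more than 15, the block is printed. A threshold of 9 should be OK as a minimum, since an empty frame has exactly 9 fields.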
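
One way to pick the threshold is to print NF for each block first. This is a rough sketch: it assumes the five-line header from the question and an awk that accepts a multi-character RS (gawk and mawk do; POSIX only guarantees a single character). The temporary file and the variable names (sample, counts, kept) are made up for the illustration.

```shell
# Hypothetical sample file holding the three frames from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
TIMESTEP
101
NUMBER OF ATOMS
3
ATOMS x y z
O 1 2 3
H 2 1 3
C 1 1 2
TIMESTEP
102
NUMBER OF ATOMS
3
ATOMS x y z
TIMESTEP
103
NUMBER OF ATOMS
3
ATOMS x y z
O -1 2 3
H  1 2 3
C  0 1 1
EOF

# Field count of every non-empty block: the header alone contributes 9 fields,
# each atom line adds 4 more.
counts=$(awk -v RS=TIMESTEP 'NF { print NF }' "$sample")
printf '%s\n' "$counts"

# With NF > 9 as the test, list the timestep numbers of the frames that survive.
kept=$(awk -v RS=TIMESTEP 'NF > 9 { print $1 }' "$sample")
printf '%s\n' "$kept"
rm -f "$sample"
```

With this header, a frame holding a single atom has 13 fields, so NF>15 would drop it as well; NF>9 keeps every frame that has at least one atom line.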

Answer 3

With GNU sed that would be just:

sed -z 's/TIMESTEP\n[0-9]*\nNUMBER OF ATOMS\n[0-9]*\nATOMS x y z\nTIMESTEP/TIMESTEP/g' file.txt
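
One caveat with the one-liner: each match consumes the TIMESTEP line of the following frame, so when two empty frames follow each other, only the first is removed in a single pass. A possible workaround, sketched here under the assumption of GNU sed, is to repeat the substitution until nothing changes. The sample data and the variable names (sample, empty, once, looped) are made up for the illustration. Note also that with -z a text file without NUL bytes is read as one record, so the whole file is held in memory.

```shell
# Hypothetical sample: frames 102 and 103 are both empty.
sample=$(mktemp)
cat > "$sample" <<'EOF'
TIMESTEP
101
NUMBER OF ATOMS
1
ATOMS x y z
O 1 2 3
TIMESTEP
102
NUMBER OF ATOMS
1
ATOMS x y z
TIMESTEP
103
NUMBER OF ATOMS
1
ATOMS x y z
TIMESTEP
104
NUMBER OF ATOMS
1
ATOMS x y z
H 1 2 3
EOF

# Same pattern as in the one-liner above.
empty='TIMESTEP\n[0-9]*\nNUMBER OF ATOMS\n[0-9]*\nATOMS x y z\nTIMESTEP'

# Single pass with the g flag: the header of frame 103 is left behind.
once=$(sed -z "s/$empty/TIMESTEP/g" "$sample")

# Loop: substitute one match, then branch back to :a while a substitution succeeded.
looped=$(sed -z -e ':a' -e "s/$empty/TIMESTEP/" -e 'ta' "$sample")
printf '%s\n' "$looped"
rm -f "$sample"
```

Neither form removes an empty frame at the very end of the file, because no TIMESTEP line follows it.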

Without the -z option, the following seems to work:

sed -n '
  # buffer 6 lines (not 5, so one extra) into pattern space
    N;N;N;N;N

    : again

    # if pattern space matches empty frame
        /^TIMESTEP\n[0-9]*\nNUMBER OF ATOMS\n[0-9]*\nATOMS x y z\nTIMESTEP$/{
            # print just the next TIMESTEP
            s/.*/TIMESTEP/
            p
            # start from the top
            d
        }

        # if this is the last line
        ${
            # if last line is an empty frame
            /^[^\n]*\nTIMESTEP\n[0-9]*\nNUMBER OF ATOMS\n[0-9]*\nATOMS x y z$/{
                # print the one extra line in the buffer
                P
                # and end it
                d
            }

            # print whatever is left in the buffer
            p
            d
        }

    # just print and delete one line
        P
        s/^[^\n]*\n//
        # read next line
        N

    b again

' file.txt

Answer 4

With GNU awk:

awk '{a[i++]=$0}END{ for(i=0;i<NR;)if(a[i]=="TIMESTEP" && a[i+5]=="TIMESTEP") {i=i+5;} else {print a[i]; i=i+1;} }' file
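
The command above stores every line in an array before printing anything, which may be a concern for a very large file. A streaming variant could look like the sketch below. It is not part of the original answer: it assumes that every frame starts with a line that is exactly TIMESTEP and that the header is always five lines long, as in the question. The function name drop_empty_frames is made up for the illustration.

```shell
# Hypothetical helper: hold back only the five header lines of the current
# frame and release them once the first data line shows up.
drop_empty_frames() {
  awk '
    # A new frame starts: forget any header that was never released.
    $0 == "TIMESTEP" { n = 0; pending = 1 }
    # Collect the five header lines without printing them yet.
    pending && n < 5 { hdr[++n] = $0; next }
    # First data line of the frame: release the held header.
    pending          { for (i = 1; i <= n; i++) print hdr[i]; pending = 0 }
    # Print the data line itself.
                     { print }
  ' "$@"
}

# Small demonstration: frame 2 is empty and is dropped.
drop_empty_frames <<'EOF'
TIMESTEP
1
NUMBER OF ATOMS
1
ATOMS x y z
O 1 2 3
TIMESTEP
2
NUMBER OF ATOMS
1
ATOMS x y z
TIMESTEP
3
NUMBER OF ATOMS
1
ATOMS x y z
H 1 2 3
EOF
```

On a real file this would be called as drop_empty_frames file.txt > cleaned.txt. Because a pending header is discarded whenever the next TIMESTEP arrives, consecutive empty frames and an empty frame at the end of the file are dropped too.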

solution1 2 ACCEPTED 2019-11-01 23:18:27
solution2 1 2019-11-01 22:59:59
solution3 1 2019-11-01 23:04:12
solution4 0 2019-11-01 23:13:23
