How to replace different blocks of text in all combinations using awk?

Question

I'm trying to replace blocks of lines according to this pattern:

A block is formed by a line together with the lines below it that have a higher level number.
When a line contains "=", its block can replace the block named after the "=".

Let's look at an example. This input:

01 hello
    02 stack
    02 overflow
        04 hi
    02 friends = overflow
        03 this
        03 is 
        03 my = is
        03 life
    02 lol
    02 im
    02 joking = im
        03 filler

Would generate the following output (each hello block is one element of an array):

01 hello
    02 stack
    02 overflow
        04 hi
    02 lol
    02 im

01 hello
    02 stack
    02 overflow
        04 hi
    02 lol
    02 joking = im
        03 filler

01 hello
    02 stack
    02 friends = overflow
        03 this
        03 is 
        03 life
    02 lol
    02 im

01 hello
    02 stack
    02 friends = overflow
        03 this
        03 is 
        03 life
    02 lol
    02 joking = im
        03 filler

01 hello
    02 stack
    02 friends = overflow
        03 this
        03 my = is
        03 life
    02 lol
    02 im

01 hello
    02 stack
    02 friends = overflow
        03 this
        03 my = is
        03 life
    02 lol
    02 joking = im
        03 filler

I tried it this way:

#!/bin/bash

awk '{

    if ($0~/=/){
      level=$1
      oc=1
    }else if (oc && $1<=level){
        oc=0
    }

    if (!oc){
        print
    }

}' input.txt

But it only produces the first output that I need, and I don't know how to skip the 03 life line, which is inside the friends block.

How could I generate these outputs?

I wouldn't mind a Python or Perl solution if that is more comfortable for you.

Answer 1

Here is a python script to read the cobol input file and print out all the possible combinations of defined and redefined variables:

#!/usr/bin/python
"""Read cobol file and print all possible redefines."""
import sys
from itertools import product

def readfile(fname):
    """Read cobol file & return a master list of lines and namecount of redefined lines."""
    master = []
    namecount = {}
    with open(fname) as f:
        for line in f:
            line = line.rstrip(' .\t\n')
            if not line:
                continue
            words = line.split()
            n = int(words[0])
            if '=' in words or 'REDEFINES' in words:
                name = words[3]
            else:
                name = words[1]
            master.append((n, name, line))
            namecount[name] = namecount.get(name, 0) + 1
    # py2.7: namecount = {key: val for key, val in namecount.items() if val > 1}
    namecount = dict((key, val) for key, val in namecount.items() if val > 1)

    return master, namecount

def compute(master, skip=None):
    """Return new cobol file given master and skip parameters."""
    if skip is None:
        skip = {}
    seen = {}
    skip_to = None
    output = ''
    for n, name, line in master:
        if skip_to and n > skip_to:
            continue
        seen[name] = seen.get(name, 0) + 1
        if seen[name] != skip.get(name, 1):
            skip_to = n
            continue
        skip_to = None
        output += line + '\n' 
    return output

def find_all(master, namecount):
    """Return list of all possible output files given master and namecount."""
    keys = namecount.keys()
    values = [namecount[k] for k in keys]
    out = []
    for combo in product(*[range(1, v + 1) for v in values]):
        skip = dict(zip(keys, combo))
        new = compute(master, skip=skip)
        if new not in out:
            out.append(new)
    return out

def main(argv):
    """Process command line arguments and print results."""
    fname = argv[-1]
    master, namecount = readfile(fname)
    out = find_all(master, namecount)
    print('\n'.join(out))

if __name__ == '__main__':
    main(sys.argv)

If the above script is saved in a file called cobol.py, then it can be run as:

python cobol.py name_of_input_file

The various possible combinations of defines and redefines will be displayed on stdout.

This script runs under either python2 (2.6+) or python3.

Explanation

The code uses three functions:

readfile reads the input file and returns two variables that summarize the structure of what is in it.
compute takes two parameters and, from them, computes an output block.
find_all determines all the possible output blocks, uses compute to create them, and then returns them as a list.

Let's look at each function in more detail:

readfile

readfile takes the input file name as an argument and returns a list, master, and a dictionary, namecount. For every non-empty line in the input file, the list master has a tuple containing (1) the level number, (2) the name that is defined or redefined, and (3) the original line itself. For the sample input file, readfile returns this value for master:

[(1, 'hello', '01 hello'),
 (2, 'stack', '    02 stack'),
 (2, 'overflow', '    02 overflow'),
 (4, 'hi', '        04 hi'),
 (2, 'overflow', '    02 friends = overflow'),
 (3, 'this', '        03 this'),
 (3, 'is', '        03 is'),
 (3, 'is', '        03 my = is'),
 (3, 'life', '        03 life'),
 (2, 'lol', '    02 lol'),
 (2, 'im', '    02 im'),
 (2, 'im', '    02 joking = im'),
 (3, 'filler', '        03 filler')]

readfile also returns the dictionary namecount which has an entry for every name that gets redefined and has a count of how many definitions/redefinitions there are for that name. For the sample input file, namecount has the value:

{'im': 2, 'is': 2, 'overflow': 2}

This indicates that im, is, and overflow each have two possible values.
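As a rough illustration of the counting step, the sketch below rebuilds namecount from a hand-written copy of the name column of master. Using collections.Counter and the variable names (names, counts) are assumptions made for this example; the script itself uses a plain dictionary.

```python
from collections import Counter

# Names in the order they appear in the sample input (second field of each
# tuple in master). For a redefinition this is the name being redefined.
names = ['hello', 'stack', 'overflow', 'hi', 'overflow', 'this',
         'is', 'is', 'life', 'lol', 'im', 'im', 'filler']

counts = Counter(names)

# Keep only the names that have more than one definition.
namecount = {name: n for name, n in counts.items() if n > 1}
print(sorted(namecount.items()))
```

Names that occur only once are dropped because they offer no choice between definitions.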

readfile was of course designed to work with the input file format in the current version of the question. To the extent possible, it was also designed to work with the formats from the previous versions of this question. For example, variable redefinitions are accepted whether they are signaled with an equal sign (current version) or with the word REDEFINES, as in previous versions. This is intended to make this script as flexible as possible.
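The per-line parsing that readfile performs can be sketched in isolation as follows. The helper name parse_line is hypothetical (the script does this inline), and the REDEFINES sample line is an assumed example of the earlier format, not taken from the question.

```python
def parse_line(line):
    """Return (level, name, stripped_line) for one non-empty input line."""
    # Drop trailing spaces, tabs, newlines and the COBOL-style final period.
    line = line.rstrip(' .\t\n')
    words = line.split()
    level = int(words[0])
    # A redefinition names its target in the fourth field.
    if '=' in words or 'REDEFINES' in words:
        name = words[3]
    else:
        name = words[1]
    return level, name, line

print(parse_line('    02 friends = overflow\n'))
# Assumed sample of the older REDEFINES format:
print(parse_line('    02 FRIENDS REDEFINES OVERFLOW.\n'))
print(parse_line('        04 hi\n'))
```

In both redefinition forms the returned name is the one being redefined, which is what lets the later steps group a definition with its alternatives.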

compute

The function compute is what generates each output block. It uses two parameters. The first is master, which comes directly from readfile. The second is skip, which is derived from the namecount dictionary that was returned by readfile. For example, the namecount dictionary says that there are two possible definitions for im. This shows how compute can be used to generate the output block for each:

In [14]: print compute(master, skip={'im':1, 'is':1, 'overflow':1})
01 hello
    02 stack
    02 overflow
        04 hi
    02 lol
    02 im

In [15]: print compute(master, skip={'im':2, 'is':1, 'overflow':1})
01 hello
    02 stack
    02 overflow
        04 hi
    02 lol
    02 joking = im
        03 filler

Observe that the first call to compute above generated the block that uses the first definition of im and the second call generated the block that uses the second definition.
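The same mechanism handles nested choices. The sketch below is a self-contained copy of the selection logic, renamed compute_lines here (a hypothetical name) and returning a list of lines instead of a string, applied to the sample master list. It picks the second definition of both overflow and is.

```python
master = [
    (1, 'hello', '01 hello'),
    (2, 'stack', '    02 stack'),
    (2, 'overflow', '    02 overflow'),
    (4, 'hi', '        04 hi'),
    (2, 'overflow', '    02 friends = overflow'),
    (3, 'this', '        03 this'),
    (3, 'is', '        03 is'),
    (3, 'is', '        03 my = is'),
    (3, 'life', '        03 life'),
    (2, 'lol', '    02 lol'),
    (2, 'im', '    02 im'),
    (2, 'im', '    02 joking = im'),
    (3, 'filler', '        03 filler'),
]

def compute_lines(master, skip):
    """Mirror of compute: keep only the chosen definition of each name."""
    seen = {}
    skip_to = None   # level of the block currently being dropped, if any
    kept = []
    for n, name, line in master:
        if skip_to and n > skip_to:
            continue                 # still inside a dropped block
        seen[name] = seen.get(name, 0) + 1
        if seen[name] != skip.get(name, 1):
            skip_to = n              # not the chosen definition: drop its block
            continue
        skip_to = None
        kept.append(line)
    return kept

for line in compute_lines(master, {'overflow': 2, 'is': 2, 'im': 1}):
    print(line)
```

Here the original overflow block (including 04 hi) is dropped, the friends block is kept, and inside it 03 is is dropped in favour of 03 my = is.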

find_all

With the above two functions available, it is clear that the last step is just to generate all the different combinations of definitions and print them out. That is what the function find_all does. Using master and namecount as returned by readfile, it systematically runs through all the available combinations of definitions and calls compute to create a block for each one. It gathers up all the unique blocks that can be created this way and returns them.
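The enumeration itself can be sketched on its own. The snippet below lists every skip dictionary that itertools.product yields for the sample namecount; sorting the keys is an assumption added here only to make the listing order reproducible.

```python
from itertools import product

namecount = {'im': 2, 'is': 2, 'overflow': 2}
keys = sorted(namecount)            # fixed order so the listing is reproducible
ranges = [range(1, namecount[k] + 1) for k in keys]

# One dictionary per combination of chosen definitions.
combos = [dict(zip(keys, combo)) for combo in product(*ranges)]
for skip in combos:
    print(skip)
print(len(combos))   # 2 * 2 * 2 = 8 candidate selections
```

There are 8 candidate selections but only 6 distinct blocks for the sample input: when the first definition of overflow is chosen, the friends block is dropped entirely, so the choice for is has no effect. The check for blocks already in the output list is what removes those duplicates.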

The output returned by find_all is a list of strings. Each string is the block that corresponds to one combination of defines/redefines. Using the sample input from the question, this shows what find_all returns:

In [16]: find_all(master, namecount)
Out[16]: 
['01 hello\n    02 stack\n    02 overflow\n        04 hi\n    02 lol\n    02 im\n',
 '01 hello\n    02 stack\n    02 friends = overflow\n        03 this\n        03 is\n        03 life\n    02 lol\n    02 im\n',
 '01 hello\n    02 stack\n    02 overflow\n        04 hi\n    02 lol\n    02 joking = im\n        03 filler\n',
 '01 hello\n    02 stack\n    02 friends = overflow\n        03 this\n        03 is\n        03 life\n    02 lol\n    02 joking = im\n        03 filler\n',
 '01 hello\n    02 stack\n    02 friends = overflow\n        03 this\n        03 my = is\n        03 life\n    02 lol\n    02 im\n',
 '01 hello\n    02 stack\n    02 friends = overflow\n        03 this\n        03 my = is\n        03 life\n    02 lol\n    02 joking = im\n        03 filler\n']

As an example, let's take the fourth string returned by find_all and print it so that it is easier to read:

In [18]: print find_all(master, namecount)[3]
01 hello
    02 stack
    02 friends = overflow
        03 this
        03 is
        03 life
    02 lol
    02 joking = im
        03 filler

In the complete script, the output from find_all is combined together and printed to stdout as follows:

out = find_all(master, namecount)              
print('\n'.join(out))

In this way, the output displays all possible blocks.

Answers for Earlier Versions of the Question

Answer for Original Question

awk 'f==0 && !/REDEFINES/{s=s"\n"$0;next} /REDEFINES/{f=1;print s t>("output" ++c ".txt");t=""} {t=t"\n"$0} END{print s t>("output" ++c ".txt")}' input

Explanation:

This program has the following variables:

f is a flag which is zero before the first REDEFINES line and one thereafter.
s contains all the text up to the first REDEFINES line.
t contains the text of the current REDEFINES section.
c is a counter which is used to determine the name of the output file.

The code works as follows:

f==0 && !/REDEFINES/{s=s"\n"$0;next}

Before the first REDEFINES line is encountered, the text is saved in the variable s and we skip the rest of the commands and jump to the next line.

/REDEFINES/{f=1;print s t>("output" ++c ".txt");t=""}

Every time that we encounter a REDEFINES line, we set the flag f to one and print the prolog section s along with the current REDEFINES section to a file named outputn.txt, where n is replaced by the value of the counter c. Because we are at the start of a new REDEFINES section, the variable t is set to empty.

{t=t"\n"$0}

Save the current line of this REDEFINES section to the variable t.

END{print s t>("output" ++c ".txt")}

The output file for the last REDEFINES section is printed.
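For readers more at home in Python, the same control flow can be sketched as below. The function name split_redefines and the sample lines are hypothetical, and the sketch returns one list of lines per output instead of writing output1.txt, output2.txt and so on.

```python
def split_redefines(lines):
    """Return one list of lines per output file, mirroring the awk rules."""
    prolog = []      # awk variable s: text before the first REDEFINES line
    section = []     # awk variable t: text of the current REDEFINES section
    outputs = []
    started = False  # awk flag f
    for line in lines:
        if 'REDEFINES' in line:
            started = True
            outputs.append(prolog + section)   # flush what has been collected
            section = []
        elif not started:
            prolog.append(line)
            continue
        section.append(line)
    outputs.append(prolog + section)           # END rule: last section
    return outputs

# Assumed sample in the style of the earlier question versions.
sample = ['01 A', '02 B', '02 C REDEFINES B', '03 D', '02 E REDEFINES B', '03 F']
for block in split_redefines(sample):
    print(block)
```

As in the awk version, the first output holds only the prolog, because nothing has been collected in the section when the first REDEFINES line is reached.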

A Minor Improvement

Each of the output files produced by the code above has a leading blank line. The code below removes that via the awk substr function:

awk '/REDEFINES/{f=1;print substr(s,2) t>("output" ++c ".txt");t=""} f==0 {s=s"\n"$0;next} {t=t"\n"$0} END{print substr(s,2) t>("output" ++c ".txt")}' input

For variety, this version has slightly different logic but, otherwise, achieves the same result.

Answer for Revised Question

awk 'f==1 && pre==$1 && !/REDEFINES/{tail=tail "\n" $0} /REDEFINES/{pre=$1;f=1;t[++c]="\n"$0} f==0 {head=head"\n"$0;next} pre!=$1{t[c]=t[c]"\n"$0} END{for (i=0;i<=c;i++) {print head t[i] tail>("output" (i+1) ".txt")}}' file

solution1
7 ACCEPTED 2014-10-04 20:13:45
