How to replace different blocks of text in all combinations using awk?

Question

I'm trying to replace blocks of lines that follow this pattern:

A block consists of a line together with the lines below it that have a higher level number (that is, the lines nested under it).
When a line contains "=", its block can replace the block named after the "=".

Let's see an example. This input:

01 hello
    02 stack
    02 overflow
        04 hi
    02 friends = overflow
        03 this
        03 is 
        03 my = is
        03 life
    02 lol
    02 im
    02 joking = im
        03 filler

would generate the following output (each hello block is one element of an array):

01 hello
    02 stack
    02 overflow
        04 hi
    02 lol
    02 im

01 hello
    02 stack
    02 overflow
        04 hi
    02 lol
    02 joking = im
        03 filler

01 hello
    02 stack
    02 friends = overflow
        03 this
        03 is 
        03 life
    02 lol
    02 im

01 hello
    02 stack
    02 friends = overflow
        03 this
        03 is 
        03 life
    02 lol
    02 joking = im
        03 filler

01 hello
    02 stack
    02 friends = overflow
        03 this
        03 my = is
        03 life
    02 lol
    02 im

01 hello
    02 stack
    02 friends = overflow
        03 this
        03 my = is
        03 life
    02 lol
    02 joking = im
        03 filler

I tried it this way:

#!/bin/bash

awk '{

    if ($0~/=/){
      level=$1
      oc=1
    }else if (oc && $1<=level){
        oc=0
    }

    if (!oc){
        print
    }

}' input.txt

But it only returns the first output that I need, and I don't know how to skip the 03 life line, which is inside friends.

How could I generate these outputs?

I wouldn't mind a Python or Perl solution if that is more comfortable for you.

Answer 1

Here is a Python script that reads the COBOL input file and prints all the possible combinations of defined and redefined variables:

#!/usr/bin/python
"""Read cobol file and print all possible redefines."""
import sys
from itertools import product

def readfile(fname):
    """Read cobol file & return a master list of lines and namecount of redefined lines."""
    master = []
    namecount = {}
    with open(fname) as f:
        for line in f:
            line = line.rstrip(' .\t\n')
            if not line:
                continue
            words = line.split()
            n = int(words[0])
            if '=' in words or 'REDEFINES' in words:
                name = words[3]
            else:
                name = words[1]
            master.append((n, name, line))
            namecount[name] = namecount.get(name, 0) + 1
    # py2.7: namecount = {key: val for key, val in namecount.items() if val > 1}
    namecount = dict((key, val) for key, val in namecount.items() if val > 1)

    return master, namecount

def compute(master, skip=None):
    """Return new cobol file given master and skip parameters."""
    if skip is None:
        skip = {}
    seen = {}
    skip_to = None
    output = ''
    for n, name, line in master:
        if skip_to and n > skip_to:
            continue
        seen[name] = seen.get(name, 0) + 1
        if seen[name] != skip.get(name, 1):
            skip_to = n
            continue
        skip_to = None
        output += line + '\n' 
    return output

def find_all(master, namecount):
    """Return list of all possible output files given master and namecount."""
    keys = namecount.keys()
    values = [namecount[k] for k in keys]
    out = []
    for combo in product(*[range(1, v + 1) for v in values]):
        skip = dict(zip(keys, combo))
        new = compute(master, skip=skip)
        if new not in out:
            out.append(new)
    return out

def main(argv):
    """Process command line arguments and print results."""
    fname = argv[-1]
    master, namecount = readfile(fname)
    out = find_all(master, namecount)
    print('\n'.join(out))

if __name__ == '__main__':
    main(sys.argv)

If the above script is saved in a file called cobol.py, then it can be run as:

python cobol.py name_of_input_file

The various possible combinations of defines and redefines will be displayed on stdout.

This script runs under either python2 (2.6+) or python3.

Explanation

The code uses three functions:

readfile reads the input file and returns two variables that summarize its structure.
compute takes two parameters and, from them, computes an output block.
find_all determines all the possible output blocks, uses compute to create them, and then returns them as a list.

Let's look at each function in more detail:

readfile

readfile takes the input file name as an argument and returns a list, master, and a dictionary, namecount. For every non-empty line in the input file, the list master has a tuple containing (1) the level number, (2) the name that is defined or redefined, and (3) the original line itself, with trailing whitespace and periods stripped. For the sample input file, readfile returns this value for master:

[(1, 'hello', '01 hello'),
 (2, 'stack', '    02 stack'),
 (2, 'overflow', '    02 overflow'),
 (4, 'hi', '        04 hi'),
 (2, 'overflow', '    02 friends = overflow'),
 (3, 'this', '        03 this'),
 (3, 'is', '        03 is'),
 (3, 'is', '        03 my = is'),
 (3, 'life', '        03 life'),
 (2, 'lol', '    02 lol'),
 (2, 'im', '    02 im'),
 (2, 'im', '    02 joking = im'),
 (3, 'filler', '        03 filler')]
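As a minimal sketch of how a single line becomes one of these tuples, the per-line logic of readfile can be pulled out into a helper. The name parse_line is hypothetical and is not part of the script above; it only mirrors what the loop body in readfile does.

```python
def parse_line(line):
    """Turn one input line into a (level, name, text) tuple, or None if blank."""
    # Same stripping as readfile: trailing spaces, periods, tabs, newlines.
    text = line.rstrip(' .\t\n')
    if not text:
        return None
    words = text.split()
    level = int(words[0])
    # On a redefinition such as "02 friends = overflow", the name that is
    # recorded is the one being redefined, i.e. the fourth word.
    if '=' in words or 'REDEFINES' in words:
        name = words[3]
    else:
        name = words[1]
    return level, name, text


print(parse_line('    02 friends = overflow\n'))
print(parse_line('        03 is \n'))
```

Note that the redefining line is filed under overflow, not friends. That is what later lets the original definition and its redefinition be treated as alternatives for the same name.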

readfile also returns the dictionary namecount, which has an entry for every name that gets redefined, together with a count of how many definitions and redefinitions exist for that name. For the sample input file, namecount has the value:

{'im': 2, 'is': 2, 'overflow': 2}

This indicates that im, is, and overflow each have two possible values.
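The counting step can also be sketched on its own. The version below uses collections.Counter rather than the dict.get idiom from the script, which is a substitution made here for brevity (the dict comprehension also needs Python 2.7 or later, unlike the script). The names list is simply the second field of each tuple in the sample master.

```python
from collections import Counter

# Names in the order they appear in the sample master list.
names = ['hello', 'stack', 'overflow', 'hi', 'overflow', 'this', 'is', 'is',
         'life', 'lol', 'im', 'im', 'filler']

counts = Counter(names)
# Keep only the names that have more than one definition.
namecount = {name: n for name, n in counts.items() if n > 1}
print(sorted(namecount.items()))
```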

readfile was of course designed to work with the input file format in the current version of the question. To the extent possible, it was also designed to work with the formats from the previous versions of this question. For example, variable redefinitions are accepted whether they are signaled with an equal sign (current version) or with the word REDEFINES as in previous versions. This is intended to make this script as flexible as possible.

compute

The function compute is what generates each output block. It uses two parameters. The first is master, which comes directly from readfile. The second is skip, which is derived from the namecount dictionary that was returned by readfile. For example, the namecount dictionary says that there are two possible definitions for im. This shows how compute can be used to generate the output block for each:

In [14]: print compute(master, skip={'im':1, 'is':1, 'overflow':1})
01 hello
    02 stack
    02 overflow
        04 hi
    02 lol
    02 im

In [15]: print compute(master, skip={'im':2, 'is':1, 'overflow':1})
01 hello
    02 stack
    02 overflow
        04 hi
    02 lol
    02 joking = im
        03 filler

Observe that the first call to compute above generated the block that uses the first definition of im, and the second call generated the block that uses the second definition.
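Inside compute, a definition that was not selected is dropped together with every line nested under it; that is the job of the skip_to variable. The same idea can be sketched in isolation. The helper drop_block and the shortened sample data are assumptions made for illustration and do not appear in the script.

```python
def drop_block(lines, start):
    """Return `lines` without the block that begins at index `start`.

    `lines` is a list of (level, text) pairs. A block is the line at
    `start` plus every following line with a strictly higher level.
    """
    level = lines[start][0]
    end = start + 1
    # Advance past all lines nested under the block's first line.
    while end < len(lines) and lines[end][0] > level:
        end += 1
    return lines[:start] + lines[end:]


sample = [(2, '02 lol'), (2, '02 im'), (2, '02 joking = im'), (3, '03 filler')]
for _, text in drop_block(sample, 2):
    print(text)
```

Dropping the block at index 2 removes both 02 joking = im and its child 03 filler, which corresponds to choosing the first definition of im.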

find_all

With the above two functions available, the last step is just to generate all the different combinations of definitions and print them out. That is what the function find_all does. Using master and namecount as returned by readfile, it systematically runs through all the available combinations of definitions and calls compute to create a block for each one. It gathers up all the unique blocks that can be created this way and returns them.
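The enumeration step relies on itertools.product. The following sketch shows only that step, using the namecount value from the sample input; sorting the keys is an addition made here so that the order is predictable.

```python
from itertools import product

namecount = {'im': 2, 'is': 2, 'overflow': 2}

keys = sorted(namecount)
# For each name, the candidate choices are 1..count (1 = original definition).
choices = [range(1, namecount[k] + 1) for k in keys]
combos = [dict(zip(keys, combo)) for combo in product(*choices)]

print(len(combos))  # 2 * 2 * 2 = 8
print(combos[0])
print(combos[-1])
```

There are 8 combinations but only 6 distinct blocks for the sample input. When the original overflow is selected, the friends block is skipped entirely, so the choice between is and my = is makes no difference. This is why find_all checks `if new not in out` before appending.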

The output returned by find_all is a list of strings. Each string is the block that corresponds to one combination of defines and redefines. Using the sample input from the question, this shows what find_all returns:

In [16]: find_all(master, namecount)
Out[16]: 
['01 hello\n    02 stack\n    02 overflow\n        04 hi\n    02 lol\n    02 im\n',
 '01 hello\n    02 stack\n    02 friends = overflow\n        03 this\n        03 is\n        03 life\n    02 lol\n    02 im\n',
 '01 hello\n    02 stack\n    02 overflow\n        04 hi\n    02 lol\n    02 joking = im\n        03 filler\n',
 '01 hello\n    02 stack\n    02 friends = overflow\n        03 this\n        03 is\n        03 life\n    02 lol\n    02 joking = im\n        03 filler\n',
 '01 hello\n    02 stack\n    02 friends = overflow\n        03 this\n        03 my = is\n        03 life\n    02 lol\n    02 im\n',
 '01 hello\n    02 stack\n    02 friends = overflow\n        03 this\n        03 my = is\n        03 life\n    02 lol\n    02 joking = im\n        03 filler\n']

As an example, let's take the fourth string returned by find_all and, for better formatting, print it:

In [18]: print find_all(master, namecount)[3]
01 hello
    02 stack
    02 friends = overflow
        03 this
        03 is
        03 life
    02 lol
    02 joking = im
        03 filler

In the complete script, the output from find_all is joined together and printed to stdout as follows:

out = find_all(master, namecount)              
print('\n'.join(out))

In this way, the output displays all possible blocks.

Answers for Earlier Versions of the Question

Answer for Original Question

awk 'f==0 && !/REDEFINES/{s=s"\n"$0;next} /REDEFINES/{f=1;print s t>("output" ++c ".txt");t=""} {t=t"\n"$0} END{print s t>("output" ++c ".txt")}' input

Explanation:

This program has the following variables:

f is a flag which is zero before the first REDEFINES line and one thereafter.
s contains all the text up to the first REDEFINES line.
t contains the text of the current REDEFINES section.
c is a counter which is used to determine the name of the output file.

The code works as follows:

f==0 && !/REDEFINES/{s=s"\n"$0;next}

Before the first redefine is encountered, the text is saved in the variable s, and we skip the rest of the commands and jump to the next line.

/REDEFINES/{f=1;print s t>("output" ++c ".txt");t=""}

Every time that we encounter a REDEFINES line, we set the flag f to one and print the prologue section s along with the current REDEFINES section to a file named outputn.txt, where n is replaced by the value of the counter c.
Because we are at the start of a new REDEFINES section, the variable t is set to empty.

{t=t"\n"$0}

Save the current line of this REDEFINES section to the variable t.

END{print s t>("output" ++c ".txt")}

The output file for the last REDEFINES section is printed.
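For readers who find the one-liner dense, the same control flow can be sketched in Python. This is an illustrative translation, not part of the original answer: the function name split_redefines and the sample lines are assumptions, and the sketch returns lists of lines instead of writing output1.txt, output2.txt, and so on.

```python
def split_redefines(lines):
    """Mimic the awk one-liner: return one list of lines per output file."""
    head = []       # plays the role of s: text before the first REDEFINES
    section = []    # plays the role of t: the current REDEFINES section
    outputs = []
    seen_redefines = False  # plays the role of the flag f
    for line in lines:
        if 'REDEFINES' in line:
            seen_redefines = True
            # Flush the prologue plus the section collected so far.
            outputs.append(head + section)
            section = []
        if seen_redefines:
            section.append(line)
        else:
            head.append(line)
    # Equivalent of the END rule: flush the last section.
    outputs.append(head + section)
    return outputs


# Hypothetical sample input in the REDEFINES style of the earlier question.
sample = ['01 A', '02 B', '02 C REDEFINES B', '03 D', '02 E REDEFINES B']
for block in split_redefines(sample):
    print(block)
```

As in the awk version, the first output holds only the prologue, because the flush at the first REDEFINES line happens while the current section is still empty.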

A Minor Improvement

Each of the output files produced by the code above has a leading blank line. The code below removes it with the awk substr function:

awk '/REDEFINES/{f=1;print substr(s,2) t>("output" ++c ".txt");t=""} f==0 {s=s"\n"$0;next} {t=t"\n"$0} END{print substr(s,2) t>("output" ++c ".txt")}' input

For variety, this version has slightly different logic but otherwise achieves the same result.

Answer for Revised Question

awk 'f==1 && pre==$1 && !/REDEFINES/{tail=tail "\n" $0} /REDEFINES/{pre=$1;f=1;t[++c]="\n"$0} f==0 {head=head"\n"$0;next} pre!=$1{t[c]=t[c]"\n"$0} END{for (i=0;i<=c;i++) {print head t[i] tail>("output" (i+1) ".txt")}}' file

Answer 1: 7, accepted, 2014-10-04 20:13:45
