How can I use awk to replace different blocks of text in all combinations?

Question

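I am trying to replace blocks of lines that follow this pattern:

A block consists of a line together with the lines below it that have a higher level number.
When a line contains "=", the block it starts can replace the block named after the "=".

Let's look at an example. This input: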

01 hello
    02 stack
    02 overflow
        04 hi
    02 friends = overflow
        03 this
        03 is 
        03 my = is
        03 life
    02 lol
    02 im
    02 joking = im
        03 filler

should generate the following output (each hello block is one element of an array):

01 hello
    02 stack
    02 overflow
        04 hi
    02 lol
    02 im

01 hello
    02 stack
    02 overflow
        04 hi
    02 lol
    02 joking = im
        03 filler

01 hello
    02 stack
    02 friends = overflow
        03 this
        03 is 
        03 life
    02 lol
    02 im

01 hello
    02 stack
    02 friends = overflow
        03 this
        03 is 
        03 life
    02 lol
    02 joking = im
        03 filler

01 hello
    02 stack
    02 friends = overflow
        03 this
        03 my = is
        03 life
    02 lol
    02 im

01 hello
    02 stack
    02 friends = overflow
        03 this
        03 my = is
        03 life
    02 lol
    02 joking = im
        03 filler

I tried it this way:

#!/bin/bash

awk '{

    if ($0~/=/){
      level=$1
      oc=1
    }else if (oc && $1<=level){
        oc=0
    }

    if (!oc){
        print
    }

}' input.txt

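But it only returns the first output I need, and I can't figure out how to skip the 03 life line inside friends.

How can I generate these outputs?

I wouldn't mind a python or perl solution if you are more comfortable with those.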

Answer 1

Here is a python script that reads a cobol input file and prints all possible combinations of defined and redefined variables:

#!/usr/bin/python
"""Read cobol file and print all possible redefines."""
import sys
from itertools import product

def readfile(fname):
    """Read cobol file & return a master list of lines and namecount of redefined lines."""
    master = []
    namecount = {}
    with open(fname) as f:
        for line in f:
            line = line.rstrip(' .\t\n')
            if not line:
                continue
            words = line.split()
            n = int(words[0])
            if '=' in words or 'REDEFINES' in words:
                name = words[3]
            else:
                name = words[1]
            master.append((n, name, line))
            namecount[name] = namecount.get(name, 0) + 1
    # py2.7: namecount = {key: val for key, val in namecount.items() if val > 1}
    namecount = dict((key, val) for key, val in namecount.items() if val > 1)

    return master, namecount

def compute(master, skip=None):
    """Return new cobol file given master and skip parameters."""
    if skip is None:
        skip = {}
    seen = {}
    skip_to = None
    output = ''
    for n, name, line in master:
        if skip_to and n > skip_to:
            continue
        seen[name] = seen.get(name, 0) + 1
        if seen[name] != skip.get(name, 1):
            skip_to = n
            continue
        skip_to = None
        output += line + '\n' 
    return output

def find_all(master, namecount):
    """Return list of all possible output files given master and namecount."""
    keys = namecount.keys()
    values = [namecount[k] for k in keys]
    out = []
    for combo in product(*[range(1, v + 1) for v in values]):
        skip = dict(zip(keys, combo))
        new = compute(master, skip=skip)
        if new not in out:
            out.append(new)
    return out

def main(argv):
    """Process command line arguments and print results."""
    fname = argv[-1]
    master, namecount = readfile(fname)
    out = find_all(master, namecount)
    print('\n'.join(out))

if __name__ == '__main__':
    main(sys.argv)

If the script above is saved in a file named cobol.py, it can be run as follows:

python cobol.py name_of_input_file

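The various possible combinations of defines and redefines will be displayed on stdout.

This script runs under either python2 (2.6+) or python3.

Explanation

The code uses three functions:

readfile reads the input file and returns two variables that summarize its structure.
compute takes two parameters and, from them, computes one output block.
find_all determines all the possible output blocks, uses compute to create them, and returns them as a list.

Let's look at each function in more detail: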

readfile

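readfile takes the input file name as its argument and returns the list master and the dictionary namecount. For every non-empty line in the input file, the list master holds a tuple containing (1) the level number, (2) the name that is defined or redefined, and (3) the line itself, with trailing whitespace and periods stripped. For the sample input file, readfile returns this value for master: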

[(1, 'hello', '01 hello'),
 (2, 'stack', '    02 stack'),
 (2, 'overflow', '    02 overflow'),
 (4, 'hi', '        04 hi'),
 (2, 'overflow', '    02 friends = overflow'),
 (3, 'this', '        03 this'),
 (3, 'is', '        03 is'),
 (3, 'is', '        03 my = is'),
 (3, 'life', '        03 life'),
 (2, 'lol', '    02 lol'),
 (2, 'im', '    02 im'),
 (2, 'im', '    02 joking = im'),
 (3, 'filler', '        03 filler')]

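readfile also returns the dictionary namecount, which has an entry for every name that is redefined, together with the number of definitions/redefinitions of that name. For the sample input file, namecount has this value: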

{'im': 2, 'is': 2, 'overflow': 2}

This indicates that im, is, and overflow each have two possible values.
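As a rough illustration of how namecount relates to master, the same dictionary can be derived from the master list after the fact. This is only a sketch, not the script's own code: readfile builds the counts while it reads the file, the helper name count_redefined is hypothetical, the master list is abbreviated from the sample above, and collections.Counter needs python 2.7 or later.

```python
from collections import Counter

def count_redefined(master):
    """Return {name: occurrences} for names that appear more than once."""
    counts = Counter(name for _level, name, _line in master)
    return dict((name, c) for name, c in counts.items() if c > 1)

# Abbreviated version of the master list shown above.
master = [
    (1, 'hello', '01 hello'),
    (2, 'overflow', '    02 overflow'),
    (2, 'overflow', '    02 friends = overflow'),
    (3, 'is', '        03 is'),
    (3, 'is', '        03 my = is'),
    (2, 'im', '    02 im'),
    (2, 'im', '    02 joking = im'),
]

namecount = count_redefined(master)
print(sorted(namecount.items()))   # [('im', 2), ('is', 2), ('overflow', 2)]
```

Names that occur only once, such as hello, are left out because they offer no choice.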

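readfile was, of course, designed to work with the input file format in the current version of the question. To the extent possible, it was also designed to work with the formats from previous versions of the question. For example, a variable redefinition is accepted whether it is signaled with an equals sign (current version) or with the word REDEFINES, as in previous versions. This is intended to make the script as flexible as possible.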
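To make the two accepted formats concrete, here is the per-line logic of readfile pulled out into a small standalone function. The name parse_line is hypothetical, and the REDEFINES sample line is an assumed layout, since the earlier version of the question is not reproduced here.

```python
def parse_line(line):
    """Return (level, name, stripped_line), or None for a blank line.

    For a redefinition, the returned name is the name being redefined.
    """
    line = line.rstrip(' .\t\n')
    if not line:
        return None
    words = line.split()
    level = int(words[0])
    if '=' in words or 'REDEFINES' in words:
        # "<level> <new> = <old>" or "<level> <new> REDEFINES <old>"
        name = words[3]
    else:
        name = words[1]
    return level, name, line

print(parse_line('    02 friends = overflow'))
print(parse_line('    02 friends REDEFINES overflow.'))  # assumed older layout
print(parse_line('    02 stack'))
```

Both redefinition lines come back with the name overflow, which is what lets the later steps treat friends as an alternative to overflow.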

compute

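The function compute is what generates each output block. It takes two arguments. The first is master, which comes directly from readfile. The second is skip, which is derived from the namecount dictionary returned by readfile: it maps each redefined name to the number (counting from 1) of the definition to keep. For example, the namecount dictionary says that there are two possible definitions for im. The following shows how compute can be used to generate the output block for each: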

In [14]: print compute(master, skip={'im':1, 'is':1, 'overflow':1})
01 hello
    02 stack
    02 overflow
        04 hi
    02 lol
    02 im

In [15]: print compute(master, skip={'im':2, 'is':1, 'overflow':1})
01 hello
    02 stack
    02 overflow
        04 hi
    02 lol
    02 joking = im
        03 filler

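Observe that the first call to compute above generated the block that uses the first definition of im, and the second call generated the block that uses the second definition.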
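The part of compute that drops a rejected definition together with everything nested under it can be isolated as follows. This is a simplified sketch: drop_blocks and its arguments are hypothetical names, and the rejected lines are passed in by index rather than worked out from skip as compute does.

```python
def drop_blocks(entries, rejected):
    """Return the lines of entries, omitting each rejected line and its children.

    entries  -- list of (level, line) tuples in file order
    rejected -- set of indexes into entries whose blocks should be dropped
    """
    kept = []
    skip_level = None              # level of the block currently being dropped
    for i, (level, line) in enumerate(entries):
        if skip_level is not None and level > skip_level:
            continue               # deeper level: still inside the dropped block
        if i in rejected:
            skip_level = level     # start dropping this block
            continue
        skip_level = None          # back at the same or a shallower level
        kept.append(line)
    return kept

entries = [
    (2, '02 im'),
    (2, '02 joking = im'),
    (3, '03 filler'),
    (2, '02 lol'),
]
print(drop_blocks(entries, {1}))   # ['02 im', '02 lol']
```

Rejecting 02 joking = im also removes 03 filler, because its level number is higher than that of the rejected line.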

find_all

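With the two functions above available, it is clear that the last step is just to generate all the different combinations of definitions and print them out. That is what find_all does. Using master and namecount as returned by readfile, it systematically runs through all the available combinations of definitions and calls compute to create a block for each one. It gathers up all the unique blocks that can be created this way and returns them.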
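The enumeration step can be seen on its own in the sketch below, which lists the skip dictionaries that itertools.product yields for the sample namecount. Sorting the keys is an addition made here only so that the listing is reproducible; find_all itself uses the dictionary's own key order.

```python
from itertools import product

namecount = {'im': 2, 'is': 2, 'overflow': 2}

keys = sorted(namecount)                        # fixed order for a stable listing
ranges = [range(1, namecount[k] + 1) for k in keys]
combos = [dict(zip(keys, combo)) for combo in product(*ranges)]

print(len(combos))                              # 8
for skip in combos:
    print(sorted(skip.items()))
```

Three names with two definitions each give eight candidate combinations, yet the sample input has only six distinct blocks: when the first definition of overflow is kept, the is lines nested inside friends are dropped along with it, so the choice for is makes no difference. This is why find_all checks whether a block is already in its list before appending it.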

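The output returned by find_all is a list of strings. Each string is the block that corresponds to one combination of defines/redefines. Using the sample input from the question, this shows what find_all returns: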

In [16]: find_all(master, namecount)
Out[16]: 
['01 hello\n    02 stack\n    02 overflow\n        04 hi\n    02 lol\n    02 im\n',
 '01 hello\n    02 stack\n    02 friends = overflow\n        03 this\n        03 is\n        03 life\n    02 lol\n    02 im\n',
 '01 hello\n    02 stack\n    02 overflow\n        04 hi\n    02 lol\n    02 joking = im\n        03 filler\n',
 '01 hello\n    02 stack\n    02 friends = overflow\n        03 this\n        03 is\n        03 life\n    02 lol\n    02 joking = im\n        03 filler\n',
 '01 hello\n    02 stack\n    02 friends = overflow\n        03 this\n        03 my = is\n        03 life\n    02 lol\n    02 im\n',
 '01 hello\n    02 stack\n    02 friends = overflow\n        03 this\n        03 my = is\n        03 life\n    02 lol\n    02 joking = im\n        03 filler\n']

For example, let's take the fourth string returned by find_all and, for better formatting, print it:

In [18]: print find_all(master, namecount)[3]
01 hello
    02 stack
    02 friends = overflow
        03 this
        03 is
        03 life
    02 lol
    02 joking = im
        03 filler

In the complete script, the output from find_all is joined together and printed to stdout as follows:

out = find_all(master, namecount)              
print('\n'.join(out))

This way, the output displays all the possible blocks.

Answers for earlier versions of the question

Answer to the original question

awk 'f==0 && !/REDEFINES/{s=s"\n"$0;next} /REDEFINES/{f=1;print s t>("output" ++c ".txt");t=""} {t=t"\n"$0} END{print s t>("output" ++c ".txt")}' input

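Explanation:

The program has the following variables:

f is a flag that is zero before the first REDEFINE and one thereafter.
s contains all the text before the first REDEFINE.
t contains the text of the current REDEFINE.
c is a counter used to determine the name of the output file.

The code works as follows:

f==0 && !/REDEFINES/{s=s"\n"$0;next}

Before the first redefine is encountered, the text is saved in the variable s, and we skip the rest of the commands and jump to the next line.

/REDEFINES/{f=1;print s t>("output" ++c ".txt");t=""}

Every time we encounter a REDEFINE line, we set the flag f to one and print the prolog section s along with the current REDEFINE section to a file named outputn.txt, where n is replaced by the value of the counter c.
Because we are at the start of a new REDEFINE section, the variable t is then set to empty.

{t=t"\n"$0}

Save the current line of this REDEFINE to the variable t.

END{print s t>("output" ++c ".txt")}

Print the output file for the last REDEFINE section.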
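For readers who find the one-liner dense, the same control flow can be sketched in python. This is an illustration only, not a replacement for the awk command: split_on_redefines is a hypothetical name, the sample lines are invented, and the function returns the text of each output file in a list instead of writing output1.txt, output2.txt and so on.

```python
def split_on_redefines(lines):
    """Return one text per file that the awk one-liner would write."""
    prolog = []             # lines before the first REDEFINES (awk variable s)
    section = []            # lines of the current REDEFINES section (awk variable t)
    outputs = []
    seen_redefine = False   # awk flag f
    for line in lines:
        if 'REDEFINES' in line:
            seen_redefine = True
            # Emit the prolog plus the section collected so far, then reset.
            outputs.append('\n'.join(prolog + section))
            section = []
        if seen_redefine:
            section.append(line)
        else:
            prolog.append(line)
    # Like the END rule: emit the last section.
    outputs.append('\n'.join(prolog + section))
    return outputs

# Invented sample input in the assumed REDEFINES layout.
sample = ['01 a', '02 b', '02 c REDEFINES b', '03 d', '02 e REDEFINES b', '03 f']
for text in split_on_redefines(sample):
    print(text)
    print('---')
```

As in the awk version, the first output holds the prolog alone, because no section has been collected yet when the first REDEFINES line is reached. Joining with '\n' also avoids the leading blank line that the awk command writes.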

一個小改進

Each output file produced by the code above has a leading blank line. The code below removes it with the awk substr function:

awk '/REDEFINES/{f=1;print substr(s,2) t>("output" ++c ".txt");t=""} f==0 {s=s"\n"$0;next} {t=t"\n"$0} END{print substr(s,2) t>("output" ++c ".txt")}' input

For variety, this version uses slightly different logic but otherwise achieves the same result.

Answer to the revised question

awk 'f==1 && pre==$1 && !/REDEFINES/{tail=tail "\n" $0} /REDEFINES/{pre=$1;f=1;t[++c]="\n"$0} f==0 {head=head"\n"$0;next} pre!=$1{t[c]=t[c]"\n"$0} END{for (i=0;i<=c;i++) {print head t[i] tail>("output" (i+1) ".txt")}}' file

Solution 1
7 Accepted 2014-10-04 20:13:45
