How to replace different blocks of text in all combinations using awk?
I'm trying to replace blocks of lines like this pattern:
Let's look at an example. This input:
01 hello
02 stack
02 overflow
04 hi
02 friends = overflow
03 this
03 is
03 my = is
03 life
02 lol
02 im
02 joking = im
03 filler
Would generate the following output (each hello block is one element of an array):
01 hello
02 stack
02 overflow
04 hi
02 lol
02 im
01 hello
02 stack
02 overflow
04 hi
02 lol
02 joking = im
03 filler
01 hello
02 stack
02 friends = overflow
03 this
03 is
03 life
02 lol
02 im
01 hello
02 stack
02 friends = overflow
03 this
03 is
03 life
02 lol
02 joking = im
03 filler
01 hello
02 stack
02 friends = overflow
03 this
03 my = is
03 life
02 lol
02 im
01 hello
02 stack
02 friends = overflow
03 this
03 my = is
03 life
02 lol
02 joking = im
03 filler
I tried it this way:
#!/bin/bash
awk '{
if ($0~/=/){
level=$1
oc=1
}else if (oc && $1<=level){
oc=0
}
if (!oc){
print
}
}' input.txt
But it only returns the first output that I need, and I don't know how to skip the 03 life line, which is within friends.
How could I generate these outputs?
I wouldn't mind a Python or Perl solution if that is more comfortable for you.
Here is a Python script that reads the COBOL input file and prints all the possible combinations of defined and redefined variables:
#!/usr/bin/python
"""Read cobol file and print all possible redefines."""
import sys
from itertools import product


def readfile(fname):
    """Read cobol file & return a master list of lines and namecount of redefined lines."""
    master = []
    namecount = {}
    with open(fname) as f:
        for line in f:
            line = line.rstrip(' .\t\n')
            if not line:
                continue
            words = line.split()
            n = int(words[0])
            if '=' in words or 'REDEFINES' in words:
                name = words[3]
            else:
                name = words[1]
            master.append((n, name, line))
            namecount[name] = namecount.get(name, 0) + 1
    # py2.7: namecount = {key: val for key, val in namecount.items() if val > 1}
    namecount = dict((key, val) for key, val in namecount.items() if val > 1)
    return master, namecount


def compute(master, skip=None):
    """Return new cobol file given master and skip parameters."""
    if skip is None:
        skip = {}
    seen = {}
    skip_to = None
    output = ''
    for n, name, line in master:
        if skip_to and n > skip_to:
            continue
        seen[name] = seen.get(name, 0) + 1
        if seen[name] != skip.get(name, 1):
            skip_to = n
            continue
        skip_to = None
        output += line + '\n'
    return output


def find_all(master, namecount):
    """Return list of all possible output files given master and namecount."""
    keys = namecount.keys()
    values = [namecount[k] for k in keys]
    out = []
    for combo in product(*[range(1, v + 1) for v in values]):
        skip = dict(zip(keys, combo))
        new = compute(master, skip=skip)
        if new not in out:
            out.append(new)
    return out


def main(argv):
    """Process command line arguments and print results."""
    fname = argv[-1]
    master, namecount = readfile(fname)
    out = find_all(master, namecount)
    print('\n'.join(out))


if __name__ == '__main__':
    main(sys.argv)
If the above script is saved in a file called cobol.py, then it can be run as:
python cobol.py name_of_input_file
The various possible combinations of defines and redefines will be displayed on stdout.
This script runs under either python2 (2.6+) or python3.
The code uses three functions:
readfile reads the input file and returns two variables that summarize the structure of what is in it.
compute takes two parameters and, from them, computes an output block.
find_all determines all the possible output blocks, uses compute to create them, and then returns them as a list.
Let's look at each function in more detail:
readfile
readfile takes the input file name as an argument and returns a list, master, and a dictionary, namecount. For every non-empty line in the input file, the list master has a tuple containing (1) the level number, (2) the name that is defined or redefined, and (3) the original line itself. For the sample input file, readfile returns this value for master:
[(1, 'hello', '01 hello'),
(2, 'stack', ' 02 stack'),
(2, 'overflow', ' 02 overflow'),
(4, 'hi', ' 04 hi'),
(2, 'overflow', ' 02 friends = overflow'),
(3, 'this', ' 03 this'),
(3, 'is', ' 03 is'),
(3, 'is', ' 03 my = is'),
(3, 'life', ' 03 life'),
(2, 'lol', ' 02 lol'),
(2, 'im', ' 02 im'),
(2, 'im', ' 02 joking = im'),
(3, 'filler', ' 03 filler')]
readfile also returns the dictionary namecount, which has an entry for every name that gets redefined, together with a count of how many definitions/redefinitions there are for that name. For the sample input file, namecount has the value:
{'im': 2, 'is': 2, 'overflow': 2}
This indicates that im, is, and overflow each have two possible definitions.
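As a minimal sketch of how such a count can be derived (this is not the script's own code; the names list below is typed in by hand from the second field of each tuple in the sample master), collections.Counter gives the same result:

```python
from collections import Counter

# Names in the order they appear in the sample input
# (second field of each tuple in master); typed in by hand for illustration.
names = ['hello', 'stack', 'overflow', 'hi', 'overflow', 'this', 'is',
         'is', 'life', 'lol', 'im', 'im', 'filler']

counts = Counter(names)

# Keep only the names that are defined more than once.
namecount = {name: n for name, n in counts.items() if n > 1}

print(sorted(namecount.items()))
# [('im', 2), ('is', 2), ('overflow', 2)]
```

Names that appear only once, such as hello or lol, are dropped because there is nothing to choose between for them.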
readfile was of course designed to work with the input file format in the current version of the question. To the extent possible, it was also designed to work with the formats from the previous versions of this question. For example, variable redefinitions are accepted whether they are signaled with an equal sign (current version) or with the word REDEFINES, as in previous versions. This is intended to make this script as flexible as possible.
compute
The function compute is what generates each output block. It uses two parameters. The first is master, which comes directly from readfile. The second is skip, which is derived from the namecount dictionary that was returned by readfile: it maps each redefined name to the number of the definition to keep (1 for the first definition, 2 for the second, and so on; names missing from skip default to 1). For example, the namecount dictionary says that there are two possible definitions for im. This shows how compute can be used to generate the output block for each:
In [14]: print compute(master, skip={'im':1, 'is':1, 'overflow':1})
01 hello
02 stack
02 overflow
04 hi
02 lol
02 im
In [15]: print compute(master, skip={'im':2, 'is':1, 'overflow':1})
01 hello
02 stack
02 overflow
04 hi
02 lol
02 joking = im
03 filler
Observe that the first call to compute above generated the block that uses the first definition of im and the second call generated the block that uses the second definition.
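To make the skipping mechanism easier to follow, here is a small self-contained sketch of the same idea on a made-up six-line input. The function name select and the toy master list are hypothetical and exist only for this illustration; the loop follows the logic of compute but returns a list of lines instead of a string.

```python
def select(master, choice):
    """Keep, for each name, only the definition numbered choice[name] (default 1)."""
    seen = {}        # how many definitions of each name have been met so far
    skip_to = None   # level of the definition currently being skipped, if any
    kept = []
    for level, name, line in master:
        # Lines nested deeper than a skipped definition belong to it: drop them.
        if skip_to is not None and level > skip_to:
            continue
        seen[name] = seen.get(name, 0) + 1
        if seen[name] != choice.get(name, 1):
            skip_to = level      # not the chosen definition: skip it and its children
            continue
        skip_to = None
        kept.append(line)
    return kept


# Hypothetical toy input: 'a' is defined once and then redefined as 'b = a'.
master = [(1, 'rec', '01 rec'),
          (2, 'a', '02 a'),
          (3, 'x', '03 x'),
          (2, 'a', '02 b = a'),
          (3, 'y', '03 y'),
          (2, 'c', '02 c')]

print(select(master, {'a': 1}))
# ['01 rec', '02 a', '03 x', '02 c']
print(select(master, {'a': 2}))
# ['01 rec', '02 b = a', '03 y', '02 c']
```

In both calls the children of the rejected definition (03 y in the first, 03 x in the second) disappear with it, because their level number is greater than the level stored in skip_to.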
find_all
With the above two functions available, it is clear that the last step is just to generate all the different combinations of definitions and print them out. That is what the function find_all does. Using master and namecount as returned by readfile, it systematically runs through all the available combinations of definitions and calls compute to create a block for each one. It gathers up all the unique blocks that can be created this way and returns them.
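The enumeration step can be sketched on its own, assuming the namecount value shown earlier for the sample input. The keys are sorted here only to make the order reproducible; the script itself uses whatever order the dictionary provides.

```python
from itertools import product

namecount = {'im': 2, 'is': 2, 'overflow': 2}

keys = sorted(namecount)                              # fixed order for this sketch
ranges = [range(1, namecount[k] + 1) for k in keys]   # 1..count for each name

# One dictionary per combination, in the form that compute accepts as skip.
combos = [dict(zip(keys, combo)) for combo in product(*ranges)]

print(len(combos))
# 8
```

There are 2 x 2 x 2 = 8 combinations but only six distinct blocks for the sample input: is is defined inside the friends redefinition of overflow, so when the first definition of overflow is chosen, the choice made for is has no effect on the output. This is why the uniqueness check in find_all matters.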
The output returned by find_all is a list of strings. Each string is the block that corresponds to one combination of defines/redefines. Using the sample input from the question, this shows what find_all returns:
In [16]: find_all(master, namecount)
Out[16]:
['01 hello\n 02 stack\n 02 overflow\n 04 hi\n 02 lol\n 02 im\n',
'01 hello\n 02 stack\n 02 friends = overflow\n 03 this\n 03 is\n 03 life\n 02 lol\n 02 im\n',
'01 hello\n 02 stack\n 02 overflow\n 04 hi\n 02 lol\n 02 joking = im\n 03 filler\n',
'01 hello\n 02 stack\n 02 friends = overflow\n 03 this\n 03 is\n 03 life\n 02 lol\n 02 joking = im\n 03 filler\n',
'01 hello\n 02 stack\n 02 friends = overflow\n 03 this\n 03 my = is\n 03 life\n 02 lol\n 02 im\n',
'01 hello\n 02 stack\n 02 friends = overflow\n 03 this\n 03 my = is\n 03 life\n 02 lol\n 02 joking = im\n 03 filler\n']
As an example, let's take the fourth string returned by find_all and, for better formatting, print it:
In [18]: print find_all(master, namecount)[3]
01 hello
02 stack
02 friends = overflow
03 this
03 is
03 life
02 lol
02 joking = im
03 filler
In the complete script, the output from find_all is joined together and printed to stdout as follows:
out = find_all(master, namecount)
print('\n'.join(out))
In this way, the output displays all possible blocks.
awk 'f==0 && !/REDEFINES/{s=s"\n"$0;next} /REDEFINES/{f=1;print s t>("output" ++c ".txt");t=""} {t=t"\n"$0} END{print s t>("output" ++c ".txt")}' input
This program has the following variables:
f is a flag which is zero before the first REDEFINE and one thereafter.
s contains all the text up to the first REDEFINE.
t contains the text of the current REDEFINE.
c is a counter which is used to determine the name of the output file.
The code works as follows:
f==0 && !/REDEFINES/{s=s"\n"$0;next}
Before the first redefine is encountered, the text is saved in the variable s, and we skip the rest of the commands and jump to the next line.
/REDEFINES/{f=1;print s t>("output" ++c ".txt");t=""}
Every time that we encounter a REDEFINE line, we set the flag f to one and print the prolog section s, along with the REDEFINE section collected so far in t, to a file named outputn.txt, where n is replaced by the value of the counter c.
Because we are at the start of a new REDEFINE section, the variable t is then set to empty.
{t=t"\n"$0}
Save the current line of this REDEFINE to the variable t.
END{print s t>("output" ++c ".txt")}
The output file for the last REDEFINE section is printed.
Each of the output files produced by the code above has a leading blank line. The code below removes that via the awk substr function:
awk '/REDEFINES/{f=1;print substr(s,2) t>("output" ++c ".txt");t=""} f==0 {s=s"\n"$0;next} {t=t"\n"$0} END{print substr(s,2) t>("output" ++c ".txt")}' input
For variety, this version has slightly different logic but otherwise achieves the same result.
awk 'f==1 && pre==$1 && !/REDEFINES/{tail=tail "\n" $0} /REDEFINES/{pre=$1;f=1;t[++c]="\n"$0} f==0 {head=head"\n"$0;next} pre!=$1{t[c]=t[c]"\n"$0} END{for (i=0;i<=c;i++) {print head t[i] tail>("output" (i+1) ".txt")}}' file
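For readers more comfortable with Python, the logic of the first awk program in this answer (the one with the variables f, s, t and c, which keeps the leading blank line) can be sketched roughly as follows. This is an illustration, not a replacement for the awk code: the function name split_redefines and the sample lines are hypothetical, since the REDEFINES version of the input is not shown here, and the blocks are returned as a list instead of being written to outputn.txt files.

```python
def split_redefines(lines):
    """Return one text block per output file, mirroring the awk logic."""
    head = ''       # awk variable s: everything before the first REDEFINES line
    section = ''    # awk variable t: text of the current REDEFINES section
    seen = False    # awk flag f: has a REDEFINES line been met yet?
    blocks = []     # stands in for output1.txt, output2.txt, ...
    for line in lines:
        is_redefine = 'REDEFINES' in line
        if not seen and not is_redefine:
            head += '\n' + line
            continue
        if is_redefine:
            seen = True
            blocks.append(head + section)   # flush what was collected so far
            section = ''
        section += '\n' + line
    blocks.append(head + section)           # the END rule: flush the last section
    return blocks


# Hypothetical sample input with two REDEFINES sections.
sample = ['01 A', '02 B', '02 C REDEFINES B', '03 D', '02 E REDEFINES B', '03 F']
blocks = split_redefines(sample)

print(len(blocks))
# 3
print(repr(blocks[1]))
# '\n01 A\n02 B\n02 C REDEFINES B\n03 D'
```

Two details of the awk program show up in the sketch: every block starts with a newline, which is the leading blank line that the substr version removes, and the block written at the first REDEFINES line contains only the prolog, because t is still empty at that point.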