
Use sed to fix problematic text file's lines

During a pipelined process in a custom framework I am working on, I need to process files generated by a certain engine. The problem is that the file format is broken for certain lines: the lines are printed inconsistently. Like this:

/VERY_LONG_NAME_FOR_A_SPECIFIC_ITEM_IN_THIS_PROCESS_A 
                                                             0           0        0.00 
/VERY_LONG_NAME_FOR_A_SPECIFIC_ITEM_IN_THIS_PROCESS_B
                                                             0           0        0.00 
/VERY_LONG_NAME_FOR_A_SPECIFIC_ITEM_IN_THIS_PROCESS_C 
                                                             1           1       100.00
/SLIGHTLY_SMALLER_NAME_OF_ITEM_D                             0           1        50.00
. 
. 
. 
.

Which I wish to transform to:

/VERY_LONG_NAME_FOR_A_SPECIFIC_ITEM_IN_THIS_PROCESS_A        0           0    0.00                                                            
/VERY_LONG_NAME_FOR_A_SPECIFIC_ITEM_IN_THIS_PROCESS_B        0           0    0.00 
/VERY_LONG_NAME_FOR_A_SPECIFIC_ITEM_IN_THIS_PROCESS_C        1           1    100.00
/SLIGHTLY_SMALLER_NAME_OF_ITEM_D                             0           1    50.00

The problem here is that for some entries (A, B, C) the fields appear after a \n, while for other entries (e.g., D) the line is consistent.

Using a very helpful tool (regex101), I've managed to build a regular expression that matches and groups the lines. The regex is the following:

(\/.+)\n\s+([0-1]\s+[0-1]\s+.+\b)
-----  ---  ---------------------
  |     |            |
  |     |            |=> groups the secondary line containing the digits (the first two are only 0|1)
  |     |
  |     |=> new line along with all the whitespace until the first digit
  |
  |=> groups the first string-stream (ex: /VERY_LONG_NAME_...)

I am trying to re-create the file using sed (most probably in an erroneous way) as:

sed -r 's/(\/.+)\n\s+([0-1]\s+[0-1]\s+.+\b)/ \1 \2/' filename.txt

which, of course, does not work as I expected. Am I doing something wrong here, at least syntactically? Furthermore, I do not wish to modify the "correct" lines, meaning the lines that are not broken into two. I just want to fix the problematic ones.

With awk and column:

awk 'NF==1{x=$0; getline; $0=x OFS $0} {print}' filename.txt | column -t

If the current row has only one field (NF==1), save the complete row to the variable x, read the next row (getline), and concatenate the saved row (x), the output field separator (OFS), and the current row ($0) into the new current row ($0=x OFS $0).
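Note that column -t re-aligns every row, including the ones that were already correct. If those rows have to stay exactly as they are, one option might be to run only the awk join step. Below is a minimal sketch on a made-up three-line sample; the sample text, the function name join_wrapped, and the variable names are illustrative assumptions, not part of the original file:

```shell
# Hypothetical sample: one wrapped entry followed by one intact entry.
sample='/LONG_NAME_A
    0 0 0.00
/NAME_D     0 1 50.00'

# Join a name-only line (NF==1) with the line that follows it.
# Every other line is printed unchanged.
join_wrapped() {
  awk 'NF==1 { name=$0; if ((getline) > 0) $0 = name OFS $0 } { print }'
}

joined=$(printf '%s\n' "$sample" | join_wrapped)
printf '%s\n' "$joined"
```

The check on the return value of getline is a small defensive addition: if a name-only line happens to be the last line of the file, it is printed as-is instead of being joined with nothing. The joined row keeps the leading whitespace of the second line, so the columns are not aligned the way column -t would align them.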

Output:

/VERY_LONG_NAME_FOR_A_SPECIFIC_ITEM_IN_THIS_PROCESS_A  0  0  0.00
/VERY_LONG_NAME_FOR_A_SPECIFIC_ITEM_IN_THIS_PROCESS_B  0  0  0.00
/VERY_LONG_NAME_FOR_A_SPECIFIC_ITEM_IN_THIS_PROCESS_C  1  1  100.00
/SLIGHTLY_SMALLER_NAME_OF_ITEM_D                       0  1  50.00

See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
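As for the sed attempt in the question: sed reads its input one line at a time, so the pattern space never contains a \n unless a command such as N appends the next line first. That is why a regex spanning two lines cannot match there. A possible sed-only sketch, assuming a sed that accepts -E (GNU or BSD) and that the broken lines consist of nothing but the name, could look like this; the sample data and variable names are made up for illustration:

```shell
# Hypothetical sample: one wrapped entry followed by one intact entry.
sample='/LONG_NAME_A
    0 0 0.00
/NAME_D     0 1 50.00'

# For a line that holds only a /NAME (optionally followed by trailing blanks):
#   N     appends the next input line to the pattern space, separated by \n
#   s///  replaces that newline and the surrounding whitespace with one space
# Lines that already carry their fields do not match the address and pass through.
fixed=$(printf '%s\n' "$sample" |
  sed -E '/^\/[^[:space:]]+[[:space:]]*$/ { N; s/[[:space:]]*\n[[:space:]]+/ /; }')
printf '%s\n' "$fixed"
# first line becomes: /LONG_NAME_A 0 0 0.00
```

This variant leaves the already-correct lines byte-for-byte unchanged, but it does not align the columns; piping the result through column -t would still be needed for that.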
