Bash: separate rows by a column criterion and append to a text file using loops (or a better alternative)

I need help separating a long tabulated text file into groups of similar rows. The task is to read from a series of disorganized files, format them, and then separate the result into groups of similar rows. From:

MMP,iP,c,002309.82,iS,002311.09,3208,18.87,L,
CNOP,eP,,003544.06,eS,003551.64,,151.00,,
SNP,iP,c,003552.87,iS,003605.55,1924.5,158.07,L,
GUIM,eP,,003554.16,eS,003608.49,,99.00,,
DCP,eP,c,003559.26,,,1214.0,88.89,L,
LLP,eP,c,003606.33,iS,003628.98,389.7,131.23,L,
PAGZ,eP,,003608.48,eS,003631.00,,76.00,,
MSLP,eP,,003618.28,,,,,,
OCLP,eP,,003618.78,eS,003646.82,,,,
TBP,eP,,003640.19,,,282.4,59.35,L,
TBP,eP,,012138.99,,,75.4,11.26,L,
SNP,iP,c,033417.94,iS,033420.44,1023.2,45.51,L,
TBP,eP,,033513.03,,,52.8,12.58,L,
SIPP,eP,d,043457.16,eS,043519.77,1212.00,109.75,L,
LLP,iP,c,054745.48,iS,054753.07,1588.5,65.12,L,
TBP,iP,c,054746.49,eS,054752.88,703.3,32.50,L,
MSLP,eP,,054747.92,eS,054757.96,,63.00,,
KCP,iP,d,082343.73,,,-71.96,180.11,T,
PGP,eP,d,085017.97,eS,085021.92,2428,18.5,L,
PGP,eP,d,085017.97,eS,085021.92,2428.00,18.50,L,
LLP,iP,d,095505.28,iS,095513.89,2940.7,105.86,L,
TBP,eP,c,095506.67,,,704.8,42.51,L,
...

I was able to format everything using awk, read -r line, conditional statements and printf.

Now I have this formatted text:

TBP    iP c    014449.61   iS  014455.09       2366.20     29.41   L
LLP    iP d    014450.82   iS  014457.36       1414.20     82.30   L
MSLP   eP      014456.98   eS  014509.62                   72.00    
OCLP   eP      014505.60   eS  014524.97                   69.00    
DCP    eP c    014507.15   eS  014530.52        268.80    115.79   L
GUIM   eP      014514.78   eS  014534.25                            
PAGZ   eP      014520.03   eS  014546.38                            
BUKP   eP      014520.40   eS  014546.68                            
CVP    iP d    015016.91   iS  015037.11       3695.00    162.54   L
SIPP   iP c    020817.81                                           T
BBPS   eP      025007.36   eS  025022.74                  310.00    
SGCP   eP      025009.43   eS  025025.00                  258.00    
APYP   eP      025013.77   eS  025033.51                  294.00    
SIPP   eP c    025017.98   eS  025049.24      32739.00    267.36   L
ABRA   eP      025018.32                                  317.00    
CAUP   eP      025027.99                                  317.00    
SMPP   eP      025038.70   eS  025116.93                            
BOLP   eP      025039.33   eS  025116.19                  331.00    
BALP   eP      025042.59   eS  025125.51                  280.00    
PCP    eP c    025046.89   eS  025132.15        543.00    249.71   L
LQP    eP c    025105.80                       1888.00    269.35   L
TGY    eP c    025107.21                       1728.00    183.40   L
GQP    eP c    025109.23   eS  025210.11       1481.10    180.41   L
KCP    iP d    025249.58                        -41.73    324.15   T
LUBP   eP      043452.34   eS  043459.96                   68.00    
PGP    eP c    043456.97   eS  043501.27      42702.00    196.60   L
TGY    eP d    043457.41   eS  043507.61      33835.00    157.27   L
LQP    iP d    043502.88   iS  043517.81       6307.00    168.13   L
...

Now I'm stuck on the row separation. The deciding criterion for separation is the first 4 characters of column $4. I tried isolating $4 and then cutting its first 4 characters as a basis for comparison:

From:

014449.61
014450.82
014456.98
014505.60
014507.15
014514.78
014520.03
014520.40
015016.91
...

To:

0144
0144
0144
0145
0145
0145
0145
0145
0150
...

But I don't know how to compare them and append the matching rows together. uniq can be used for the comparison, but I can't append the rows together the way I hoped.

The output that I hope for is:

2014Sept01 0144
TBP    iP c    014449.61   iS  014455.09       2366.20     29.41   L
LLP    iP d    014450.82   iS  014457.36       1414.20     82.30   L
MSLP   eP      014456.98   eS  014509.62                   72.00    

2014Sept01 0145
OCLP   eP      014505.60   eS  014524.97                   69.00    
DCP    eP c    014507.15   eS  014530.52        268.80    115.79   L
GUIM   eP      014514.78   eS  014534.25            
PAGZ   eP      014520.03   eS  014546.38            
BUKP   eP      014520.40   eS  014546.68    

2014Sept01 0250
BBPS   eP      025007.36   eS  025022.74                  310.00    
SGCP   eP      025009.43   eS  025025.00                  258.00    
APYP   eP      025013.77   eS  025033.51                  294.00    
SIPP   eP c    025017.98   eS  025049.24      32739.00    267.36   L
ABRA   eP      025018.32                                  317.00    
CAUP   eP      025027.99                                  317.00    
SMPP   eP      025038.70   eS  025116.93            
BOLP   eP      025039.33   eS  025116.19                  331.00    
BALP   eP      025042.59   eS  025125.51                  280.00    
PCP    eP c    025046.89   eS  025132.15        543.00    249.71   L

2014Sept01 0251
LQP    eP c    025105.80                       1888.00    269.35   L
TGY    eP c    025107.21                       1728.00    183.40   L
GQP    eP c    025109.23   eS  025210.11       1481.10    180.41   L

2014Sept01 0252
KCP    iP d    025249.58                        -41.73    324.15   T

2014Sept01 0434
LUBP   eP      043452.34   eS  043459.96                   68.00    
PGP    eP c    043456.97   eS  043501.27      42702.00    196.60   L
TGY    eP d    043457.41   eS  043507.61      33835.00    157.27   L

2014Sept01 0435
LQP    iP d    043502.88   iS  043517.81       6307.00    168.13   L
BOAC   eP      043503.74   eS  043519.98                  139.00    
BUSP   eP      043507.46   eS  043527.58                  146.00    
OTRP   eP      043512.77   eS  043535.66                   97.00    
GQP    eP d    043513.54   iS  043537.15        714.60    117.54   L
PCP    eP c    043514.59   eS  043538.74        441.00    151.61   L
BALP   eP      043521.07   eS  043550.06                  172.00    
ENPP   eP      043521.51   eS  043546.79            
SMPP   eP      043521.96   eS  043551.39                  341.00    
JAP    eP d    043522.67                       2732.70    161.82   L
CUYO   eP      043522.99                                  160.00    
CAUP   eP      043536.77   eS  043616.73                  210.00    
...

The headers can simply be appended using echo; the problem is separating the rows that share the same cut value of $4 and then appending the next set of similar rows. I still don't know how to loop this task.

Any help is appreciated. Thank you very much.

You can use sort to sort the input before processing. This puts similar lines next to each other, which solves your problem.

Since you're not sorting on the beginning of the line, you will need -t to set the field separator to a comma, and -k to tell it to sort on the fourth field.

Something like sort -t, -k4,4 ought to work (untested).
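
As a minimal sketch of that suggestion, the snippet below sorts a few of the comma-separated sample rows on the fourth field only. The function name sort_by_arrival is made up for illustration, and LC_ALL=C is an assumption added here to keep the ordering byte-wise and reproducible across locales.

```shell
#!/bin/sh
# Sort comma-separated rows by the 4th field (the hhmmss.ss time).
# sort_by_arrival is a hypothetical helper name, not part of the original post.
sort_by_arrival() {
    # -t,    use a comma as the field separator
    # -k4,4  the sort key starts and ends at field 4
    LC_ALL=C sort -t, -k4,4
}

# A few rows from the question, deliberately out of order.
sort_by_arrival <<'EOF'
TBP,eP,,033513.03,,,52.8,12.58,L,
MMP,iP,c,002309.82,iS,002311.09,3208,18.87,L,
SNP,iP,c,033417.94,iS,033420.44,1023.2,45.51,L,
CNOP,eP,,003544.06,eS,003551.64,,151.00,,
EOF
```

The rows come out in the order MMP, CNOP, SNP, TBP. Because the times are zero-padded to a fixed width, a plain lexical sort on that field already gives chronological order, so a numeric key is optional here.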

In awk, using the very first file (where did that 2014Sept01 come from?):

$ awk '
BEGIN { FS=","; OFS="\t" }  # tabs as output delimiter; not pixel perfect, column might help
{
    g=substr($4,1,4)        # first 4 chars of $4 are the grouper
}
g!=p || NR==1 {             # if grouper changes or NR==1
    print (NR==1?"":ORS) g  # print grouper, preceded by an extra newline except for the first group
    p=g                     # store previous grouper to detect change
}
{
    $1=$1                   # rebuild record for OFS
    print                   # output
}' <(sort -t, -k4,4n file)  # sort file numerically on the 4th column only
# | column                  # try to pipe the above to column?
0023
MMP     iP      c       002309.82       iS      002311.09       3208    18.87   L

0035
CNOP    eP              003544.06       eS      003551.64               151.00
SNP     iP      c       003552.87       iS      003605.55       1924.5  158.07  L
GUIM    eP              003554.16       eS      003608.49               99.00
DCP     eP      c       003559.26                       1214.0  88.89   L

0036
...
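
If the starting point is the already formatted, space-aligned text rather than the CSV, field numbers are no longer reliable: an empty column is just a run of spaces, so the time is $4 on some rows and $3 on others. A possible workaround, sketched below, is to locate the first hhmmss.ss token with match() and group on its first four characters. The function name group_by_minute and the idea of passing the date label (2014Sept01) as an argument are assumptions for illustration, since the question does not say where the date comes from. The input is assumed to be in time order already, and lines without a time token (such as "...") are skipped.

```shell
#!/bin/sh
# group_by_minute is a hypothetical helper; $1 is a date label supplied by
# the caller (an assumption, the original post does not say where it is from).
group_by_minute() {
    awk -v date="$1" '
    # Find the first hhmmss.ss token anywhere on the line; field numbers are
    # unreliable here because empty columns are just runs of spaces.
    match($0, /[0-9][0-9][0-9][0-9][0-9][0-9]\.[0-9][0-9]/) {
        g = substr($0, RSTART, 4)       # hhmm is the grouping key
        if (g != prev) {                # key changed: start a new block
            if (seen) print ""          # blank line between blocks
            print date " " g            # header, e.g. "2014Sept01 0144"
            prev = g
            seen = 1
        }
        print                           # the row itself, unchanged
    }'
}

# Four rows from the formatted text in the question.
group_by_minute 2014Sept01 <<'EOF'
TBP    iP c    014449.61   iS  014455.09       2366.20     29.41   L
LLP    iP d    014450.82   iS  014457.36       1414.20     82.30   L
OCLP   eP      014505.60   eS  014524.97                   69.00
KCP    iP d    025249.58                        -41.73    324.15   T
EOF
```

For this sample it prints three blocks, headed 2014Sept01 0144, 2014Sept01 0145 and 2014Sept01 0252, separated by blank lines. Redirecting the function's output with >> would append the result to a text file.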
