
Bash: separating rows by a column criterion and appending to a text file using loops (or a better alternative)

I need help separating a long tabulated text into groups of similar rows. The task is to read from a series of disorganized files, format them, and then separate the result by similar rows. From:

MMP,iP,c,002309.82,iS,002311.09,3208,18.87,L,
CNOP,eP,,003544.06,eS,003551.64,,151.00,,
SNP,iP,c,003552.87,iS,003605.55,1924.5,158.07,L,
GUIM,eP,,003554.16,eS,003608.49,,99.00,,
DCP,eP,c,003559.26,,,1214.0,88.89,L,
LLP,eP,c,003606.33,iS,003628.98,389.7,131.23,L,
PAGZ,eP,,003608.48,eS,003631.00,,76.00,,
MSLP,eP,,003618.28,,,,,,
OCLP,eP,,003618.78,eS,003646.82,,,,
TBP,eP,,003640.19,,,282.4,59.35,L,
TBP,eP,,012138.99,,,75.4,11.26,L,
SNP,iP,c,033417.94,iS,033420.44,1023.2,45.51,L,
TBP,eP,,033513.03,,,52.8,12.58,L,
SIPP,eP,d,043457.16,eS,043519.77,1212.00,109.75,L,
LLP,iP,c,054745.48,iS,054753.07,1588.5,65.12,L,
TBP,iP,c,054746.49,eS,054752.88,703.3,32.50,L,
MSLP,eP,,054747.92,eS,054757.96,,63.00,,
KCP,iP,d,082343.73,,,-71.96,180.11,T,
PGP,eP,d,085017.97,eS,085021.92,2428,18.5,L,
PGP,eP,d,085017.97,eS,085021.92,2428.00,18.50,L,
LLP,iP,d,095505.28,iS,095513.89,2940.7,105.86,L,
TBP,eP,c,095506.67,,,704.8,42.51,L,
...

I was able to format everything using awk, read -r line, conditional statements, and printf.

Now I have this formatted text:

TBP    iP c    014449.61   iS  014455.09       2366.20     29.41   L
LLP    iP d    014450.82   iS  014457.36       1414.20     82.30   L
MSLP   eP      014456.98   eS  014509.62                   72.00    
OCLP   eP      014505.60   eS  014524.97                   69.00    
DCP    eP c    014507.15   eS  014530.52        268.80    115.79   L
GUIM   eP      014514.78   eS  014534.25                            
PAGZ   eP      014520.03   eS  014546.38                            
BUKP   eP      014520.40   eS  014546.68                            
CVP    iP d    015016.91   iS  015037.11       3695.00    162.54   L
SIPP   iP c    020817.81                                           T
BBPS   eP      025007.36   eS  025022.74                  310.00    
SGCP   eP      025009.43   eS  025025.00                  258.00    
APYP   eP      025013.77   eS  025033.51                  294.00    
SIPP   eP c    025017.98   eS  025049.24      32739.00    267.36   L
ABRA   eP      025018.32                                  317.00    
CAUP   eP      025027.99                                  317.00    
SMPP   eP      025038.70   eS  025116.93                            
BOLP   eP      025039.33   eS  025116.19                  331.00    
BALP   eP      025042.59   eS  025125.51                  280.00    
PCP    eP c    025046.89   eS  025132.15        543.00    249.71   L
LQP    eP c    025105.80                       1888.00    269.35   L
TGY    eP c    025107.21                       1728.00    183.40   L
GQP    eP c    025109.23   eS  025210.11       1481.10    180.41   L
KCP    iP d    025249.58                        -41.73    324.15   T
LUBP   eP      043452.34   eS  043459.96                   68.00    
PGP    eP c    043456.97   eS  043501.27      42702.00    196.60   L
TGY    eP d    043457.41   eS  043507.61      33835.00    157.27   L
LQP    iP d    043502.88   iS  043517.81       6307.00    168.13   L
...

Now I'm stuck on the row separation. The criterion for separating rows is the first 4 characters of column $4. I tried isolating $4 and then cutting its first 4 characters to use as a basis for comparison:

From:

014449.61
014450.82
014456.98
014505.60
014507.15
014514.78
014520.03
014520.40
015016.91
...

To:

0144
0144
0144
0145
0145
0145
0145
0145
0150
...

But I don't know how to compare these values and group the matching rows together. uniq can do the comparison, but I couldn't use it to append the rows together as I hoped.

The output that I hope for is:

2014Sept01 0144
TBP    iP c    014449.61   iS  014455.09       2366.20     29.41   L
LLP    iP d    014450.82   iS  014457.36       1414.20     82.30   L
MSLP   eP      014456.98   eS  014509.62                   72.00    

2014Sept01 0145
OCLP   eP      014505.60   eS  014524.97                   69.00    
DCP    eP c    014507.15   eS  014530.52        268.80    115.79   L
GUIM   eP      014514.78   eS  014534.25            
PAGZ   eP      014520.03   eS  014546.38            
BUKP   eP      014520.40   eS  014546.68    

2014Sept01 0250
BBPS   eP      025007.36   eS  025022.74                  310.00    
SGCP   eP      025009.43   eS  025025.00                  258.00    
APYP   eP      025013.77   eS  025033.51                  294.00    
SIPP   eP c    025017.98   eS  025049.24      32739.00    267.36   L
ABRA   eP      025018.32                                  317.00    
CAUP   eP      025027.99                                  317.00    
SMPP   eP      025038.70   eS  025116.93            
BOLP   eP      025039.33   eS  025116.19                  331.00    
BALP   eP      025042.59   eS  025125.51                  280.00    
PCP    eP c    025046.89   eS  025132.15        543.00    249.71   L

2014Sept01 0251
LQP    eP c    025105.80                       1888.00    269.35   L
TGY    eP c    025107.21                       1728.00    183.40   L
GQP    eP c    025109.23   eS  025210.11       1481.10    180.41   L

2014Sept01 0252
KCP    iP d    025249.58                        -41.73    324.15   T

2014Sept01 0434
LUBP   eP      043452.34   eS  043459.96                   68.00    
PGP    eP c    043456.97   eS  043501.27      42702.00    196.60   L
TGY    eP d    043457.41   eS  043507.61      33835.00    157.27   L

2014Sept01 0435
LQP    iP d    043502.88   iS  043517.81       6307.00    168.13   L
BOAC   eP      043503.74   eS  043519.98                  139.00    
BUSP   eP      043507.46   eS  043527.58                  146.00    
OTRP   eP      043512.77   eS  043535.66                   97.00    
GQP    eP d    043513.54   iS  043537.15        714.60    117.54   L
PCP    eP c    043514.59   eS  043538.74        441.00    151.61   L
BALP   eP      043521.07   eS  043550.06                  172.00    
ENPP   eP      043521.51   eS  043546.79            
SMPP   eP      043521.96   eS  043551.39                  341.00    
JAP    eP d    043522.67                       2732.70    161.82   L
CUYO   eP      043522.99                                  160.00    
CAUP   eP      043536.77   eS  043616.73                  210.00    
...

The headers can simply be added with echo; the problem is separating the rows that share the same 4-character prefix in $4 and then appending the next set of matching rows. I still don't know how to write a loop for this task.

Any help is appreciated. Thank you very much.

You can use sort to order the input before processing. This puts rows with similar times next to each other, so grouping them only requires detecting where the key changes.

Since you're not sorting on the beginning of the line, you will need -t to set the field separator to a comma and -k to sort on the fourth field.

Something like sort -t, -k4,4 ought to work (untested).
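
As a quick check of that command, here is a minimal sketch run on four rows taken from the question. The file name sample.csv is hypothetical. Because the times are zero-padded to a fixed width, a plain string comparison on field 4 should give chronological order.

```shell
# sample.csv is a hypothetical file name; the rows are copied from the question
cat > sample.csv <<'EOF'
TBP,eP,,012138.99,,,75.4,11.26,L,
MMP,iP,c,002309.82,iS,002311.09,3208,18.87,L,
SNP,iP,c,003552.87,iS,003605.55,1924.5,158.07,L,
CNOP,eP,,003544.06,eS,003551.64,,151.00,,
EOF

# -t, makes the comma the field separator
# -k4,4 limits the sort key to field 4 only (start and end at field 4)
sort -t, -k4,4 sample.csv
```

With these four rows the stations come out in the order MMP, CNOP, SNP, TBP.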

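Here is an awk solution, using the very first (comma-separated) file. It is unclear where the 2014Sept01 in the expected output comes from, so only the 4-character group key is printed as a header: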

$ awk '
BEGIN { FS=","; OFS="\t" }  # tabs as output delimiter, not pixel perfect, column might help
{
    g=substr($4,1,4)        # 4 chars from $4 to grouper
}
g!=p || NR==1 {             # if grouper changes or NR==1
    print (NR==1?"":ORS) g  # print grouper, preceded by an extra newline except on the first record
    p=g                     # store previous grouper to detect change
}
{
    $1=$1                   # rebuild record for OFS 
    print                   # output
}' <(sort -t, -k4,4n file)  # sort file numerically on the 4th column only
# | column                  # try to pipe the above to column? 
0023
MMP     iP      c       002309.82       iS      002311.09       3208    18.87   L

0035
CNOP    eP              003544.06       eS      003551.64               151.00
SNP     iP      c       003552.87       iS      003605.55       1924.5  158.07  L
GUIM    eP              003554.16       eS      003608.49               99.00
DCP     eP      c       003559.26                       1214.0  88.89   L

0036
...
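
The same change-detection idea can also be applied directly to the already formatted text from the question. The following is only a sketch under several assumptions: the file names formatted.txt and grouped.txt are hypothetical, the date string is passed in by hand because the question does not say where it comes from, the rows are already in time order, and the time always starts at character 16 as it does in the sample. The key is taken by character position rather than as $4, because the third column is blank in some rows and whitespace splitting would then put the time in $3.

```shell
# formatted.txt is a hypothetical file name; rows are copied from the question
cat > formatted.txt <<'EOF'
TBP    iP c    014449.61   iS  014455.09       2366.20     29.41   L
LLP    iP d    014450.82   iS  014457.36       1414.20     82.30   L
MSLP   eP      014456.98   eS  014509.62                   72.00
OCLP   eP      014505.60   eS  014524.97                   69.00
CVP    iP d    015016.91   iS  015037.11       3695.00    162.54   L
EOF

# date is an assumed, manually supplied header prefix
awk -v date="2014Sept01" '
{
    g = substr($0, 16, 4)           # HHMM by character position, not by field
    if (NR == 1 || g != p) {        # first row, or the key changed
        if (NR > 1) print ""        # blank line between groups
        print date, g               # header line, e.g. "2014Sept01 0144"
        p = g                       # remember the key for the next row
    }
    print                           # the row itself, unchanged
}' formatted.txt > grouped.txt

cat grouped.txt
```

If the formatted rows are not in time order, they would need to be sorted on the same character range first, for example with sort -k1.16,1.24 and a separator that does not occur in the text.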
