I need help separating a long tabulated text into groups of similar rows. The task is to read a series of disorganized files, format them, and then separate the result by similar rows. From:
MMP,iP,c,002309.82,iS,002311.09,3208,18.87,L,
CNOP,eP,,003544.06,eS,003551.64,,151.00,,
SNP,iP,c,003552.87,iS,003605.55,1924.5,158.07,L,
GUIM,eP,,003554.16,eS,003608.49,,99.00,,
DCP,eP,c,003559.26,,,1214.0,88.89,L,
LLP,eP,c,003606.33,iS,003628.98,389.7,131.23,L,
PAGZ,eP,,003608.48,eS,003631.00,,76.00,,
MSLP,eP,,003618.28,,,,,,
OCLP,eP,,003618.78,eS,003646.82,,,,
TBP,eP,,003640.19,,,282.4,59.35,L,
TBP,eP,,012138.99,,,75.4,11.26,L,
SNP,iP,c,033417.94,iS,033420.44,1023.2,45.51,L,
TBP,eP,,033513.03,,,52.8,12.58,L,
SIPP,eP,d,043457.16,eS,043519.77,1212.00,109.75,L,
LLP,iP,c,054745.48,iS,054753.07,1588.5,65.12,L,
TBP,iP,c,054746.49,eS,054752.88,703.3,32.50,L,
MSLP,eP,,054747.92,eS,054757.96,,63.00,,
KCP,iP,d,082343.73,,,-71.96,180.11,T,
PGP,eP,d,085017.97,eS,085021.92,2428,18.5,L,
PGP,eP,d,085017.97,eS,085021.92,2428.00,18.50,L,
LLP,iP,d,095505.28,iS,095513.89,2940.7,105.86,L,
TBP,eP,c,095506.67,,,704.8,42.51,L,
...
I was able to format everything using awk, read -r line, conditional statements, and printf.
Now I have this formatted text:
TBP iP c 014449.61 iS 014455.09 2366.20 29.41 L
LLP iP d 014450.82 iS 014457.36 1414.20 82.30 L
MSLP eP 014456.98 eS 014509.62 72.00
OCLP eP 014505.60 eS 014524.97 69.00
DCP eP c 014507.15 eS 014530.52 268.80 115.79 L
GUIM eP 014514.78 eS 014534.25
PAGZ eP 014520.03 eS 014546.38
BUKP eP 014520.40 eS 014546.68
CVP iP d 015016.91 iS 015037.11 3695.00 162.54 L
SIPP iP c 020817.81 T
BBPS eP 025007.36 eS 025022.74 310.00
SGCP eP 025009.43 eS 025025.00 258.00
APYP eP 025013.77 eS 025033.51 294.00
SIPP eP c 025017.98 eS 025049.24 32739.00 267.36 L
ABRA eP 025018.32 317.00
CAUP eP 025027.99 317.00
SMPP eP 025038.70 eS 025116.93
BOLP eP 025039.33 eS 025116.19 331.00
BALP eP 025042.59 eS 025125.51 280.00
PCP eP c 025046.89 eS 025132.15 543.00 249.71 L
LQP eP c 025105.80 1888.00 269.35 L
TGY eP c 025107.21 1728.00 183.40 L
GQP eP c 025109.23 eS 025210.11 1481.10 180.41 L
KCP iP d 025249.58 -41.73 324.15 T
LUBP eP 043452.34 eS 043459.96 68.00
PGP eP c 043456.97 eS 043501.27 42702.00 196.60 L
TGY eP d 043457.41 eS 043507.61 33835.00 157.27 L
LQP iP d 043502.88 iS 043517.81 6307.00 168.13 L
...
Now I'm stuck on the row separation. The crucial determiner for separation is the first 4 characters of column $4. I tried isolating $4 and then cutting its first 4 characters as a basis for comparison:
From:
014449.61
014450.82
014456.98
014505.60
014507.15
014514.78
014520.03
014520.40
015016.91
...
To:
0144
0144
0144
0145
0145
0145
0145
0145
0150
...
But I don't know how to compare these keys and append the matching rows together. uniq can be used to compare them, but I can't append the rows together as I hoped.
The output that I hope for is:
2014Sept01 0144
TBP iP c 014449.61 iS 014455.09 2366.20 29.41 L
LLP iP d 014450.82 iS 014457.36 1414.20 82.30 L
MSLP eP 014456.98 eS 014509.62 72.00
2014Sept01 0145
OCLP eP 014505.60 eS 014524.97 69.00
DCP eP c 014507.15 eS 014530.52 268.80 115.79 L
GUIM eP 014514.78 eS 014534.25
PAGZ eP 014520.03 eS 014546.38
BUKP eP 014520.40 eS 014546.68
2014Sept01 0250
BBPS eP 025007.36 eS 025022.74 310.00
SGCP eP 025009.43 eS 025025.00 258.00
APYP eP 025013.77 eS 025033.51 294.00
SIPP eP c 025017.98 eS 025049.24 32739.00 267.36 L
ABRA eP 025018.32 317.00
CAUP eP 025027.99 317.00
SMPP eP 025038.70 eS 025116.93
BOLP eP 025039.33 eS 025116.19 331.00
BALP eP 025042.59 eS 025125.51 280.00
PCP eP c 025046.89 eS 025132.15 543.00 249.71 L
2014Sept01 0251
LQP eP c 025105.80 1888.00 269.35 L
TGY eP c 025107.21 1728.00 183.40 L
GQP eP c 025109.23 eS 025210.11 1481.10 180.41 L
2014Sept01 0252
KCP iP d 025249.58 -41.73 324.15 T
2014Sept01 0434
LUBP eP 043452.34 eS 043459.96 68.00
PGP eP c 043456.97 eS 043501.27 42702.00 196.60 L
TGY eP d 043457.41 eS 043507.61 33835.00 157.27 L
2014Sept01 0435
LQP iP d 043502.88 iS 043517.81 6307.00 168.13 L
BOAC eP 043503.74 eS 043519.98 139.00
BUSP eP 043507.46 eS 043527.58 146.00
OTRP eP 043512.77 eS 043535.66 97.00
GQP eP d 043513.54 iS 043537.15 714.60 117.54 L
PCP eP c 043514.59 eS 043538.74 441.00 151.61 L
BALP eP 043521.07 eS 043550.06 172.00
ENPP eP 043521.51 eS 043546.79
SMPP eP 043521.96 eS 043551.39 341.00
JAP eP d 043522.67 2732.70 161.82 L
CUYO eP 043522.99 160.00
CAUP eP 043536.77 eS 043616.73 210.00
...
The headers can just be added using echo, but the problem is separating the rows by the similar cut value of $4 and then appending the next set of similar rows. I still don't know how to loop over this task.
Any help is appreciated. Thank you very much.
You can use sort to sort the input before processing. This will put similar lines next to each other, and solve your problem.
Since you're not sorting on the beginning of the line, you will need -t to set the field separator to comma, and -k to tell it to sort based on the fourth field.
Something like sort -t, -k4,4 ought to work (untested).
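As a minimal sketch of that suggestion, assuming a handful of rows taken from the question and deliberately shuffled (the variable names sample and sorted are hypothetical):

```shell
# Hypothetical sample: four rows from the question, out of order.
sample='SNP,iP,c,003552.87,iS,003605.55,1924.5,158.07,L,
MMP,iP,c,002309.82,iS,002311.09,3208,18.87,L,
TBP,eP,,012138.99,,,75.4,11.26,L,
CNOP,eP,,003544.06,eS,003551.64,,151.00,,'

# -t, sets the field separator to a comma; -k4,4 restricts the sort key to field 4 only.
# The times are zero-padded HHMMSS.ss strings, so a plain lexical sort orders them correctly.
sorted=$(printf '%s\n' "$sample" | sort -t, -k4,4)
printf '%s\n' "$sorted"
```

After sorting, rows that share the same first four characters in field 4 sit next to each other, so a single pass over the output is enough to detect where one group ends and the next begins.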
In awk. Using the very first file (where did that 2014Sept01 come from?):
$ awk '
BEGIN { FS=","; OFS="\t" }      # tabs as output delimiter; not pixel perfect, column might help
{
    g=substr($4,1,4)            # first 4 chars of $4 are the grouper
}
g!=p || NR==1 {                 # if the grouper changes or this is the first record
    print (NR==1?"":ORS) g      # print the grouper, preceded by an extra newline except on the first record
    p=g                         # store the grouper to detect the next change
}
{
    $1=$1                       # rebuild the record so OFS is applied
    print                       # output
}' <(sort -t, -k4n file)        # sort file on the 4th column
# | column                      # try piping the above to column?
0023
MMP iP c 002309.82 iS 002311.09 3208 18.87 L
0035
CNOP eP 003544.06 eS 003551.64 151.00
SNP iP c 003552.87 iS 003605.55 1924.5 158.07 L
GUIM eP 003554.16 eS 003608.49 99.00
DCP eP c 003559.26 1214.0 88.89 L
0036
...
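If the date label from the desired output is known in advance, it could be passed in as an awk variable. The following is only a sketch under that assumption: the source never says where 2014Sept01 comes from, so here it is simply supplied by the caller with -v, and the input is a few comma-separated rows from the question (the variable names rows, grouped, and date are hypothetical).

```shell
# Hypothetical input: a few comma-separated rows from the question.
rows='MMP,iP,c,002309.82,iS,002311.09,3208,18.87,L,
CNOP,eP,,003544.06,eS,003551.64,,151.00,,
SNP,iP,c,003552.87,iS,003605.55,1924.5,158.07,L,
MSLP,eP,,003618.28,,,,,,'

# Sort on field 4, then print a "<date> <HHMM>" header each time the HHMM prefix changes.
grouped=$(printf '%s\n' "$rows" | sort -t, -k4,4 |
  awk -v date="2014Sept01" '
    BEGIN { FS = ","; OFS = "\t" }
    {
      g = substr($4, 1, 4)            # HHMM prefix of field 4
      if (g != p) print date " " g    # p starts empty, so the first row always gets a header
      p = g                           # remember the prefix to detect the next change
      $1 = $1                         # rebuild the record so OFS is applied
      print
    }')
printf '%s\n' "$grouped"
```

Working from the comma-separated input rather than the space-formatted text keeps the time reliably in field 4, even on rows where earlier fields are empty.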