I have a file like this:
ID=1234 PCharge=2 ext=5 IMSI=1234 Int:123 Charge=3
ID=1234 PCharge=2 ext=5 IMSI=1234 Charge=3
ID=1234 PCharge=2 ext=5 IMSI=1234 Int:4567 Charge=3
Charge=3 ID=1234 PCharge=2 ext=5 IMSI=1234
PCharge=2 ID=1234 Charge=3 ext=5 IMSI=1234
How can I sort this file to something like this?
ID=1234 Charge=3 PCharge=2 ext=5 IMSI=1234
ID=1234 Charge=3 PCharge=2 ext=5 IMSI=1234
ID=1234 Charge=3 PCharge=2 ext=5 IMSI=1234
ID=1234 Charge=3 PCharge=2 ext=5 IMSI=1234
ID=1234 Charge=3 PCharge=2 ext=5 IMSI=1234 Int:123
ID=1234 Charge=3 PCharge=2 ext=5 IMSI=1234 Int:4567
You can use an awk script like this:
script.awk (updated for optional keys and separators ":" and "=")
BEGIN {
    # output order of the keys
    keys[1] = "ID"
    keys[2] = "Charge"
    keys[3] = "PCharge"
    keys[4] = "ext"
    keys[5] = "IMSI"
    keys[6] = "Int"
}
NF > 0 {
    delete values                       # reset each line due to optional keys
    for (f = 1; f <= NF; f++) {
        split($f, kv, "[=:]", seps)     # GNU awk only: split on an RE and store the matched separator in seps
        values[kv[1]] = seps[1] kv[2]   # prepend the individual separator to the value
    }
    tmp = ""
    sep = ""
    for (k = 1; k <= length(keys); k++) {
        if (keys[k] in values) {        # check due to optional keys
            tmp = tmp sep keys[k] values[keys[k]]
            sep = OFS                   # only separate fields that are actually printed
        }
    }
    print tmp
}
Run it like this: awk -f script.awk yourfile
The BEGIN block sets up the output field order. The condition NF > 0 on the second block skips empty lines.
The second block iterates over the key=value fields (awk splits each line into fields at whitespace) and stores the key/value pairs. In the second loop, the stored pairs are appended to tmp in the previously defined order for output.
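The script above relies on GNU awk's 4-argument split(). If only a POSIX awk is available, the same idea can be sketched with match() and substr() instead. This is a minimal sketch, not the answer's own script: reorder_fields is a hypothetical wrapper name, and the key order is simply copied from script.awk.

```shell
# reorder_fields: hypothetical helper; reads the named files, or stdin if none are given.
reorder_fields() {
  awk '
    BEGIN { nkeys = split("ID Charge PCharge ext IMSI Int", keys, " ") }
    NF > 0 {
      for (k in values) delete values[k]       # portable per-line reset
      for (f = 1; f <= NF; f++) {
        if (match($f, /[=:]/)) {               # RSTART is the position of the first separator
          # store the separator together with the value, keyed by the name
          values[substr($f, 1, RSTART - 1)] = substr($f, RSTART)
        }
      }
      out = ""; sep = ""
      for (k = 1; k <= nkeys; k++) {
        if (keys[k] in values) {               # skip optional keys that are absent
          out = out sep keys[k] values[keys[k]]
          sep = OFS
        }
      }
      print out
    }
  ' "$@"
}
```

For the sample input, piping the result through sort (for example reorder_fields yourfile | LC_ALL=C sort) should also place the lines that carry an Int field last, as in the desired output.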
I highly recommend you print every possible field for every line and provide "N/A" values when appropriate as it'll make your data much easier to do further processing on:
$ cat tst.awk
BEGIN { OFS="," }
{
    delete name2val
    numFlds = split($0,flds,/[=:]|[[:space:]]+/,seps)
    for (fldNr=1;fldNr<numFlds;fldNr+=2) {
        name = flds[fldNr]
        if ( !seen[name]++ ) {
            names[++numNames] = name
        }
        name2sep[name] = seps[fldNr]
        name2val[name] = flds[fldNr+1]
    }
}
NR!=FNR {
    for (nameNr=1;nameNr<=numNames;nameNr++) {
        name = names[nameNr]
        sep = name2sep[name]
        val = (name in name2val ? name2val[name] : "N/A")
        printf "%s%s%s%s", name, sep, val, (nameNr<numNames ? OFS: ORS)
    }
}
$ awk -f tst.awk file file
ID=1234,PCharge=2,ext=5,IMSI=1234,Int:123,Charge=3
ID=1234,PCharge=2,ext=5,IMSI=1234,Int:N/A,Charge=3
ID=1234,PCharge=2,ext=5,IMSI=1234,Int:4567,Charge=3
ID=1234,PCharge=2,ext=5,IMSI=1234,Int:N/A,Charge=3
ID=1234,PCharge=2,ext=5,IMSI=1234,Int:N/A,Charge=3
The above uses GNU awk for the 4th arg to split(). You only need that because you use : in Int:value while every other name-value pair uses = as in Charge=value. If you were happy with Int=value or any other consistent separator in the output then you would not need to save the separator and so would not need GNU awk for the 4th arg to split().
Note that the above doesn't require hard-coding of the field names, it just uses whatever names are in your input file by employing a 2-pass approach to read all names from every line on the first pass so it knows what all possible field names are for printing in every line on the second pass.
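For the consistent-separator case mentioned above, a sketch might look like the following. It assumes = is acceptable for every pair in the output; normalize_pairs is a hypothetical wrapper name, and the field separator uses [ \t]+ rather than [[:space:]]+ only so that it also runs on older awks without POSIX character classes.

```shell
# normalize_pairs FILE: hypothetical helper; reads FILE twice (2-pass approach).
normalize_pairs() {
  awk '
    BEGIN { FS = "[=:]|[ \t]+"; OFS = "," }
    {
      for (n in name2val) delete name2val[n]   # reset the values for this line
      for (i = 1; i < NF; i += 2) {            # odd fields are names, even fields are values
        if (!seen[$i]++) names[++numNames] = $i
        name2val[$i] = $(i + 1)
      }
    }
    NR != FNR {                                # 2nd pass: every name is known by now
      for (j = 1; j <= numNames; j++) {
        name = names[j]
        val = (name in name2val ? name2val[name] : "N/A")
        printf "%s=%s%s", name, val, (j < numNames ? OFS : ORS)
      }
    }
  ' "$1" "$1"
}
```

Called as normalize_pairs file, it prints every name with = as the separator, so an Int:123 pair comes out as Int=123 and a missing Int as Int=N/A.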
You should also consider changing the output format to be tabular so you can work with it in Excel, for example:
$ cat tst.awk
BEGIN { FS="[=:]|[[:space:]]+"; OFS="," }
{
    delete name2val
    for (fldNr=1;fldNr<NF;fldNr+=2) {
        name = $fldNr
        if ( !seen[name]++ ) {
            names[++numNames] = name
        }
        name2val[name] = $(fldNr+1)
    }
}
NR!=FNR {
    if (FNR==1) {
        for (nameNr=1;nameNr<=numNames;nameNr++) {
            name = names[nameNr]
            printf "%s%s", name, (nameNr<numNames ? OFS: ORS)
        }
    }
    for (nameNr=1;nameNr<=numNames;nameNr++) {
        name = names[nameNr]
        val = (name in name2val ? name2val[name] : "N/A")
        printf "%s%s", val, (nameNr<numNames ? OFS: ORS)
    }
}
$ awk -f tst.awk file file
ID,PCharge,ext,IMSI,Int,Charge
1234,2,5,1234,123,3
1234,2,5,1234,N/A,3
1234,2,5,1234,4567,3
1234,2,5,1234,N/A,3
1234,2,5,1234,N/A,3
Note that this second script does NOT require GNU awk; it'll work in any POSIX awk, since it doesn't need to save the separator string using the gawk-specific 4th arg to split().