Sort columns of a file using Linux tools

I have a file like

ID=1234 PCharge=2 ext=5 IMSI=1234 Int:123 Charge=3
ID=1234 PCharge=2 ext=5 IMSI=1234 Charge=3
ID=1234 PCharge=2 ext=5 IMSI=1234 Int:4567 Charge=3
Charge=3 ID=1234 PCharge=2 ext=5 IMSI=1234
PCharge=2 ID=1234 Charge=3 ext=5 IMSI=1234

How can I sort this file to something like this?

ID=1234 Charge=3 PCharge=2 ext=5 IMSI=1234
ID=1234 Charge=3 PCharge=2 ext=5 IMSI=1234
ID=1234 Charge=3 PCharge=2 ext=5 IMSI=1234
ID=1234 Charge=3 PCharge=2 ext=5 IMSI=1234
ID=1234 Charge=3 PCharge=2 ext=5 IMSI=1234 Int:123
ID=1234 Charge=3 PCharge=2 ext=5 IMSI=1234 Int:4567

You can use an awk script like this:

script.awk (updated for optional keys and separators ":" and "=")

BEGIN { keys[1] = "ID"
        keys[2] = "Charge"
        keys[3] = "PCharge"
        keys[4] = "ext"
        keys[5] = "IMSI"
        keys[6] = "Int"
      }

NF>0  { delete values # reset each line due to optional keys
        for( f =1 ; f <= NF; f++ ) {
          split( $f, kv, "[=:]",seps) # split using RE separator and store individual separator in seps
          values[ kv[1] ] = seps[1] kv[2] # prepend individual separator to value
        }

        tmp = ""
        sep = ""                      # no separator before the first printed pair
        for(k = 1; k <= length(keys); k++ ) {
            if( keys[k] in values) { # check due to optional keys
                tmp = tmp sep keys[k] values[keys[k]]
                sep = OFS             # every later pair is joined with OFS
            }
        }
        print tmp
      }

Run it like this (GNU awk is required, because the 4-argument form of split() is a gawk extension): awk -f script.awk yourfile

The BEGIN block sets up the order of the output fields. The condition NF > 0 on the second block skips empty lines.

The second block iterates over the key=value fields (awk splits each line into fields at whitespace) and stores the key/value pairs. A second loop then appends the stored pairs to tmp in the order defined in the BEGIN block, and the result is printed.
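For an awk without the 4-argument split(), the same reordering can be sketched with the POSIX match() function. This is only a sketch under assumptions: the shell function name reorder_fields is made up for illustration, the key order is copied from the BEGIN block above, and the first "=" or ":" in each field is taken as its separator.

```shell
# reorder_fields is a hypothetical name; it reads stdin or the files given as arguments.
reorder_fields() {
    awk '
    BEGIN { nkeys = split("ID Charge PCharge ext IMSI Int", keys, " ") }
    NF > 0 {
        # Forget the pairs stored for the previous line (keys are optional).
        for (name in values) delete values[name]
        for (f = 1; f <= NF; f++) {
            # match() sets RSTART to the position of the first "=" or ":".
            if (match($f, /[=:]/)) {
                # Keep the separator with the value, e.g. ":123" or "=5".
                values[substr($f, 1, RSTART - 1)] = substr($f, RSTART)
            }
        }
        line = ""
        sep = ""
        for (k = 1; k <= nkeys; k++) {
            if (keys[k] in values) {
                line = line sep keys[k] values[keys[k]]
                sep = OFS
            }
        }
        print line
    }' "$@"
}
```

For example, printf 'PCharge=2 ID=1234 Charge=3\n' | reorder_fields prints ID=1234 Charge=3 PCharge=2.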

I highly recommend printing every possible field on every line, with "N/A" for any missing value, as that makes the data much easier to process further:

$ cat tst.awk
BEGIN { OFS="," }
{
    delete name2val
    numFlds = split($0,flds,/[=:]|[[:space:]]+/,seps)
    for (fldNr=1;fldNr<numFlds;fldNr+=2) {
        name = flds[fldNr]
        if ( !seen[name]++ ) {
            names[++numNames] = name
        }
        name2sep[name] = seps[fldNr]
        name2val[name] = flds[fldNr+1]
    }
}
NR!=FNR {
    for (nameNr=1;nameNr<=numNames;nameNr++) {
        name = names[nameNr]
        sep  = name2sep[name]
        val  = (name in name2val ? name2val[name] : "N/A")
        printf "%s%s%s%s", name, sep, val, (nameNr<numNames ? OFS: ORS)
    }
}

$ awk -f tst.awk file file
ID=1234,PCharge=2,ext=5,IMSI=1234,Int:123,Charge=3
ID=1234,PCharge=2,ext=5,IMSI=1234,Int:N/A,Charge=3
ID=1234,PCharge=2,ext=5,IMSI=1234,Int:4567,Charge=3
ID=1234,PCharge=2,ext=5,IMSI=1234,Int:N/A,Charge=3
ID=1234,PCharge=2,ext=5,IMSI=1234,Int:N/A,Charge=3

The above uses GNU awk for the 4th arg to split(). You only need that because you use : in Int:value while every other name-value pair uses = as in Charge=value . If you were happy with Int=value or any other consistent separator in the output then you would not need to save the separator and so would not need GNU awk for the 4th arg to split().

Note that the above doesn't hard-code the field names; it uses whatever names are in your input file. It does that with a 2-pass approach, which is why the file is named twice on the command line: the first pass reads every line to learn all possible field names, and the second pass prints every one of those names on every line.
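As a rough illustration of the consistent-separator variant mentioned above, here is a sketch that keeps the 2-pass idea but always prints name=value, so it needs no gawk extension. The function name normalize_pairs is made up, and the sketch assumes the input has no blank lines and no leading or trailing whitespace.

```shell
# normalize_pairs is a hypothetical name; $1 is the input file.
normalize_pairs() {
    # The same file is passed twice: pass 1 collects names, pass 2 prints.
    awk '
    BEGIN { FS = "[=:]|[ \t]+"; OFS = "," }
    NR == FNR {
        # Pass 1: remember each field name in first-seen order.
        for (i = 1; i < NF; i += 2)
            if (!seen[$i]++) names[++numNames] = $i
        next
    }
    {
        # Pass 2: map the names on this line to their values.
        for (name in val) delete val[name]
        for (i = 1; i < NF; i += 2) val[$i] = $(i + 1)
        # Print every known name, with N/A where this line lacks it.
        for (n = 1; n <= numNames; n++) {
            name = names[n]
            printf "%s=%s%s", name, (name in val ? val[name] : "N/A"), (n < numNames ? OFS : ORS)
        }
    }' "$1" "$1"
}
```

Calling normalize_pairs file on the sample input should turn Int:123 into Int=123 and print Int=N/A on the lines that have no Int field.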

You should also consider changing the output format to be tabular so you can work with it in Excel, for example:

$ cat tst.awk
BEGIN { FS="[=:]|[[:space:]]+"; OFS="," }
{
    delete name2val
    for (fldNr=1;fldNr<NF;fldNr+=2) {
        name = $fldNr
        if ( !seen[name]++ ) {
            names[++numNames] = name
        }
        name2val[name] = $(fldNr+1)
    }
}
NR!=FNR {
    if (FNR==1) {
        for (nameNr=1;nameNr<=numNames;nameNr++) {
            name = names[nameNr]
            printf "%s%s", name, (nameNr<numNames ? OFS: ORS)
        }
    }
    for (nameNr=1;nameNr<=numNames;nameNr++) {
        name = names[nameNr]
        val  = (name in name2val ? name2val[name] : "N/A")
        printf "%s%s", val, (nameNr<numNames ? OFS: ORS)
    }
}

$ awk -f tst.awk file file
ID,PCharge,ext,IMSI,Int,Charge
1234,2,5,1234,123,3
1234,2,5,1234,N/A,3
1234,2,5,1234,4567,3
1234,2,5,1234,N/A,3
1234,2,5,1234,N/A,3

Note that the second script does NOT require GNU awk; it'll work in any POSIX awk, as it doesn't need to save the separator string using the gawk-specific 4th arg to split().
