
Extract part of one column and save into another file using awk

I need to extract fields from a CSV file. It has two columns, billing_info and key_id . billing_info is an object that holds multiple data items in curly braces. I need to write billing_info.id_encrypted and key_id to a different file.

input.csv

 billing_info,key_id
 {id: '1B82', id_encrypted: '1Q4AW5bwyU', address: 'san jose', phone: '13423', country: 'v73jyqgE='},bf6-96f751

output.csv

 billing_info.id_encrypted,key_id
 1Q4AW5bwyU,bf6-96f751

How can I use awk to extract the data in the format shown in output.csv?

Making some assumptions:

  • the first line of input lists the column names
  • the brace-delimited element contains an arbitrary number of comma-separated key-value pairs
  • key-value pairs can appear in an arbitrary order
  • values are delimited by single-quotes
  • commas cannot appear inside keys or values
  • single-quotes do not appear anywhere else
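awk -F, '
    BEGIN {
        getline                                   # consume the input header line
        print "billing_info.id_encrypted,key_id"
    }
    {
        val = ""                                  # reset, so a row without the key prints an empty value
        for (i=1; i<NF; i++)
            if ($i ~ /id_encrypted/) {
                split($i, e, /\047/)              # \047 is the single-quote character
                val = e[2]
            }
        print val "," $NF
    }
' <input.csv >output.csv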

Notes:

  • -F, splits input lines into comma-separated fields
  • BEGIN section handles the header
    • we output the header even if there is no input
  • for loop runs through all the fields (except the final one)
  • ($i ~ /id_encrypted/) looks for any that contain the key word
  • split splits that field on single-quotes ( /\047/ )
  • print outputs the value found, and the final field

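Here is a short one-liner using awk. It relies on the keys always appearing in the order shown in the sample input:

awk -F ":" 'NR==1 {print "billing_info.id_encrypted,key_id"; next} {split($3,arr1,","); split($6,arr2,","); gsub(/^ *\047|\047$/,"",arr1[1]); print arr1[1] "," arr2[2]}' input.csv > output.csv

With an explanation:

-F ":" makes : the awk field separator.

NR==1 {...; next} prints the output header in place of the input header line.

split($3,arr1,",") splits the 3rd field on , into an array of 2 elements.

split($6,arr2,",") splits the 6th field on , into an array of 2 elements.

gsub(/^ *\047|\047$/,"",arr1[1]) strips the leading space and the surrounding single quotes ( \047 ) from the value.

Then print the first element of arr1 and the second element of arr2 .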
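To see why the 3rd and 6th fields are the interesting ones, it can help to dump every field awk produces when the sample row is split on colons. This is only a diagnostic sketch; show_fields is a name made up for this example.

```shell
# Hypothetical helper: print each colon-separated field of stdin, one per line.
show_fields() {
    awk -F ":" '{ for (i = 1; i <= NF; i++) printf "$%d=<%s>\n", i, $i }'
}

# The sample data row from input.csv.
line="{id: '1B82', id_encrypted: '1Q4AW5bwyU', address: 'san jose', phone: '13423', country: 'v73jyqgE='},bf6-96f751"
printf '%s\n' "$line" | show_fields
# $1=<{id>
# $2=< '1B82', id_encrypted>
# $3=< '1Q4AW5bwyU', address>
# $4=< 'san jose', phone>
# $5=< '13423', country>
# $6=< 'v73jyqgE='},bf6-96f751>
```

The encrypted id sits before the comma in $3 and key_id sits after the comma in $6, so the approach breaks if the keys are reordered or a value contains a colon.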

I recommend converting your whole input to CSV first; then you can trivially extract whatever fields you like from it using awk, Excel, or any other tool, e.g.:

$ cat tst.awk
BEGIN { FS=OFS="," }
FNR==1 {
    split($0,hdr)
    next
}
{
    fld[1] = fld[2] = $0
    sub(/,[^,]*$/,"",fld[1])
    gsub(/^{|}$/,"",fld[1])
    sub(/.*,/,"",fld[2])
    # print "trace: " hdr[1] "=<" fld[1] ">" | "cat>&2"
    # print "trace: " hdr[2] "=<" fld[2] ">" | "cat>&2"

    numTags = split(fld[1],tags,/'[^']*'/,vals)
    delete tags[numTags--]
    for (tagNr=1; tagNr<=numTags; tagNr++) {
        gsub(/^, *|: *$/,"",tags[tagNr])
        gsub(/^'|'$/,"",vals[tagNr])
        # print "trace:    " tagNr ": <" tags[tagNr] "=" vals[tagNr] ">" | "cat>&2"
    }
}
FNR == 2 {
    for (tagNr=1; tagNr<=numTags; tagNr++) {
        printf "%s.%s%s", hdr[1], tags[tagNr], OFS
    }
    print hdr[2]
}
{
    for (tagNr=1; tagNr<=numTags; tagNr++) {
        printf "\"%s\"%s", vals[tagNr], OFS
    }
    printf "\"%s\"%s", fld[2], ORS
}


$ awk -f tst.awk file
billing_info.id,billing_info.id_encrypted,billing_info.address,billing_info.phone,billing_info.country,key_id
"1B82","1Q4AW5bwyU","san jose","13423","v73jyqgE=","bf6-96f751"

The above uses GNU awk for the 4th arg to split() . Uncomment the print trace lines to see what each step is doing if you like. You don't need to add the double quotes around each output field if you remove or replace any commas within each field (esp. the address).
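Once the input has been flattened like this, picking the two requested columns by header name could look like the sketch below. pick_columns is a hypothetical helper, not part of the answer above. It assumes that no quoted value contains a comma (a real address might) and that every requested name exists in the header.

```shell
# Hypothetical helper: $1 is a comma-separated list of header names to keep;
# the flattened CSV is read from stdin.
pick_columns() {
    awk -F, -v want="$1" '
        BEGIN { OFS = ","; n = split(want, names, ",") }
        NR == 1 {
            for (i = 1; i <= NF; i++) col[$i] = i   # header name -> column number
        }
        {
            out = ""
            for (k = 1; k <= n; k++) {
                v = $(col[names[k]])                # assumes the name is in the header
                gsub(/^"|"$/, "", v)                # strip surrounding double quotes
                out = out (k > 1 ? OFS : "") v
            }
            print out
        }
    '
}

# Sample: the flattened output shown above, reduced to the two wanted columns.
printf '%s\n' \
    'billing_info.id,billing_info.id_encrypted,billing_info.address,billing_info.phone,billing_info.country,key_id' \
    '"1B82","1Q4AW5bwyU","san jose","13423","v73jyqgE=","bf6-96f751"' |
    pick_columns 'billing_info.id_encrypted,key_id'
```

With the real data this would be run as awk -f tst.awk file | pick_columns 'billing_info.id_encrypted,key_id' > output.csv. Looking columns up by name rather than by position keeps the selection working if the keys inside billing_info change order.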
