
Extract part of one column and save into another file using awk

I need to extract fields from a CSV file. There are two columns, billing_info and key_id. billing_info is an object that holds multiple data items in curly braces. I need to extract billing_info.id_encrypted and key_id into a different file.

input.csv

  billing_info,key_id
    {id: '1B82', id_encrypted: '1Q4AW5bwyU', address: 'san jose', phone: '13423', country: 'v73jyqgE='},bf6-96f751

output.csv

 billing_info.id_encrypted,key_id
 1Q4AW5bwyU,bf6-96f751

How can I use awk to extract the data in the format shown in output.csv? Please help.

Making some assumptions:

  • the first line of input lists the column names
  • the brace-delimited element contains an arbitrary number of comma-separated key-value pairs
  • key-value pairs can appear in an arbitrary order
  • values are delimited by single-quotes
  • commas cannot appear inside keys or values
  • single-quotes do not appear anywhere else
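
<csvfile awk -F, '
    BEGIN {
        getline    # consume the header line of the input
        print "billing_info.id_encrypted,key_id"
    }
    {
        split("", e)    # clear e so a row without the key prints an empty value, not a stale one
        for (i=1; i<NF; i++)
            if ($i ~ /id_encrypted/)
                split($i, e, /\047/)    # \047 is the octal code for a single quote
        print e[2] "," $NF
    }
'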

Notes:

  • -F, splits input lines into comma-separated fields
  • BEGIN section handles the header
    • we output the header even if there is no input
  • for loop runs through all the fields (except the final one)
  • ($i ~ /id_encrypted/) looks for any field that contains the key name
  • split splits that field on single-quotes ( /\047/ )
  • print outputs the value found, and the final field
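
To see the split step in isolation, here is a minimal sketch that runs it on a single field, as that field would look after the sample row has been split on commas. The shell variable names field and parts are illustrative only.

```shell
# One comma-separated field from the sample row (note the leading space).
field=" id_encrypted: '1Q4AW5bwyU'"

# Split on single quotes (octal \047); the text between the quotes lands in e[2].
parts=$(printf '%s\n' "$field" | awk '{ n = split($0, e, /\047/); print n ":" e[2] }')

printf '%s\n' "$parts"
```

This prints 3:1Q4AW5bwyU. The split yields three pieces: the key text before the opening quote, the value, and an empty string after the closing quote.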

Here is a fast and elegant solution using awk:

awk -F ":" '{split($3,arr1,",");split($6,arr2,",");print arr1[1] "," arr2[2]}' input.csv > output.csv

With an explanation:

-F ":" makes : the awk field separator.

split($3,arr1,",") splits the 3rd field on , into an array with 2 elements.

split($6,arr2,",") splits the 6th field on , into an array with 2 elements.

Then print the first element of arr1 and the second element of arr2. Note that this relies on the keys always appearing in the same order, that arr1[1] still carries its leading space and single quotes, and that the header line, which contains no colon, comes out as a bare comma.
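
A variant of this colon-splitting approach that also writes the requested header and strips the leading space and quotes from the value might look like the sketch below. It is not the answer's own command. It still assumes the keys appear in exactly the order shown in the question, and it feeds the two sample lines through printf so the example is self-contained.

```shell
# Sample input from the question.
header="billing_info,key_id"
row="{id: '1B82', id_encrypted: '1Q4AW5bwyU', address: 'san jose', phone: '13423', country: 'v73jyqgE='},bf6-96f751"

result=$(printf '%s\n' "$header" "$row" | awk -F ":" '
    NR == 1 { print "billing_info.id_encrypted,key_id"; next }  # replace the header line
    {
        split($3, arr1, ",")                  # arr1[1] holds the quoted id_encrypted value
        split($6, arr2, ",")                  # arr2[2] holds the key_id
        gsub(/^ *\047|\047$/, "", arr1[1])    # strip the leading space and both single quotes
        print arr1[1] "," arr2[2]
    }
')

printf '%s\n' "$result"
```

With a real file, drop the printf and pass input.csv as the last argument to awk, redirecting the output to output.csv as in the original command.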

I recommend you just convert your whole input to CSV and THEN you can trivially extract whatever fields you like from it using awk or Excel or any other tool, e.g.:

$ cat tst.awk
BEGIN { FS=OFS="," }
FNR==1 {
    split($0,hdr)
    next
}
{
    fld[1] = fld[2] = $0
    sub(/,[^,]*$/,"",fld[1])
    gsub(/^{|}$/,"",fld[1])
    sub(/.*,/,"",fld[2])
    # print "trace: " hdr[1] "=<" fld[1] ">" | "cat>&2"
    # print "trace: " hdr[2] "=<" fld[2] ">" | "cat>&2"

    numTags = split(fld[1],tags,/'[^']*'/,vals)
    delete tags[numTags--]
    for (tagNr=1; tagNr<=numTags; tagNr++) {
        gsub(/^, *|: *$/,"",tags[tagNr])
        gsub(/^'|'$/,"",vals[tagNr])
        # print "trace:    " tagNr ": <" tags[tagNr] "=" vals[tagNr] ">" | "cat>&2"
    }
}
FNR == 2 {
    for (tagNr=1; tagNr<=numTags; tagNr++) {
        printf "%s.%s%s", hdr[1], tags[tagNr], OFS
    }
    print hdr[2]
}
{
    for (tagNr=1; tagNr<=numTags; tagNr++) {
        printf "\"%s\"%s", vals[tagNr], OFS
    }
    printf "\"%s\"%s", fld[2], ORS
}


$ awk -f tst.awk file
billing_info.id,billing_info.id_encrypted,billing_info.address,billing_info.phone,billing_info.country,key_id
"1B82","1Q4AW5bwyU","san jose","13423","v73jyqgE=","bf6-96f751"

The above uses GNU awk for the 4th argument to split(). Uncomment the print trace lines to see what each step is doing if you like. You don't need to add the double quotes around each output field if you remove or replace any commas within each field (esp. the address).
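
Once the data is in that flat form, pulling out the two requested columns by header name could be sketched as follows. This assumes no field contains a comma, since it splits naively on commas, and it works on the two converted lines shown above rather than on a real file. The names converted, wanted, col, a and b are illustrative.

```shell
# The converted CSV produced by tst.awk for the sample input.
converted='billing_info.id,billing_info.id_encrypted,billing_info.address,billing_info.phone,billing_info.country,key_id
"1B82","1Q4AW5bwyU","san jose","13423","v73jyqgE=","bf6-96f751"'

wanted=$(printf '%s\n' "$converted" | awk -F, -v OFS=, '
    NR == 1 {
        # Map each header name to its column number.
        for (i = 1; i <= NF; i++) col[$i] = i
        a = col["billing_info.id_encrypted"]
        b = col["key_id"]
        print $a, $b
        next
    }
    {
        x = $a; y = $b
        gsub(/"/, "", x); gsub(/"/, "", y)   # drop the surrounding double quotes
        print x, y
    }
')

printf '%s\n' "$wanted"
```

Looking the columns up by name rather than by position keeps the extraction working if the keys inside billing_info appear in a different order.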
