简体   繁体   中英

AWK FPAT not working as expected for string parsing

I have to parse a very large length string (from stdin). It is basically a.sql file. I have to get data from it. I am working to parse the data so that I can convert it into csv. For this, I am using awk. For my case, A sample snippet (of two records) is as follows:

b="(abc@xyz.com,www.example.com,'field2,(2)'),(dfr@xyz.com,www.example.com,'field0'),"
echo $b|awk 'BEGIN {FPAT = "([^\\)]+)|('\''[^'\'']+'\'')"}{print $1}'

In my regex, I am saying that split on ")" bracket or if single quotes are found then ignore all text until last quote is found. But my output is as follows:

(abc@xyz.com,www.example.com,'field2,(2

I am expecting this output

(abc@xyz.com,www.example.com,'field2,(2)'

Where is the problem in my code. I am search a lot and check awk manual for this but not successful.

My first answer below was wrong, there is an ERE for what you're trying to do:

$ echo "$b" | awk -v FPAT="[(]([^)]|'[^']*')*)" '{for (i=1; i<=NF; i++) print $i}'
(abc@xyz.com,www.example.com,'field2,(2)')
(dfr@xyz.com,www.example.com,'field0')

Original answer, left as a different approach:

You need a 2-pass approach first to replace all ) s within quoted fields with something that can't already exist in the input (eg RS) and then to identify the (...) fields and put the RSs back to ) s before printing them:

$ echo "$b" |
awk -F"'" -v OFS= '
    {
        for (i=2; i<=NF; i+=2) {
            gsub(/)/,RS,$i)
            $i = FS $i FS
        }
        FPAT = "[(][^)]*)"
        $0 = $0
        for (i=1; i<=NF; i++) {
            gsub(RS,")",$i)
            print $i
        }
        FS = FS
    }
'
(abc@xyz.com,www.example.com,'field2,(2)')
(dfr@xyz.com,www.example.com,'field0')

The above is gawk-only due to FPAT (or we could have used gawk patsplit() ), with other awks you'd used a while-match()-substr() loop:

$ echo "$b" |
awk -F"'" -v OFS= '
    {
        for (i=2; i<=NF; i+=2) {
            gsub(/)/,RS,$i)
            $i = FS $i FS
        }
        while ( match($0,/[(][^)]*)/) ) {
            field = substr($0,RSTART,RLENGTH)
            gsub(RS,")",field)
            print field
            $0 = substr($0,RSTART+RLENGTH)
        }
    }
'
(abc@xyz.com,www.example.com,'field2,(2)')
(dfr@xyz.com,www.example.com,'field0')

Written and tested with your shown samples in GNU awk . This could be done in simple field separator setting, try following once, where b is your shell variable which has your shown value in it.

echo "$b" | awk -F'\\),\\(' '{print $1}'
(abc@xyz.com,www.example.com,'field2,(2)'

Explanation: Simply setting field separator of awk program to \\),\\( for your input and printing first field of it.

Similar regex approach as Ed has suggested but I usually prefer using RS and RT over FPAT :

b="(abc@xyz.com,www.example.com,'field2,(2)'),(dfr@xyz.com,www.example.com,'field0'),"
awk -v RS="[(]('[^']*'|[^)])*[)]" 'RT {print RT}' <<< "$b"
(abc@xyz.com,www.example.com,'field2,(2)')
(dfr@xyz.com,www.example.com,'field0')

if you wanna do it close to one pass, maybe try this

{mawk/mawk2/gawk} 'BEGIN { OFS = FS = "\047"; ORS = RS = "\n";

        XFS = "\376\004\377"; 
        XRS = "\051" ORS;
    
    } ! /[\051]/ { print; next; } { for (x=1; x <= NF; x += 2) { 

        gsub(/[\051][^\050]*/, XFS, $(x)); } } gsub(XFS, XRS) || 1'

I did it this way with 2 gsubs just in case it starts sending rows below with unintended consequences. \051 = ")", \050 is the open one.

  • further enhanced it by telling it to instantly print and move on if no close brackets are even found (so nothing to split at all)

It only loops over odd-numbered fields once i split it by the single quote \047 (cuz even numbered ones are precisely the ones within a pair of single quotes you want to avoid chopping at).

As for XFS, just pick any combination of your choice using bytes that are almost impossible to encounter. If you want to play it safe, you can test for whether XFS exists in that row, and use some alternative combo. It's basically to insert a delimiter into the middle of the row that wouldn't run afoul with actual input data. It's not fool proof per se, but the likelihood of running into a combination of UTF16 Byte order mark and ASCII control characters is reasonably low.

(and if you encounter XFS, it's likely you already have corrupted data to begin with, since a 300 series octal must be followed by 200 series ones to be valid UTF8)

This way, i wouldn't need FPAT at all.

*updated with " || 1" towards the end as a safety catch-all, but shouldn't really be needed.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM