AWK FPAT not working as expected for string parsing

I have to parse a very long string (from stdin). It is basically a .sql file, and I need to extract the data from it so that I can convert it to CSV. I am using awk for this. In my case, a sample snippet (two records) looks like this:

b="(abc@xyz.com,www.example.com,'field2,(2)'),(dfr@xyz.com,www.example.com,'field0'),"
echo $b|awk 'BEGIN {FPAT = "([^\\)]+)|('\''[^'\'']+'\'')"}{print $1}'

My regex is meant to split on the ")" bracket, except that when a single quote is found, all text up to the closing quote should be ignored. But my output is as follows:

(abc@xyz.com,www.example.com,'field2,(2

I am expecting this output:

(abc@xyz.com,www.example.com,'field2,(2)'

Where is the problem in my code? I have searched a lot and checked the awk manual, but without success.

My first answer below was wrong; there is an ERE for what you're trying to do:

$ echo "$b" | awk -v FPAT="[(]([^)]|'[^']*')*)" '{for (i=1; i<=NF; i++) print $i}'
(abc@xyz.com,www.example.com,'field2,(2)')
(dfr@xyz.com,www.example.com,'field0')
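For context on why the question's pattern stopped at "(2", here is a minimal sketch that reproduces the first field with plain match(), so it does not need FPAT. The helper name first_piece and the variable q (which only carries a literal single quote into awk) are assumptions made for this illustration. The regex has the same two alternatives as the question. The leftmost match starts at the very first character, where only the [^)]+ branch can apply, and that branch knows nothing about quotes, so it runs up to the first ")" it sees.

```shell
b="(abc@xyz.com,www.example.com,'field2,(2)'),(dfr@xyz.com,www.example.com,'field0'),"

# first_piece: hypothetical helper name. Prints the first leftmost-longest
# match of the pattern from the question, i.e. the text FPAT assigned to $1.
first_piece() {
    awk -v q="'" '
    BEGIN { re = "([^)]+)|(" q "[^" q "]+" q ")" }   # same two alternatives as the question
    match($0, re) { print substr($0, RSTART, RLENGTH) }
    '
}

printf '%s\n' "$b" | first_piece
# prints: (abc@xyz.com,www.example.com,'field2,(2
```

The quoted alternative would only be considered for a match that begins at a quote character, which is why the working pattern above nests the quoted string inside the parenthesised group instead.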

Original answer, left as a different approach:

You need a 2-pass approach: first replace every ) inside quoted fields with something that can't already exist in the input (e.g. RS), then identify the (...) fields and change the RSs back to )s before printing them:

$ echo "$b" |
awk -F"'" -v OFS= '
    {
        for (i=2; i<=NF; i+=2) {
            gsub(/)/,RS,$i)
            $i = FS $i FS
        }
        FPAT = "[(][^)]*)"
        $0 = $0
        for (i=1; i<=NF; i++) {
            gsub(RS,")",$i)
            print $i
        }
        FS = FS
    }
'
(abc@xyz.com,www.example.com,'field2,(2)')
(dfr@xyz.com,www.example.com,'field0')

The above is gawk-only due to FPAT (or we could have used gawk's patsplit()); with other awks you'd use a while-match()-substr() loop:

$ echo "$b" |
awk -F"'" -v OFS= '
    {
        for (i=2; i<=NF; i+=2) {
            gsub(/)/,RS,$i)
            $i = FS $i FS
        }
        while ( match($0,/[(][^)]*)/) ) {
            field = substr($0,RSTART,RLENGTH)
            gsub(RS,")",field)
            print field
            $0 = substr($0,RSTART+RLENGTH)
        }
    }
'
(abc@xyz.com,www.example.com,'field2,(2)')
(dfr@xyz.com,www.example.com,'field0')
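The single-regex idea from the first answer can probably be combined with the match()/substr() loop, which would avoid both FPAT and the second pass. The following is a sketch under assumptions, not a tested replacement for the answers above: split_records is a hypothetical helper name, q is a variable used only to pass a literal single quote into awk, and the bracket expression is written as [^)'] so that the two alternatives cannot overlap.

```shell
b="(abc@xyz.com,www.example.com,'field2,(2)'),(dfr@xyz.com,www.example.com,'field0'),"

# split_records: hypothetical helper name; prints one (...) record per line.
split_records() {
    awk -v q="'" '
    BEGIN {
        # "(" , then any run of: one char that is neither ")" nor a quote,
        # or one whole quoted string; then the closing ")".
        re = "[(]([^)" q "]|" q "[^" q "]*" q ")*[)]"
    }
    {
        s = $0
        while (match(s, re)) {
            print substr(s, RSTART, RLENGTH)
            s = substr(s, RSTART + RLENGTH)   # continue after this record
        }
    }
    '
}

printf '%s\n' "$b" | split_records
# prints:
# (abc@xyz.com,www.example.com,'field2,(2)')
# (dfr@xyz.com,www.example.com,'field0')
```

Working on a copy (s) rather than on $0 leaves the original record available if later rules need it.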

Written and tested with your shown samples in GNU awk. This could be done with a simple field separator setting; try the following, where b is your shell variable holding the shown value.

echo "$b" | awk -F'\\),\\(' '{print $1}'
(abc@xyz.com,www.example.com,'field2,(2)'

Explanation: Simply set the field separator of the awk program to \\),\\( for your input and print the first field.
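To get every record rather than only the first, the same separator idea can be extended with a loop. This is a rough sketch: split_on_sep is a hypothetical helper name, the separator is written as the bracket form [)],[(] (which should match the same literal "),(" text), and the two sub() calls that trim the outer bracket and trailing comma are my own addition. One caveat to keep in mind: a quoted value that itself contains "),(" would be split in the wrong place with this approach.

```shell
b="(abc@xyz.com,www.example.com,'field2,(2)'),(dfr@xyz.com,www.example.com,'field0'),"

# split_on_sep: hypothetical helper name; prints each record without its outer brackets.
split_on_sep() {
    awk -F'[)],[(]' '
    {
        sub(/^[(]/, "")       # drop the opening "(" of the first record
        sub(/[)],?$/, "")     # drop the final ")" and the optional trailing comma
        for (i = 1; i <= NF; i++) print $i
    }
    '
}

printf '%s\n' "$b" | split_on_sep
# prints:
# abc@xyz.com,www.example.com,'field2,(2)'
# dfr@xyz.com,www.example.com,'field0'
```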

A similar regex approach to the one Ed suggested, but I usually prefer using RS and RT over FPAT:

b="(abc@xyz.com,www.example.com,'field2,(2)'),(dfr@xyz.com,www.example.com,'field0'),"
awk -v RS="[(]('[^']*'|[^)])*[)]" 'RT {print RT}' <<< "$b"
(abc@xyz.com,www.example.com,'field2,(2)')
(dfr@xyz.com,www.example.com,'field0')

If you want to do it in close to one pass, maybe try this:

{mawk/mawk2/gawk} 'BEGIN { OFS = FS = "\047"; ORS = RS = "\n";
        XFS = "\376\004\377";
        XRS = "\051" ORS;
    }
    ! /[\051]/ { print; next; }
    {
        for (x = 1; x <= NF; x += 2) {
            gsub(/[\051][^\050]*/, XFS, $(x));
        }
    }
    gsub(XFS, XRS) || 1'

I did it this way with 2 gsubs just in case it starts sending rows below with unintended consequences. \051 is ")" and \050 is the opening one.

  • Further enhanced by telling it to print immediately and move on if no closing bracket is found at all (so there is nothing to split).

It only loops over the odd-numbered fields once I split the row by the single quote \047 (because the even-numbered ones are precisely the ones inside a pair of single quotes, which you want to avoid chopping).

As for XFS, just pick any combination of bytes that is almost impossible to encounter. If you want to play it safe, you can test whether XFS exists in that row and use some alternative combination. The point is to insert a delimiter into the middle of the row that won't run afoul of actual input data. It's not foolproof per se, but the likelihood of running into a combination of a UTF-16 byte order mark and ASCII control characters is reasonably low.
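The "test whether XFS exists in that row" check could look roughly like the sketch below. Everything named here is an assumption for illustration: choose_delim is a hypothetical helper, and the three candidate sequences are plain ASCII control bytes picked for the demo rather than the bytes used in the answer. It only reports which candidate is free on each line; wiring the chosen value into the gsub() calls is left out.

```shell
# choose_delim: hypothetical helper name. For each input line, reports the
# number of the first candidate delimiter that does not occur in that line
# (0 would mean every candidate is already present).
choose_delim() {
    awk '
    BEGIN {
        # Candidate byte sequences are illustrative assumptions only.
        cand[1] = "\001"; cand[2] = "\002\003"; cand[3] = "\004\005\006"; n = 3
    }
    {
        pick = 0
        for (i = 1; i <= n; i++)
            if (index($0, cand[i]) == 0) { pick = i; break }
        print "line " NR ": candidate " pick
    }
    '
}

printf 'plain row\nrow with \001 inside\n' | choose_delim
# prints:
# line 1: candidate 1
# line 2: candidate 2
```

index() does a literal substring search, so the candidates are not interpreted as regular expressions.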

(And if you do encounter XFS, it's likely your data was already corrupted to begin with, since a 300-series octal byte must be followed by 200-series ones to be valid UTF-8.)

This way, I wouldn't need FPAT at all.

*Updated with " || 1" towards the end as a safety catch-all, but it shouldn't really be needed.

