
Using Awk to process a file where each record has different fixed-width fields

I have some data files from a legacy system that I would like to process using Awk. Each file consists of a list of records. There are several different record types, and each record type has a different set of fixed-width fields (there is no field separator character). The first two characters of a record indicate its type, which tells you which fields follow. A file might look something like this:

AAField1Field2LongerField3
BBField4Field5Field6VeryVeryLongField7Field8
CCField99

Using Gawk I can set FIELDWIDTHS, but that applies to the whole file (unless I am missing some way of setting it on a record-by-record basis). Alternatively, I can set FS to "" and process the file one character at a time, but that's a bit cumbersome.

Is there a good way to extract the fields from such a file using Awk?

Edit: Yes, I could use Perl (or something else). I'm still keen to know whether there is a sensible way of doing it with Awk, though.

Hopefully this will lead you in the right direction. Assuming your multi-line records are guaranteed to be terminated by a 'CC' type row, you can pre-process your text file using simple if-then logic. I have presumed you require Field1, Field5 and Field7 on one row; a sample awk script would be:

BEGIN {
        field1=""
        field5=""
        field7=""
}
{
    record_type = substr($0,1,2)
    if (record_type == "AA")
    {
        field1=substr($0,3,6)
    }
    else if (record_type == "BB")
    {
        field5=substr($0,9,6)
        field7=substr($0,21,18)
    }
    else if (record_type == "CC")
    {
        print field1"|"field5"|"field7
    }
}

Create an awk script file called program.awk and put that code into it. Execute the script using:

awk -f program.awk < my_multi_line_file.txt 

You may be able to use two passes:

1step.awk

/^AA/{printf "2 6 6 12"    }
/^BB/{printf "2 6 6 6 18 6"}
/^CC/{printf "2 8"         }
{printf "\n%s\n", $0}

2step.awk

NR%2 == 1 {FIELDWIDTHS=$0}
NR%2 == 0 {print $2}

And then (the second pass needs gawk, since FIELDWIDTHS is a gawk extension):

awk -f 1step.awk sample  | awk -f 2step.awk

You probably need to suppress (or at least ignore) awk's built-in field splitting, and use a program along the lines of:

awk '/^AA/ { manually process record AA out of $0 }
     /^BB/ { manually process record BB out of $0 }
     /^CC/ { manually process record CC out of $0 }' file ...

The manual processing will be a bit fiddly. I suppose you'll need to use the substr function to extract each field by position, so what I've shown as one line per record type will be more like one line per field in each record type, plus the follow-on printing.

I do think you might be better off with Perl and its unpack feature, but awk can handle it too, albeit verbosely.
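
As a rough sketch of that manual processing: the field widths here are an assumption, borrowed from the two-pass answer above (2 6 6 12 for AA, 2 6 6 6 18 6 for BB, 2 8 for CC), and the sample data is the one from the question. It only uses substr, so it should not depend on gawk.

```shell
# Sample records from the question.
sample=$(cat <<'EOF'
AAField1Field2LongerField3
BBField4Field5Field6VeryVeryLongField7Field8
CCField99
EOF
)

# One rule per record type; each field is cut out of $0 by position.
# The widths are assumed, not taken from a real record layout.
out=$(printf '%s\n' "$sample" | awk -v OFS='|' '
/^AA/ { print "AA", substr($0, 3, 6), substr($0, 9, 6), substr($0, 15, 12) }
/^BB/ { print "BB", substr($0, 3, 6), substr($0, 9, 6), substr($0, 15, 6),
              substr($0, 21, 18), substr($0, 39, 6) }
/^CC/ { print "CC", substr($0, 3, 8) }
')
printf '%s\n' "$out"
```

substr simply returns whatever is available when a record is shorter than the requested width, so the short CC line is not a problem.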

Could you use Perl, and then select an unpack template based on the first two characters of the line?

You would be better off using a full-featured scripting language such as Perl or Ruby.

What about two scripts? For example, the first script inserts field separators based on the first characters, and then the second processes the result.
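
A possible shape for the two-script idea, as a sketch only. The width table, the "|" separator and the insert_separators name are all assumptions made for illustration; the separator has to be a character that never occurs in the data.

```shell
sample=$(cat <<'EOF'
AAField1Field2LongerField3
BBField4Field5Field6VeryVeryLongField7Field8
CCField99
EOF
)

# Pass 1 (hypothetical helper): look up the width list for the record type
# and re-emit the record with "|" between the fields.
insert_separators() {
    awk '
    BEGIN {
        widths["AA"] = "2 6 6 12"
        widths["BB"] = "2 6 6 6 18 6"
        widths["CC"] = "2 8"
    }
    {
        n = split(widths[substr($0, 1, 2)], w, " ")
        line = ""
        pos = 1
        for (i = 1; i <= n; i++) {
            sep = (i > 1) ? "|" : ""
            line = line sep substr($0, pos, w[i])
            pos += w[i]
        }
        print line
    }'
}

# Pass 2: ordinary separator-based processing; here, field 5 of each BB record.
out=$(printf '%s\n' "$sample" | insert_separators | awk -F'|' '$1 == "BB" { print $5 }')
printf '%s\n' "$out"
```

A record whose type is not in the table comes out of the first pass as an empty line, so a real script would probably want to report those instead.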

Or, first of all, define a function in your AWK script that splits each line into variables based on the input. I would go this way, for possible reuse.
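
Something like the following might do as a starting point for the function approach. split_fixed is a made-up name, the widths are again assumed, and the rules reproduce the Field1|Field5|Field7 output of the first answer so the two can be compared.

```shell
sample=$(cat <<'EOF'
AAField1Field2LongerField3
BBField4Field5Field6VeryVeryLongField7Field8
CCField99
EOF
)

out=$(printf '%s\n' "$sample" | awk '
# split_fixed (hypothetical helper): cut rec into consecutive fields using the
# space-separated widths, store them in out[1..n], and return n.
# The extra parameters after the gap are local variables.
function split_fixed(rec, widths, out,    w, n, i, pos) {
    n = split(widths, w, " ")
    pos = 1
    for (i = 1; i <= n; i++) {
        out[i] = substr(rec, pos, w[i])
        pos += w[i]
    }
    return n
}
# f[1] is the two-character record type, so Field1 is f[2], and so on.
/^AA/ { split_fixed($0, "2 6 6 12", f);     field1 = f[2] }
/^BB/ { split_fixed($0, "2 6 6 6 18 6", f); field5 = f[3]; field7 = f[5] }
/^CC/ { print field1 "|" field5 "|" field7 }
')
printf '%s\n' "$out"
```

The function does not clear out[] between calls, so entries beyond the returned count may be left over from a longer record; use the return value if the number of fields matters.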
