Custom field separators in awk with regex: problem with leading spaces and commas

I have looked around on Stack Overflow and found some related answers, but none that clearly resolves my doubt. I hope I am not asking a duplicate question.

Let us consider a file:

cat > file << EOF
1 2  3 4, 5,, 6, 7
EOF

I want to use an arbitrary number of commas and spaces as the separator. With awk, setting the field separator with -F"[ ,]+", I obtain the desired result, i.e.:

awk -F"[ ,]+" '{print $1}' file --> 1
awk -F"[ ,]+" '{print $2}' file --> 2
awk -F"[ ,]+" '{print $3}' file --> 3
awk -F"[ ,]+" '{print $4}' file --> 4
awk -F"[ ,]+" '{print $5}' file --> 5
awk -F"[ ,]+" '{print $6}' file --> 6
awk -F"[ ,]+" '{print $7}' file --> 7

However, if I have leading spaces I have a problem. For example:

with one leading space

cat > file << EOF
 1 2  3 4, 5,, 6, 7
EOF

I obtain

awk -F"[ ,]+" '{print $1}' file -->   
awk -F"[ ,]+" '{print $2}' file --> 1
awk -F"[ ,]+" '{print $3}' file --> 2
...

with two leading spaces, the same happens:

cat > file << EOF
  1 2  3 4, 5,, 6, 7
EOF
awk -F"[ ,]+" '{print $1}' file -->
awk -F"[ ,]+" '{print $2}' file --> 1
awk -F"[ ,]+" '{print $3}' file --> 2
...

and so forth.

However, the problem is not only with spaces. For example, with

cat > file << EOF
1,2,3,
EOF

I have

awk -F"," '{print $1}' file -->   1
awk -F"," '{print $2}' file -->   2
awk -F"," '{print $3}' file -->   3
awk -F"," '{print $4}' file -->  

which is what I expect, but with

cat > file << EOF
,1,2,3
EOF

I get

awk -F"," '{print $1}' file -->   
awk -F"," '{print $2}' file -->   1
awk -F"," '{print $3}' file -->   2
awk -F"," '{print $4}' file -->   3

and I do not understand why.

It seems that awk treats leading separators differently. Probably I have misunderstood the regex syntax. Indeed, I do not understand why, with -F" ", the leading spaces are handled properly, whereas with -F"[ ]*" I have the same problem.

In conclusion, these are my questions: why am I obtaining those results for leading spaces or leading commas, and what is the correct syntax to treat any number of commas and spaces as field separators, regardless of whether they are leading or not?

Yes, there is some inconsistency, I guess for convenience.

The default delimiter ignores leading/trailing spaces:

$ echo " 1 2 " | awk '{for(i=1;i<=NF;i++) print i"--> "$i}'
1--> 1
2--> 2

Setting FS to a single space behaves the same:

$ echo " 1 2 " | awk -F' ' '{for(i=1;i<=NF;i++) print i"--> "$i}'
1--> 1
2--> 2

However, the character set [ ] does not:

$ echo " 1 2 " | awk -F'[ ]' '{for(i=1;i<=NF;i++) print i"--> "$i}'
1-->
2--> 1
3--> 2
4-->

since there are 3 delimiters (so awk assumes four fields). With the comma delimiter you won't get the default behavior either, only this last one.
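To make the field counts visible, here is a minimal sketch that prints NF for a given separator and line. The helper name count_fields is hypothetical and only serves this illustration.

```shell
# count_fields FS LINE: print how many fields awk sees (hypothetical helper)
count_fields() {
  printf '%s\n' "$2" | awk -F"$1" '{ print NF }'
}

count_fields ' '   ' 1 2 '    # single space: default-style splitting, prints 2
count_fields '[ ]' ' 1 2 '    # regex: every space separates, prints 4
count_fields ','   ',1,2,3'   # leading comma: empty first field, prints 4
count_fields ','   '1,2,3,'   # trailing comma: empty last field, prints 4
```

Note that the leading and the trailing comma are treated the same way: each one produces an empty field at its end of the record. Only the single-space separator gets the special trimming.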

If you want to mimic the default behavior for both comma and space, you need to write your own handling, something like this:

$ echo "   ,1 2," | 
  awk -F'[ ,]+' 'NF{if($1=="") {for(i=2;i<=NF;i++) $(i-1)=$i; NF--} 
                    if($NF=="") NF--} 
                   {for(i=1;i<=NF;i++) print i"--> "$i}'

1--> 1
2--> 2

Explanation: if the first field is empty, shift all the fields one position to the left and reduce the field count by one; similarly, if the last field is empty, simply reduce the field count. The last statement prints the fields one per line, prefixed by field position number.

Update: to handle empty lines, add the guard NF before attempting to fix the fields.
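A different way to get the same effect, not the field-shifting approach above, is to strip the leading and trailing separators from the record before looking at the fields. This is only a sketch: the helper name split_fields is hypothetical, and it relies on awk re-splitting the record whenever $0 is modified.

```shell
# split_fields LINE: print each field as "index--> value" (hypothetical helper)
split_fields() {
  printf '%s\n' "$1" | awk -F'[ ,]+' '
    {
      # Strip leading, then trailing, runs of separators. Changing $0
      # makes awk split the record again with the current FS.
      sub(/^[ ,]+/, "")
      sub(/[ ,]+$/, "")
      for (i = 1; i <= NF; i++) print i "--> " $i
    }'
}

split_fields '   ,1 2,'
split_fields '  1 2  3 4, 5,, 6, 7'
```

The first call prints the fields 1 and 2, the second prints 1 through 7. An empty line, or one made only of separators, ends up with NF equal to 0 and prints nothing, so no extra guard is needed in this variant.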
