Use RegExp in Awk

I have a CSV file:

<iframe src="https://localhost/get/44bc40f3bc04f65b7a35"></iframe>|name_1
<iframe src="https://localhost/get/5db0d477d707121934ff"></iframe>|name_2
<iframe src="https://localhost/get/6c95bd2b32ed45989c61"></iframe>|name_3
<iframe src="https://localhost/get/0a9c4655800e8a7b9ea2"></iframe>|name_4
<iframe src="https://localhost/get/754953b57a32e2841bda"></iframe>|name_5

and I want to use a regular expression with Awk (or Gawk) to turn it into this:

44bc40f3bc04f65b7a35|name_1
5db0d477d707121934ff|name_2
6c95bd2b32ed45989c61|name_3
0a9c4655800e8a7b9ea2|name_4
754953b57a32e2841bda|name_5

I have a regular expression that works in grep:

$ grep -Po "[A-Za-z]*+\d++\w++" example.txt 
44bc40f3bc04f65b7a35
5db0d477d707121934ff
6c95bd2b32ed45989c61
0a9c4655800e8a7b9ea2
754953b57a32e2841bda

but this regular expression does not work in Awk. I think either I am using it incorrectly in Awk, or this type of regular expression is not supported by Awk.

$ awk -F "|" 'match($1, /[A-Za-z]*+\d++\w++/, a) {print a[0]"|"$2}' example.txt 
db0d477d707121934ff|name_2
bd2b32ed45989c61|name_3
bda|name_5

Plain Awk works fine:

$ awk -F "|" '{print $1"|"$2}' example.txt 
<iframe src="https://localhost/get/44bc40f3bc04f65b7a35"></iframe>|name_1
<iframe src="https://localhost/get/5db0d477d707121934ff"></iframe>|name_2
<iframe src="https://localhost/get/6c95bd2b32ed45989c61"></iframe>|name_3
<iframe src="https://localhost/get/0a9c4655800e8a7b9ea2"></iframe>|name_4
<iframe src="https://localhost/get/754953b57a32e2841bda"></iframe>|name_5

Try:

$ awk -F'<iframe src="https://localhost/get/|"></iframe>' '{print $2 $3}' file
44bc40f3bc04f65b7a35|name_1
5db0d477d707121934ff|name_2
6c95bd2b32ed45989c61|name_3
0a9c4655800e8a7b9ea2|name_4
754953b57a32e2841bda|name_5

This works by setting the input field separator to be either <iframe src="https://localhost/get/ or "></iframe>. Each line starts with the first separator, so the first field is empty, the second field is the ID, and the third field is the rest of the line (|name_1 and so on). The output you want is therefore the second field followed by the third field.
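
To see why $2 and $3 are the right fields, it can help to print every field awk produces with that separator. The following is a minimal sketch; the show_fields function name and the bracketed output format are made up for illustration and are not part of the answer above.

```shell
# show_fields (hypothetical helper): print each field awk sees when FS is the
# two-alternative regex from the answer. Reads lines from standard input.
show_fields() {
  awk -F'<iframe src="https://localhost/get/|"></iframe>' '
    { for (i = 1; i <= NF; i++) printf "$%d=[%s]\n", i, $i }
  '
}

# Try it on the first sample line from the question.
printf '%s\n' '<iframe src="https://localhost/get/44bc40f3bc04f65b7a35"></iframe>|name_1' | show_fields
```

On that sample line this should report an empty first field, the ID in the second field, and |name_1 in the third.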

Alternative Method Using Match

$ awk -F "|" 'match($1, /[[:xdigit:]]{20}/, a) {print a[0]"|"$2}' file
44bc40f3bc04f65b7a35|name_1
5db0d477d707121934ff|name_2
6c95bd2b32ed45989c61|name_3
0a9c4655800e8a7b9ea2|name_4
754953b57a32e2841bda|name_5

Note that awk supports POSIX regular expressions. That means that it recognizes character classes like [[:digit:]] or [[:alnum:]] but not necessarily \d or \w. As a GNU-specific extension, gawk supports \w (but not \d). In your awk attempt, \d was most likely treated as a literal d, which would explain why only the IDs containing a d were printed, and only in part. For portability, stick to the POSIX classes as shown in man 7 regex.

Yet another method

Your regex [A-Za-z]*+\d++\w++ can be translated into awk as follows:

$ awk -F "|" 'match($1, /[[:alpha:]]*[[:digit:]]+[[:alnum:]]+/, a) {print a[0]"|"$2}' file
44bc40f3bc04f65b7a35|name_1
5db0d477d707121934ff|name_2
6c95bd2b32ed45989c61|name_3
0a9c4655800e8a7b9ea2|name_4
754953b57a32e2841bda|name_5

Note that this method requires that the 20-character hex string contains at least one digit.

  1. The difference between the awk and grep invocations in your example is the -P option in grep, which stands for "use Perl regexp". If you replace it with -E, grep will behave just like your awk run. Awk does not support Perl extensions.

  2. Your regexp would be better off fixed; I don't think you need these extra + signs to begin with. If I can assume that you need all letters or digits after get/, then I'd rather write:

    awk -F "|" 'match($1, /get\/([A-Za-z0-9]+)/, a) {print a[1]"|"$2}' example.txt

Here we use [A-Za-z0-9]+ to match one or more lowercase or uppercase letters or digits that come after get/, and use a[1] to print the group matched inside the parentheses instead of the whole match a[0], which includes get/.
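
The three-argument form of match() used here is a gawk extension. As a rough sketch of the same idea for other awks, the standard two-argument match() sets RSTART and RLENGTH, and substr() can then cut the ID out. The extract_ids function name is hypothetical, and the offset of 4 assumes the literal prefix get/.

```shell
# extract_ids (hypothetical helper): print "ID|name" for each input line,
# using only the two-argument match() plus RSTART/RLENGTH.
extract_ids() {
  awk -F'|' '
    match($1, /get\/[A-Za-z0-9]+/) {
      # RSTART points at the "g" of "get/"; skip those 4 characters.
      print substr($1, RSTART + 4, RLENGTH - 4) "|" $2
    }
  '
}

# Two sample lines from the question.
printf '%s\n' \
  '<iframe src="https://localhost/get/44bc40f3bc04f65b7a35"></iframe>|name_1' \
  '<iframe src="https://localhost/get/5db0d477d707121934ff"></iframe>|name_2' | extract_ids
```

Lines without a get/ segment are skipped, since the action only runs when match() succeeds.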

awk '{gsub(/<.*get\//, ""); gsub(/".*e>/, "")} 1' file

44bc40f3bc04f65b7a35|name_1
5db0d477d707121934ff|name_2
6c95bd2b32ed45989c61|name_3
0a9c4655800e8a7b9ea2|name_4
754953b57a32e2841bda|name_5

Here is another solution:

awk -F"[/\">|]" 'BEGIN{ OFS = "|" }{ print $6, $11 }' yourfile

With the -F option at the beginning, the field separator can be any of /, ", > and |. After that you can just print the fields $6 and $11, which contain your desired output, joined by the output field separator.

Output:

44bc40f3bc04f65b7a35|name_1
5db0d477d707121934ff|name_2
6c95bd2b32ed45989c61|name_3
0a9c4655800e8a7b9ea2|name_4
754953b57a32e2841bda|name_5
