awk regular expression: difference between using it or not using variables

Question

I have an awk script that behaves differently depending on where I put a regular expression. I arranged the logic of the program so that it should work the same in both cases, but it does not. The script analyzes logs in which each transaction has a unique ID. The log looks like

timestamp (ID) more info

for example:

2014-10-06 05:24:40,035 INFO  (4aaaaaaaaabbbbbbcccb) [somestring] body with real information and a key string that determines the type of transaction
2014-10-06 05:24:40,035 INFO  (4aaaaaaaaabbbbbbcccb) [somestring] body with other information
2014-10-06 05:24:40,035 INFO  (4aaaaaaaaabbbbbbcccb) [somestring] body with more information
2014-10-06 05:24:40,035 INFO  (4xxbbbbbbbbbbbbbcccb) [somestring] this is a different transaction

What I want is to process all the log lines of a certain type of transaction to see how much time they take. Each transaction is spread across several log lines and is identified by its unique ID. To know whether a transaction is of the type I want, I have to search for a certain string in the first line of that transaction. The log may also contain lines that do not follow the above format.

What I want:

Distinguish whether the current line is part of a transaction (it has an ID).
Check whether the ID is already registered in a cumulative array.
- If not, check whether it is of the desired type: search for a fixed string in the body of the line.
- If it is, register the timestamp, and so on.

And here is the code (note this is a heavily minified version).

This is what I would like to use: first check whether it is a transaction line, and then check whether it is of the correct type.

awk '$4 ~ /^\([:alnum:]/
{
  name=$4;gsub(/[()]|:.*/,"",name);++matched
  if(!(name in arr)){
    if($0 ~ /transaction type/){arr[name]=1;print name}}
}END
{
  print "Found :"length(arr)
  print "Processed "NR
  print matched" lines matched the filter"
}'

That script only finds 868 transactions, and there are more than 14K. If I change the script to look like the code below, it finds all 14K transactions, but only the first line of each of them, so it is not useful for me.

awk '/transaction type/
{
  name=$4;gsub(/[()]|:.*/,"",name);++matched
  if(!(name in arr)){
    arr[name]=1;print name
   }
}END
{
  print "Found :"length(arr)
  print "Processed "NR
  print matched" lines matched the filter"
}'

Thanks in advance.

Edit

Shame on me. There was more than one actual problem in this topic. The main one was that the regex was not matching the proper string. The ID string and the transaction type string were on the same line, that is true, but on those lines the ID looked like (aaaaaabbbbbcccc:  ), with two spaces at the end. That makes awk parse "(aaaaaabbbbbcccc:" and ")" as two different fields. I realized this when I ran

$4 !~ /regex/ { print $4 }

and a lot of valid IDs appeared.
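The field split described above can be reproduced with a one-line input. This is a minimal sketch: the log line is made up for illustration and only mimics the "(id:  )" shape mentioned in the edit, and the variable names `line`, `f4` and `f5` are hypothetical.

```shell
# Hypothetical log line whose ID is followed by ":" and two spaces before ")".
line='2014-10-06 05:24:40,035 INFO  (aaaaaabbbbbcccc:  ) [somestring] body'

# With the default field separator, runs of blanks split fields,
# so the opening part of the ID and the closing ")" land in different fields.
f4=$(printf '%s\n' "$line" | awk '{ print $4 }')
f5=$(printf '%s\n' "$line" | awk '{ print $5 }')

echo "field 4: $f4"
echo "field 5: $f5"
```

Because $4 ends in ":" rather than ")", any pattern on $4 that expects a closing parenthesis will skip such lines.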

The second problem, which appeared after fixing the regular expression, has been addressed by some people here. Having the main regular expression and the first { on separate lines makes awk print each record. I realized that myself, and later the same day I read the solutions here. Amazing.

Thank you very much to everyone. I can only accept one answer, but I learned a lot from all of them.

Answer 1

It's only a syntax error. When you use a POSIX character class you must enclose it in square brackets:

[[:alnum:]]

Otherwise [:alnum:] is seen as a bracket expression that contains the characters : a l m n u
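To make the difference concrete, here is a small sketch that counts matches for both spellings. The four sample strings and the variable names `sample`, `buggy` and `fixed` are made up for illustration, and it assumes an awk that supports POSIX character classes (gawk, for instance, also prints a warning for the first form, which is silenced here).

```shell
# Made-up inputs: the character after "(" is a digit, a letter from
# the word "alnum", a colon, and an uppercase letter.
sample='(4abc)
(mno)
(:x)
(Zed)'

# [:alnum:] alone is a set of the literal characters : a l n u m,
# so only "(mno)" and "(:x)" match.
buggy=$(printf '%s\n' "$sample" | awk '/^\([:alnum:]/ { n++ } END { print n + 0 }' 2>/dev/null)

# [[:alnum:]] is the character class inside a bracket expression,
# so "(4abc)", "(mno)" and "(Zed)" match, while "(:x)" does not.
fixed=$(printf '%s\n' "$sample" | awk '/^\([[:alnum:]]/ { n++ } END { print n + 0 }')

echo "buggy=$buggy fixed=$fixed"
```

An ID that starts with a digit, like the ones in the question, is therefore missed by the single-bracket form.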

Answer 2

Whitespace matters in awk. This:

/foo/ {
    print "found"
}

means print "found" every time "foo" is present, while this:

/foo/
{
    print "found"
}

means print the current record every time "foo" is present, and print "found" for every single input record. So chances are that when you wrote:

$4 ~ /^\([:alnum:]/
{
     ....
}

you actually meant to write:

$4 ~ /^\([:alnum:]/ {
     ....
}
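The effect of the line break is easy to see on a two-line input. This is a minimal sketch with made-up input; the variable names `input`, `same_line` and `split_line` are only for illustration.

```shell
input='foo
bar'

# Pattern and action on the same line: the action runs only for matching records.
same_line=$(printf '%s\n' "$input" | awk '/foo/ { print "found" }')

# Pattern alone on its line: it gets the default action (print the record),
# and the following block has no pattern, so it runs for every record.
split_line=$(printf '%s\n' "$input" | awk '/foo/
{ print "found" }')

echo "same line:"
echo "$same_line"
echo "split:"
echo "$split_line"
```

The first program prints "found" once. The second prints "foo", then "found" for the first record, and "found" again for "bar".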

Also, chances are you meant to use the POSIX character class [[:alnum:]] instead of the set of characters :, a, l, n, u, m described by the bracket expression [:alnum:]:

$4 ~ /^\([[:alnum:]]/ {
     ....
}

If you fix those things and you still need help, provide some testable sample input and expected output so we can help you more.

Answer 3

So, in brief, if I understood properly, you wish to get the IDs of a certain type of transaction.

First assumption: the ID and the transaction type are on the same line. Something like this should do (largely adapted from your code):

awk 'BEGIN {
  matched=0 # more for clarity than really needed
}
/\([[:alnum:]]*\).*transaction type/ { # get lines matching the id and the transaction only
  gsub(/[()]/,"",$4) # strip the () around the id
  ++matched # count matched lines, including repeated IDs
  if (!($4 in arr)) { # as in yours: only if the id is not in the array yet
    arr[$4]=1 # add the found id to the array so it is not included twice
    print $4 # print the found id (only once, as we are inside the if)
  }
}
END { # nothing changed here, printing the stats...
  print "Found :"length(arr)
  print "Processed "NR
  print matched" lines matched the filter"
}'

Output of this from your sample input:

prompt=> awk 'BEGIN { matched=0}; / \([a-z0-9]*\) / { gsub(/[()]/,"",$4); ++matched; if (!($4 in arr)) { arr[$4]=1; print $4 }}; END { print "Found: "length(arr)"\nProcessed "NR"\n"matched" lines matched the filter" }' awkinput
4aaaaaaaaabbbbbbcccb
4xxbbbbbbbbbbbbbcccb
Found: 2
Processed 4
4 lines matched the filter

I've omitted the transaction type from the test as I have no clue what it may be.
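Going one step further toward the goal stated in the question (following every line of a transaction once its first line identifies the type), a rough sketch could look like the following. Everything here is an assumption made for illustration: the sample log, the literal string "transaction type" standing in for the real key string, the plain "(id)" form of field 4 (not the "(id:  )" variant mentioned in the edit), and the array names `seen`, `wanted`, `first`, `last` and `lines`.

```shell
# Made-up log: one wanted transaction (3 lines), one unwanted, one non-matching line.
log='2014-10-06 05:24:40,035 INFO  (4aaaaaaaaabbbbbbcccb) [somestring] body with transaction type marker
2014-10-06 05:24:41,120 INFO  (4aaaaaaaaabbbbbbcccb) [somestring] body with other information
2014-10-06 05:24:42,500 INFO  (4xxbbbbbbbbbbbbbcccb) [somestring] this is a different transaction
not a transaction line
2014-10-06 05:24:43,000 INFO  (4aaaaaaaaabbbbbbcccb) [somestring] body with more information'

summary=$(printf '%s\n' "$log" | awk '
$4 ~ /^\([[:alnum:]]+\)$/ {             # the line carries an ID in field 4
    id = $4
    gsub(/[()]/, "", id)                # strip the parentheses
    if (!(id in seen)) {                # first line seen for this ID
        seen[id] = 1
        if ($0 ~ /transaction type/) {  # assumed key string for the wanted type
            wanted[id] = 1
            first[id] = $2              # time of the first line
        }
    }
    if (id in wanted) {                 # any line of a wanted transaction
        last[id] = $2                   # time of the latest line so far
        lines[id]++
    }
}
END {
    for (id in wanted) print id, first[id], last[id], lines[id]
}')

echo "$summary"
```

For each wanted ID this prints the ID, the time of its first and last line, and the number of lines seen. Computing the elapsed time from the two timestamps is left out.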

3 answers

Answer 1
3 2014-10-06 13:16:36

Answer 2
3 accepted 2014-10-06 13:24:48

Answer 3
2 2014-10-06 13:19:12