

Can awk patterns match multiple lines?

I have some complex log files that I need to write some tools to process. I have been playing with awk, but I am not sure if it is the right tool for this.

My log files are printouts of OSPF protocol decodes, containing a text log of the various protocol packets and their contents, with the various protocol fields identified with their values. I want to process these files and print out only certain lines of the log that pertain to specific packets. Each packet's log entry can consist of a varying number of lines.

awk seems to be able to process a single line that matches a pattern. I can locate the desired packet, but then I need to match patterns in the lines that follow in order to determine whether it is a packet I want to print out.

Another way to look at this is that I want to isolate several lines in the log file and print out those lines that make up the details of a particular packet, based on pattern matches across several lines.

Since awk seems to be line-based, I am not sure it is the best tool to use.

If awk can do this, how is it done? If not, any suggestions on which tool to use?

Awk can easily detect multi-line combinations of patterns, but you need to create what is called a state machine in your code to recognize the sequence.

Consider this input:

how
second half #1
now
first half
second half #2
brown
second half #3
cow

As you have seen, it's easy to recognize a single pattern. Now we can write an awk program that recognizes second half only when it is directly preceded by a first half line. (With a more sophisticated state machine you could detect an arbitrary sequence of patterns.)

/second half/ {
  if(lastLine == "first half") {
    print
  }
}

{ lastLine = $0 }

If you run this you will see:

second half #2

Now, this example is absurdly simple and only barely a state machine. The interesting state lasts only for the duration of the if statement, and the preceding state is implicit, depending on the value of lastLine. In a more canonical state machine you would keep an explicit state variable and transition from state to state depending on both the existing state and the current input. But you may not need that much control mechanism.
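For illustration, here is a sketch of that more explicit style against the same sample input (the state variable name is my own, not from the original):

```shell
printf '%s\n' 'how' 'second half #1' 'now' 'first half' 'second half #2' 'brown' |
awk '
  # state is 1 exactly when the previous line matched /first half/
  state == 1 && /second half/ { print }
  { state = ($0 ~ /first half/) ? 1 : 0 }
'
```

This prints only second half #2, just like the implicit version, but the transition rule is now written out explicitly.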

Awk is really record-based. By default it thinks of a line as a record, but you can alter that with the RS (record separator) variable.
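For instance, when records are already separated by blank lines, setting RS to the empty string puts awk in "paragraph mode" and each blank-line-delimited block becomes one record (the sample data here is invented):

```shell
printf 'name: joe\ntype: dog\n\nname: bill\ntype: cat\n' |
awk 'BEGIN { RS="" } /type: cat/ { print }'
```

This prints the whole two-line "bill" record, since the pattern is tested against the entire record, not a single line. When the records are not blank-line separated, you first need to insert a separator, as shown next.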

One way to approach this would be to do a first pass using sed (you could do this with awk too, if you prefer) to separate the records with a different character, like a form feed. Then you can write your awk script so that it treats each group of lines as a single record.

For example, if this is your data:

animal 0
name: joe
type: dog
animal 1
name: bill
type: cat
animal 2
name: ed
type: cat

To separate the records with form feeds:

$ cat data | sed $'s|^\(animal.*\)|\f\\1|'

Now we'll take that and pass it through awk. Here's an example of conditionally printing a record:

$ cat data | sed $'s|^\(animal.*\)|\f\\1|' | awk '
      BEGIN { RS="\f" }                                     
      /type: cat/ { print }'

outputs:

animal 1
name: bill
type: cat

animal 2
name: ed
type: cat

Edit: as a bonus, here's how to do it with awk-ward ruby (-014 means use form feed (octal code 014) as the record separator):

$ cat data | sed $'s|^\(animal.*\)|\f\\1|' |
      ruby -014 -ne 'print if /type: cat/'

awk is able to process from a start pattern until an end pattern:

/start-pattern/,/end-pattern/ {
  print
}
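A minimal runnable sketch of such a range pattern (the marker lines here are invented):

```shell
printf '%s\n' 'before' 'START' 'middle 1' 'middle 2' 'END' 'after' |
awk '/START/,/END/ { print }'
```

This prints the START line, everything up to the END line, and the END line itself; lines outside the range are skipped.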

I was looking for how to match

 * Implements hook_entity_info_alter().
 */
function file_test_entity_type_alter(&$entity_types) {

so I created

/\* Implements hook_/,/function / {
  print
}

which prints the content I needed. A more complex example is to skip lines and scrub off the non-space parts. Note that awk is a record (line) and word (split by space) tool.

# start,end pattern match using comma
/ \* Implements hook_(.*?)\./,/function (.\S*?)/ {
  # skip the PHP multi-line comment end
  if ($0 ~ / \*\//) next

  # Only print the 3rd word
  if ($0 ~ /Implements/) {
    hook=$3
    # scrub off the opening parenthesis and everything after it.
    sub(/\(.*$/, "", hook)
    print hook
  }

  # Only print the function name, without parentheses
  if ($0 ~ /function/) {
    name=$2

    # scrub off the opening parenthesis and everything after it.
    sub(/\(.*$/, "", name)

    print name
    print ""
  }
}

Hope this helps too.
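As a check, the script above can be condensed into a self-contained pipeline, with the sample input being the fragment quoted earlier (the simplified range patterns are my own, not the exact ones from the script):

```shell
printf '%s\n' \
  ' * Implements hook_entity_info_alter().' \
  ' */' \
  'function file_test_entity_type_alter(&$entity_types) {' |
awk '
/ \* Implements hook_/,/function / {
  if ($0 ~ / \*\//) next            # skip the PHP comment-end line
  if ($0 ~ /Implements/) { hook = $3; sub(/\(.*$/, "", hook); print hook }
  if ($0 ~ /function/)   { name = $2; sub(/\(.*$/, "", name); print name }
}'
```

This prints hook_entity_info_alter followed by file_test_entity_type_alter: the hook name from the comment and the function name, each with the parenthesized part scrubbed off.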

See also ftp://ftp.gnu.org/old-gnu/Manuals/gawk-3.0.3/html_chapter/gawk_toc.html

I do this sort of thing with sendmail logs from time to time.

Given:

Jan 15 22:34:39 mail sm-mta[36383]: r0B8xkuT048547: to=<www@web3>, delay=4+18:34:53, xdelay=00:00:00, mailer=esmtp, pri=21092363, relay=web3., dsn=4.0.0, stat=Deferred: Operation timed out with web3.
Jan 15 22:34:39 mail sm-mta[36383]: r0B8hpoV047895: to=<www@web3>, delay=4+18:49:22, xdelay=00:00:00, mailer=esmtp, pri=21092556, relay=web3., dsn=4.0.0, stat=Deferred: Operation timed out with web3.
Jan 15 22:34:51 mail sm-mta[36719]: r0G3Youh036719: from=<obfTaIX3@nickhearn.com>, size=0, class=0, nrcpts=0, proto=ESMTP, daemon=IPv4, relay=[50.71.152.178]
Jan 15 22:35:04 mail sm-mta[36722]: r0G3Z2SF036722: lost input channel from [190.107.98.82] to IPv4 after rcpt
Jan 15 22:35:04 mail sm-mta[36722]: r0G3Z2SF036722: from=<amahrroc@europe.com>, size=0, class=0, nrcpts=0, proto=SMTP, daemon=IPv4, relay=[190.107.98.82]
Jan 15 22:35:36 mail sm-mta[36728]: r0G3ZXiX036728: lost input channel from ABTS-TN-dynamic-237.104.174.122.airtelbroadband.in [122.174.104.237] (may be forged) to IPv4 after rcpt
Jan 15 22:35:36 mail sm-mta[36728]: r0G3ZXiX036728: from=<clunch.hilarymas@javagame.ru>, size=0, class=0, nrcpts=0, proto=SMTP, daemon=IPv4, relay=ABTS-TN-dynamic-237.104.174.122.airtelbroadband.in [122.174.104.237] (may be forged)

I use a script something like this:

#!/usr/bin/awk -f

BEGIN {
  search=ARGV[1];  # Grab the first command line option
  delete ARGV[1];  # Delete it so it won't be considered a file
}

# First, store every line in an array keyed on the Queue ID.
# Obviously, this only works for smallish log segments, as it uses up memory.
{
  line[$6]=sprintf("%s\n%s", line[$6], $0);
}

# Next, keep a record of Queue IDs with substrings that match our search string.
index($0, search) {
  show[$6];
}

# Finally, once we've processed all input data, walk through our array of "found"
# Queue IDs, and print the corresponding records from the storage array.
END {
  for(qid in show) {
    print line[qid];
  }
}

to get the following output:

$ mqsearch airtel /var/log/maillog

Jan 15 22:35:36 mail sm-mta[36728]: r0G3ZXiX036728: lost input channel from ABTS-TN-dynamic-237.104.174.122.airtelbroadband.in [122.174.104.237] (may be forged) to IPv4 after rcpt
Jan 15 22:35:36 mail sm-mta[36728]: r0G3ZXiX036728: from=<clunch.hilarymas@javagame.ru>, size=0, class=0, nrcpts=0, proto=SMTP, daemon=IPv4, relay=ABTS-TN-dynamic-237.104.174.122.airtelbroadband.in [122.174.104.237] (may be forged)

The idea here is that I'm printing all lines that share a Sendmail queue ID with the string I want to search for. The structure of the code is of course a product of the structure of the log file, so you'll need to customize your solution for the data you're trying to analyze and extract.

`pcregrep -M` works pretty well for this.

From pcregrep(1):

-M, --multiline

Allow patterns to match more than one line. When this option is given, patterns may usefully contain literal newline characters and internal occurrences of ^ and $ characters. The output for a successful match may consist of more than one line, the last of which is the one in which the match ended. If the matched string ends with a newline sequence, the output ends at the end of that line.

When this option is set, the PCRE library is called in "multiline" mode. There is a limit to the number of lines that can be matched, imposed by the way that pcregrep buffers the input file as it scans it. However, pcregrep ensures that at least 8K characters or the rest of the document (whichever is the shorter) are available for forward matching, and similarly the previous 8K characters (or all the previous characters, if fewer than 8K) are guaranteed to be available for lookbehind assertions. This option does not work when input is read line by line (see --line-buffered).
