Parsing Snort Logs with PyParsing

Question

I'm having a problem parsing Snort logs with the pyparsing module.

The problem is splitting the Snort log, which has multiline entries separated by a blank line, and getting pyparsing to parse each entry as a whole chunk, rather than reading it in line by line and expecting the grammar to work with each line (obviously, it does not).

I have tried converting each chunk to a temporary string and stripping out the newlines inside each chunk, but it refuses to process correctly. I may be wholly on the wrong track, but I don't think so (a similar form works perfectly for syslog-type logs, but those are one-line entries and so lend themselves to a basic file iterator with line-by-line processing).

Here's a sample of the log and the code I have so far:

[**] [1:486:4] ICMP Destination Unreachable Communication with Destination Host is Administratively Prohibited [**]
[Classification: Misc activity] [Priority: 3] 
08/03-07:30:02.233350 172.143.241.86 -> 63.44.2.33
ICMP TTL:61 TOS:0xC0 ID:49461 IpLen:20 DgmLen:88
Type:3  Code:10  DESTINATION UNREACHABLE: ADMINISTRATIVELY PROHIBITED HOST FILTERED
** ORIGINAL DATAGRAM DUMP:
63.44.2.33:41235 -> 172.143.241.86:4949
TCP TTL:61 TOS:0x0 ID:36212 IpLen:20 DgmLen:60 DF
Seq: 0xF74E606
(32 more bytes of original packet)
** END OF DUMP

[**] ...more like this [**]

And the updated code:

def snort_parse(logfile):
    header = Suppress("[**] [") + Combine(integer + ":" + integer + ":" + integer) + Suppress("]") + Regex(".*") + Suppress("[**]")
    cls = Optional(Suppress("[Classification:") + Regex(".*") + Suppress("]"))
    pri = Suppress("[Priority:") + integer + Suppress("]")
    date = integer + "/" + integer + "-" + integer + ":" + integer + "." + Suppress(integer)
    src_ip = ip_addr + Suppress("->")
    dest_ip = ip_addr
    extra = Regex(".*")

    bnf = header + cls + pri + date + src_ip + dest_ip + extra

    def logreader(logfile):
        chunk = []
        with open(logfile) as snort_logfile:
            for line in snort_logfile:
                if line !='\n':
                    line = line[:-1]
                    chunk.append(line)
                    continue
                else:
                    print chunk
                    yield " ".join(chunk)
                    chunk = []

    string_to_parse = "".join(logreader(logfile).next())
    fields = bnf.parseString(string_to_parse)
    print fields

Any help, pointers, RTFMs, You're Doing It Wrongs, etc., greatly appreciated.

Answer 1

import pyparsing as pyp
import itertools

integer = pyp.Word(pyp.nums)
ip_addr = pyp.Combine(integer+'.'+integer+'.'+integer+'.'+integer)

def snort_parse(logfile):
    header = (pyp.Suppress("[**] [")
              + pyp.Combine(integer + ":" + integer + ":" + integer)
              + pyp.Suppress(pyp.SkipTo("[**]", include = True)))
    cls = (
        pyp.Suppress(pyp.Optional(pyp.Literal("[Classification:")))
        + pyp.Regex("[^]]*") + pyp.Suppress(']'))

    pri = pyp.Suppress("[Priority:") + integer + pyp.Suppress("]")
    date = pyp.Combine(
        integer+"/"+integer+'-'+integer+':'+integer+':'+integer+'.'+integer)
    src_ip = ip_addr + pyp.Suppress("->")
    dest_ip = ip_addr

    bnf = header+cls+pri+date+src_ip+dest_ip

    with open(logfile) as snort_logfile:
        for has_content, grp in itertools.groupby(
                snort_logfile, key = lambda x: bool(x.strip())):
            if has_content:
                tmpStr = ''.join(grp)
                fields = bnf.searchString(tmpStr)
                print(fields)

snort_parse('snort_file')

This yields:

[['1:486:4', 'Misc activity', '3', '08/03-07:30:02.233350', '172.143.241.86', '63.44.2.33']]
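The chunking step can be tried on its own, without pyparsing or a log file. This is a minimal sketch, assuming the log text is already in memory as a string; `split_entries` and `sample` are names invented for the illustration, and the sample entries are abbreviated from the log above.

```python
import itertools


def split_entries(lines):
    """Yield each run of non-blank lines as one multi-line string."""
    groups = itertools.groupby(lines, key=lambda line: bool(line.strip()))
    for has_content, group in groups:
        if has_content:
            yield "".join(group)


# Two abbreviated entries separated by a blank line (illustrative data).
sample = (
    "[**] [1:486:4] ICMP Destination Unreachable [**]\n"
    "[Classification: Misc activity] [Priority: 3]\n"
    "\n"
    "[**] [1:486:4] ICMP Destination Unreachable [**]\n"
    "[Classification: Misc activity] [Priority: 3]\n"
)

entries = list(split_entries(sample.splitlines(keepends=True)))
print(len(entries))
```

Because groupby only compares neighbouring lines, several blank lines in a row collapse into one separator, and a final entry with no trailing blank line is still returned.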

Answer 2

You have some regex unlearning to do, but hopefully this won't be too painful. The biggest culprit in your thinking is the use of this construct:

some_stuff + Regex(".*") + 
                 Suppress(string_representing_where_you_want_the_regex_to_stop)

Each subparser within a pyparsing parser is pretty much standalone, and works sequentially through the incoming text. So the Regex term has no way to look ahead to the next expression to see where the '*' repetition should stop. In other words, the expression Regex(".*") is going to read until the end of the line, closing "[**]" included, since that is where ".*" stops unless "." is allowed to match newlines (re.DOTALL).
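The greedy behaviour is easy to see with the standard re module, which pyparsing's Regex is built on. This is only a sketch with plain re, not pyparsing itself; `text` is an abbreviated, illustrative piece of the log entry.

```python
import re

text = "ICMP Destination Unreachable [**]\n[Classification: Misc activity]"

# Without re.DOTALL, "." does not match a newline, so ".*" stops at end of line.
greedy = re.match(r".*", text)
print(repr(greedy.group()))

# The closing "[**]" was consumed as well, so an expression that expects
# "[**]" next finds only the following line.
rest = text[greedy.end():]
print(rest.lstrip().startswith("[**]"))
```

The first print shows the whole first line, marker included; the second shows that nothing is left for a trailing Suppress("[**]") to match.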

In pyparsing, this kind of lookahead is done with SkipTo. Here is how your header line is written:

header = (Suppress("[**] [") + Combine(integer + ":" + integer + ":" + integer) +
             Suppress("]") + Regex(".*") + Suppress("[**]"))

Your ".*" problem gets resolved by changing it to:

header = (Suppress("[**] [") + Combine(integer + ":" + integer + ":" + integer) +
             Suppress("]") + SkipTo("[**]") + Suppress("[**]"))
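As a quick check, the SkipTo version of header can be run against the first line of an entry. This is a sketch that assumes pyparsing is installed and that integer is Word(nums), as defined in Answer 1; `first_line` is an abbreviated line made up for the example.

```python
from pyparsing import Combine, SkipTo, Suppress, Word, nums

integer = Word(nums)  # same definition as in Answer 1

header = (
    Suppress("[**] [")
    + Combine(integer + ":" + integer + ":" + integer)
    + Suppress("]")
    + SkipTo("[**]")      # read the message up to the closing marker
    + Suppress("[**]")
)

# Abbreviated first line of a log entry (illustrative data).
first_line = "[**] [1:486:4] ICMP Destination Unreachable [**]"
sid, message = header.parseString(first_line)

# SkipTo keeps whitespace that precedes the marker, hence the strip().
print(sid, message.strip())
```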

The same applies to cls.

One last bug: your definition of date is short by one ':' + integer:

date = (integer + "/" + integer + "-" + integer + ":" + integer + "." +
          Suppress(integer))

It should be:

date = (integer + "/" + integer + "-" + integer + ":" + integer + ":" +
          integer + "." + Suppress(integer))
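The difference between the two date definitions shows up as soon as they meet the timestamp from the sample log. A minimal sketch, assuming pyparsing is installed; `short_date`, `fixed_date`, and `short_ok` are names chosen for this illustration.

```python
from pyparsing import ParseException, Suppress, Word, nums

integer = Word(nums)
stamp = "08/03-07:30:02.233350"  # timestamp taken from the sample log

# Original definition: hours and minutes only, then "." is expected.
short_date = (integer + "/" + integer + "-" + integer + ":" + integer
              + "." + Suppress(integer))

# Corrected definition: one more ':' + integer for the seconds.
fixed_date = (integer + "/" + integer + "-" + integer + ":" + integer
              + ":" + integer + "." + Suppress(integer))

try:
    short_date.parseString(stamp)
    short_ok = True
except ParseException:
    short_ok = False  # the parser found ":" where it expected "."

fields = fixed_date.parseString(stamp).asList()
print(short_ok, fields)
```

The corrected expression returns the pieces as separate tokens; wrapping it in Combine, as Answer 1 does, would give a single timestamp string instead.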

I think those changes will be sufficient to start parsing your log data.

Here are some other style suggestions:

You have a lot of repeated Suppress("]") expressions. I've started defining all my suppressible punctuation in a single compact, easy-to-maintain statement like this:

LBRACK,RBRACK,LBRACE,RBRACE = map(Suppress,"[]{}")

(Expand this to add whatever other punctuation characters you like.) Now I can refer to these characters by their symbolic names, and I find the resulting code a little easier to read.

You start off header with header = Suppress("[**] [") + ... . I never like seeing spaces embedded in literals this way, as it bypasses some of the parsing robustness pyparsing gives you with its automatic whitespace skipping. If for some reason the space between "[**]" and "[" were changed to 2 or 3 spaces, or a tab, your suppressed literal would fail. Combine this with the previous suggestion, and header would begin with:

header = Suppress("[**]") + LBRACK + ...

I know this is generated text, so variation in this format is unlikely, but it plays better to pyparsing's strengths.
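The whitespace point can be demonstrated with a line whose spacing has been altered. This is a sketch assuming pyparsing is installed; `rigid`, `flexible`, and the `spaced` input are made up for the comparison and do not come from a real Snort log.

```python
from pyparsing import ParseException, Suppress, Word, nums

LBRACK, RBRACK = map(Suppress, "[]")
integer = Word(nums)

# Literal with an embedded space: matches exactly one space, nothing else.
rigid = Suppress("[**] [") + integer
# Separate expressions: pyparsing skips any whitespace between them.
flexible = Suppress("[**]") + LBRACK + integer

spaced = "[**]   [1:486:4]"  # three spaces instead of one (made-up variation)

try:
    rigid.parseString(spaced)
    rigid_ok = True
except ParseException:
    rigid_ok = False

flexible_tokens = flexible.parseString(spaced).asList()
print(rigid_ok, flexible_tokens)
```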

Once you have your fields parsed out, start assigning results names to different elements within your parser. This will make it a lot easier to get the data out afterward. For instance, change cls to:

cls = Optional(Suppress("[Classification:") + 
             SkipTo(RBRACK)("classification") + RBRACK)

This will allow you to access the classification data using fields.classification.
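Here is a small sketch of results names in use, assuming pyparsing is installed. The "classification" name comes from the suggestion above; the "priority" name on pri is an extra added only for this illustration.

```python
from pyparsing import Optional, SkipTo, Suppress, Word, nums

LBRACK, RBRACK = map(Suppress, "[]")
integer = Word(nums)

cls = Optional(Suppress("[Classification:")
               + SkipTo(RBRACK)("classification") + RBRACK)
# "priority" is a hypothetical results name, not part of the original code.
pri = Suppress("[Priority:") + integer("priority") + RBRACK

# Second line of the sample log entry.
fields = (cls + pri).parseString("[Classification: Misc activity] [Priority: 3]")

# Named results are available as attributes (or via fields["classification"]).
print(fields.classification, fields.priority)
```

Named access keeps the calling code working even if more tokens are later added to the grammar and the positional indexes shift.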

Answer 3

Well, I don't know Snort or pyparsing, so apologies in advance if I say something stupid. I'm unclear as to whether the problem is pyparsing being unable to handle the entries, or you being unable to send them to pyparsing in the right format. If the latter, why not do something like this?

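def logreader( path_to_file ):
    chunk = [ ]
    with open( path_to_file ) as theFile:
        for line in theFile:
            if line.strip( ):
                # Non-blank line: still inside the current entry.
                chunk.append( line )
            elif chunk:
                # Blank line: the entry is complete.
                yield "".join( chunk )
                chunk = [ ]
    if chunk:
        # The last entry may not be followed by a blank line.
        yield "".join( chunk )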

Of course, if you need to modify each chunk before sending it to pyparsing, you can do so before yielding it.

Answer 1
Score 13, accepted, 2010-08-04 15:45:56

Answer 2
Score 4, 2010-08-04 15:53:30

Answer 3
Score 0, 2010-08-04 14:36:13
