
Regular expression to parse log file

I have a SonicWall syslog file with this format:

<134>id=firewall sn=C0EAE470F7D0 time="2014-08-13 04:31:27" fw=10.2.3.4 pri=6 c=1024 m=537 msg="Connection Closed" n=301541 src=172.16.1.43:50581:X0 dst=172.16.1.1:192:X0 proto=udp/192 sent=46

I am trying to create a regular expression that returns a list of (key, value) tuples split on the = sign. A value that contains spaces is wrapped in double quotes. I don't care whether the returned values keep their quotes, as long as the entire value, spaces included, is returned. For example, the time key should contain both the date and the time. Desired output:

("<134>id","firewall"), ("sn","C0EAE470F7D0"), ("time", '"2014-08-13 04:31:27"')
("fw","10.2.3.4"), ("pri","6"), ... ("msg", '"Connection Closed"'), ("n", "301541")
("src","172.16.1.43:50581:X0"), ... ("sent", "46")

This is what I have so far, but it fails when it encounters a field with double quotes. Also, the last field ("sent" in this case) is not returned. I have experimented with the regular expression for a few hours, trying various combinations, but I can't quite get it to work. Any help would be greatly appreciated.

import re
fname = "syslog.log"
with open(fname) as fp: lines = fp.read().splitlines()
q = re.compile('(.*?)=(.*?)[\s"]',re.S|re.M)
for line in lines:
    print(line)
    key_val = q.findall(line)
    print(key_val)

This is what this code returns:

[('<134>id', 'firewall'), ('sn', 'C0EAE470F7D0'), ('time', ''), 
('2014-08-13 04:31:27" fw', '10.2.3.4'), ('pri', '6'),
('c', '1024'), ('m', '537'), ('msg', ''), 
('Connection Closed" n', '301541'), ('src', '172.16.1.43:50581:X0'), 
('dst', '172.16.1.1:192:X0'), ('proto', 'udp/192')]

If this can't be accomplished with a regular expression, what would be the best way to achieve the desired result in Python 3.3?

http://regex101.com/r/wS5lX2/3

(.+?)=("[^"]*"|\S*)\s*

What it does

  1. Lazily match everything up to the first equals sign (the key)
  2. Match either
    1. Quotes around a string that does not contain quotes or
    2. A string without spaces
  3. Match any trailing whitespace
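As a quick check, the pattern can be run against the sample line from the question with re.findall. This is only a sketch: the names line, pattern and pairs are placeholders, and it assumes every token in the line has the key=value shape.

```python
import re

# Sample line copied from the question.
line = ('<134>id=firewall sn=C0EAE470F7D0 time="2014-08-13 04:31:27" '
        'fw=10.2.3.4 pri=6 c=1024 m=537 msg="Connection Closed" n=301541 '
        'src=172.16.1.43:50581:X0 dst=172.16.1.1:192:X0 proto=udp/192 sent=46')

# Group 1 is the key; group 2 is the value, quotes included when present.
pattern = re.compile(r'(.+?)=("[^"]*"|\S*)\s*')

pairs = pattern.findall(line)
print(pairs[:3])
# [('<134>id', 'firewall'), ('sn', 'C0EAE470F7D0'), ('time', '"2014-08-13 04:31:27"')]
```

Because the key part is a lazy .+?, a stray token without an equals sign would be absorbed into the next key, so this is best suited to lines that are known to be well formed.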

If you additionally want to remove the quotes around the match, you can use this instead

http://regex101.com/r/wS5lX2/4

(.+?)=(?:"(.*?)(?<!\\)"|(\S*))\s*

It removes the double quotes from the matched value. The key will be group 1 and the value will be group 2 (quoted) or group 3 (unquoted). Additionally, it allows backslash-escaped quotes inside a quoted value.
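Since re.findall returns all three groups for every match, the two value groups have to be merged afterwards. A minimal sketch of one way to do that follows; the helper name parse is hypothetical and not part of the original answer.

```python
import re

# Groups: 1 = key, 2 = quoted value (quotes stripped), 3 = unquoted value.
pattern = re.compile(r'(.+?)=(?:"(.*?)(?<!\\)"|(\S*))\s*')

def parse(line):
    """Return (key, value) pairs; the unused value group is an empty string."""
    return [(key, quoted or bare) for key, quoted, bare in pattern.findall(line)]

print(parse('time="2014-08-13 04:31:27" fw=10.2.3.4'))
# [('time', '2014-08-13 04:31:27'), ('fw', '10.2.3.4')]

# A backslash-escaped quote does not end the value; the backslash itself
# is kept in the result and would need to be removed separately.
print(parse(r'msg="say \"hi\"" n=1'))
```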

This will be easier if you grab all the tokens first, then split them.

import re
txt = """<134>id=firewall sn=C0EAE470F7D0 time="2014-08-13 04:31:27" fw=10.2.3.4 pri=6 c=1024 m=537 msg="Connection Closed" n=301541 src=172.16.1.43:50581:X0 dst=172.16.1.1:192:X0 proto=udp/192 sent=46"""

tokens = re.findall(r'''\S+=(?:"[^"]+?"|'[^']+?'|\S+)''', txt)

# split on the first '=' only, so a value that contains '=' stays intact
end_result = list(map(lambda x: tuple(x.split('=', 1)), tokens))
# output:
[('<134>id', 'firewall'), ('sn', 'C0EAE470F7D0'), ('time', '"2014-08-13 04:31:27"'), ('fw', '10.2.3.4'), ('pri', '6'), ('c', '1024'), ('m', '537'), ('msg', '"Connection Closed"'), ('n', '301541'), ('src', '172.16.1.43:50581:X0'), ('dst', '172.16.1.1:192:X0'), ('proto', 'udp/192'), ('sent', '46')]

explained:

re.compile(r'''
    \S+               # match the key: one or more non-space characters
    =                 # match a literal equals
    (?:
        "[^"]+?"      # match a double-quoted value and its contents
        |             # OR
        '[^']+?'      # match a single-quoted value and its contents
        |             # OR
        \S+           # match an unquoted value: one or more non-space characters
    )
''', re.X)

This gives the output you want (and also strips quotes):

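import re

line = """
<134>id=firewall sn=C0EAE470F7D0 time="2014-08-13 04:31:27" fw=10.2.3.4 pri=6 c=1024 m=537 msg="Connection Closed" n=301541 src=172.16.1.43:50581:X0 dst=172.16.1.1:192:X0 proto=udp/192 sent=46
"""

# (?x) enables verbose mode, so whitespace inside the pattern is ignored.
# \w+ matches only word characters, so the first key is "id" rather than "<134>id".
rx = r"""(?x)
    (\w+) =
    (?:
        " ([^"]*) "
        |
        (\S+)
    )
"""

# The value is in the quoted group (a) or the unquoted group (b); the other is empty.
parsed = [(key, a or b) for key, a, b in re.findall(rx, line)]
print(parsed)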

I personally find dictionaries more suitable for this kind of data, that is:

parsed = {key: a or b for key, a, b in re.findall(rx, line)}

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. -- jwz.

Whenever you're having problems writing a regexp, the first thing you should do is ask whether you really need a regexp. After all, if you can't figure out how to write it without using a graphical regexp explorer or getting someone else to help you, are you going to be able to debug it, expand it, or even read it in a couple months?

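Your quoting rules look similar to the default CSV quoting rules, which suggests letting the csv module do the hard work and then just splitting the key-value pairs, which is the easy part. One caveat: the csv module only treats a quote character as special at the start of a field, so a value such as time="2014-08-13 04:31:27", where the quote follows key=, is likely to be split at the inner space. Check the output against your own data: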

import csv
fname = "syslog.log"
with open(fname) as fp: 
    reader = csv.reader(fp, delimiter=' ')
    for row in reader:
        key_val = [col.split('=', 1) for col in row]
        print(key_val)
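If the csv reader does split quoted values that follow key= (it only recognizes a quote at the start of a field), the standard library's shlex.split is one possible regex-free alternative, because it honours quotes anywhere inside a token. This is a swapped-in technique, not part of the answer above, and the sample line here is a shortened version of the one in the question.

```python
import shlex

# Shortened sample line, based on the one in the question.
line = ('<134>id=firewall sn=C0EAE470F7D0 time="2014-08-13 04:31:27" '
        'fw=10.2.3.4 msg="Connection Closed" sent=46')

# shlex.split tokenizes on whitespace, keeps quoted runs together,
# and strips the quotes themselves.
tokens = shlex.split(line)

# Split each token on the first '=' only, so '=' inside a value survives.
key_val = [tuple(tok.split('=', 1)) for tok in tokens]
print(key_val)
```

Note that shlex.split raises ValueError on an unbalanced quote, for example an apostrophe in an unquoted value, so real log lines may need a try/except around the call.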
