简体   繁体   中英

Python multiline regex extract text after every timestamp

I have a log file I am trying to parse. Each log has a timestamp at the start of a line, in the format YYY-MMM-DD HH:MM:SS.SSSSSS -0400: with timezone information being optional (which I can ignore for now). I can match on these just fine, but not the log after the timestamp, which may start immediately on the same line, or the next line, and may be multiple lines long. I am decent with regex, but I rarely do multiline regex.

Here is what I have tried that seems to be the closet

# finds the first timestamp, everything to end of file is the log
re.findall('\n(^\d{4}-[A-Za-z]{3}-\d{2} \d{2}:\d{2}:\d{2}.\d{6}).*?:(.*)', log, re.DOTALL)

# finds every timestamp, all logs are empty (obviously too un-greedy)
re.findall('\n(^\d{4}-[A-Za-z]{3}-\d{2} \d{2}:\d{2}:\d{2}.\d{6}).*?:(.*?)', log, re.DOTALL)

I just don't know how to go about getting the log that follows, but stop if seeing another timestamp.

You may split the contents with a newline char that is followed with a datetime pattern:

re.split(r'\n(?=\d{4}-[A-Za-z]{3}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6})', log)

Details

  • \n - a newline
  • (?=\d{4}-[A-Za-z]{3}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) - a positive lookahead that requires the following pattern to appear immediately to the right of the current location:
    • \d{4}- - four digits and a hyphen
    • [A-Za-z]{3}- - three letters and a hyphen
    • \d{2} - two digits
    • - a sapce
    • \d{2}: - two digits and :
    • \d{2}:\d{2} - - two digits, : , two digits
    • \. - a dot (note it must be escaped)
    • \d{6} - six digits

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM