Regex - matching all text between two strings

I'm currently parsing a log file that has the following structure:

1) a timestamp, preceded by a # character and followed by \n

2) an arbitrary number of events that happened after that timestamp, each followed by \n

3) repeat...

Here is an example:

#100
04!
03!
02!
#1299
0L
0K
0J
0E
#1335
06!
0X#
0[#
b1010 Z$
b1x [$
...

Please forgive the seemingly cryptic values; they are encodings representing certain "events".

Note: Event encodings may also use the # character.

What I am trying to do is count the number of events that happen at a certain time.

In other words, at time 100, 3 events happened.

I am trying to match all text between two timestamps and count the events by counting the newlines enclosed in the matched text.

I'm using Python's regex engine with the following expression:

pattern = re.compile('(#[0-9]{2,}.*)(?!#[0-9]+)')

Note: The {2,} is because I want timestamps with at least two digits.

The idea is to match a timestamp, then keep matching any other characters until hitting another timestamp, which ends the match.

What this returns is:

#100
#1299
#1335

So I get the timestamps, but none of the event data, which is what I really care about!

I'm thinking the reason for this is that the negative lookahead is "greedy", but I'm not completely sure.

There may be an entirely different regex that makes this much simpler; I'm open to any suggestions!

Any help is much appreciated!

-k

I think a regex is not a good tool for the job here. You can just use a loop:

>>> import collections
>>> d = collections.defaultdict(list)
>>> with open('/tmp/spam.txt') as f:
...   t = 'initial'
...   for line in f:
...     if line.startswith('#'):
...       t = line.strip()
...     else:
...       d[t].append(line.strip())
... 
>>> for k, v in d.items():
...   print(k, len(v))
... 
#100 3
#1299 4
#1335 6

The reason is that the dot doesn't match newlines, so your expression only matches the lines containing the timestamp; the match won't extend across multiple lines. You could pass the re.DOTALL flag to re.compile so that the dot also matches newlines and the expression can match across lines. Since you say the event encodings might also contain a # character, you might also want to use the re.MULTILINE flag and anchor your match with ^ at the beginning, so it only matches a # at the beginning of a line.
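
As a rough illustration of the two flags mentioned above, here is a minimal sketch. The `log` string is a shortened, made-up sample in the question's format, and the variable names are only for illustration.

```python
import re

# Shortened sample in the question's format (assumed data, for illustration).
log = "#100\n04!\n03!\n#1299\n0L\n0X#\n"

# Without DOTALL, "." stops at each newline, so only the timestamp lines match.
plain = re.findall(r'#[0-9]{2,}.*', log)

# With DOTALL, "." also matches "\n"; the greedy ".*" then runs to the end of the text.
dotall = re.findall(r'#[0-9]{2,}.*', log, re.DOTALL)

# With MULTILINE, "^" matches at the start of every line, so the "#" inside
# an event such as "0X#" is never mistaken for a timestamp.
anchored = re.findall(r'^#[0-9]{2,}', log, re.MULTILINE)

print(plain)        # ['#100', '#1299']
print(len(dotall))  # 1
print(anchored)     # ['#100', '#1299']
```

DOTALL on its own is therefore not enough: the greedy `.*` swallows everything after the first timestamp, which is why the match also needs to be made non-greedy or bounded in some other way.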

You could just loop through the data line by line and keep a dictionary that stores the number of events associated with each timestamp; no regex required. For example:

with open('exampleData') as example:
    eventCountsDict = {}
    currEvent = None
    for line in example:
        if line[0] == '#': # replace this line with more specific timestamp details if event encodings can start with a '#'
            eventCountsDict[line] = 0
            currEvent = line
        else:
            eventCountsDict[currEvent] += 1

print(eventCountsDict)

That code prints {'#100\n': 3, '#1299\n': 4, '#1335\n': 5} for your example data (not counting the ...).
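
The comment in the loop above suggests a stricter timestamp check in case an event can start with a #. One possible sketch follows, assuming a timestamp line is exactly a # followed by at least two digits; `TIMESTAMP` and `count_events` are hypothetical names, not part of the original answer.

```python
import re

# Assumption: a timestamp line is "#" followed by at least two digits and nothing else.
TIMESTAMP = re.compile(r'#[0-9]{2,}')

def count_events(lines):
    """Count the event lines that follow each timestamp line."""
    counts = {}
    current = None
    for raw in lines:
        line = raw.rstrip('\r\n')
        if TIMESTAMP.fullmatch(line):
            current = line
            counts.setdefault(current, 0)  # keep timestamps that have no events
        elif current is not None and line:
            counts[current] += 1           # any other non-empty line is an event
    return counts

# "#x" starts with "#" but is not a timestamp, so it is counted as an event.
print(count_events(["#100\n", "04!\n", "#x\n", "#1299\n", "0L\n"]))
```

An event whose encoding is itself a # followed only by digits would still be read as a timestamp; the format as described gives no way to tell those apart.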

If you insist on a regex-based solution, I propose this:

>>> import re
>>> # s is assumed to hold the whole log file as one string
>>> pat = re.compile(r'(^#[0-9]{2,})\s*\n((?:[^#].*\n)*)', re.MULTILINE)
>>> for t, e in pat.findall(s):
...     print(t, e.count('\n'))
...
#100 3
#1299 4
#1335 6

Explanation:

(              
  ^            anchor to start of line in multiline mode
  #[0-9]{2,}   a # followed by at least two digits
)
\s*            skip whitespace just in case (e.g. Windows line separator)
\n             new line
(
  (?:          repeat non-capturing group inside capturing group to capture 
               all repetitions
    [^#].*\n   line not starting with #
  )*
)
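
The same breakdown can be kept next to the pattern itself with re.VERBOSE. This is only a sketch of that option: in verbose mode an unescaped # starts a comment, so the literal # is written as \# (inside a character class it needs no escape). The sample string `s` is assumed data.

```python
import re

pat = re.compile(r'''
    (^\#[0-9]{2,})      # timestamp: "#" at the start of a line, then two or more digits
    \s*\n               # optional trailing whitespace, then the line break
    (                   # capture all event lines as one group
      (?:[^#].*\n)*     # any number of lines that do not start with "#"
    )
''', re.MULTILINE | re.VERBOSE)

# Assumed sample data in the question's format.
s = "#100\n04!\n03!\n02!\n#1299\n0L\n0K\n0J\n0E\n"
result = [(t, e.count('\n')) for t, e in pat.findall(s)]
print(result)  # [('#100', 3), ('#1299', 4)]
```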

You seem to have misunderstood what negative lookahead does. When it follows .*, the regex engine first tries to consume as many characters as possible and only then checks the lookahead. If the lookahead assertion fails, the engine backtracks character by character until it succeeds. In your pattern, .* runs to the end of the timestamp line (the dot does not match a newline), and the negative lookahead succeeds there immediately because the next character is a newline rather than a timestamp.
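
A small sketch of that behaviour, run on a shortened, made-up sample (`log` and the variable names are assumptions for illustration):

```python
import re

# Shortened sample in the question's format (assumed data).
log = "#100\n04!\n03!\n#1299\n0L\n"

# The pattern from the question: greedy ".*" followed by a negative lookahead.
greedy = re.compile(r'(#[0-9]{2,}.*)(?!#[0-9]+)')
matches = greedy.findall(log)
print(matches)  # ['#100', '#1299']

# ".*" stops at the end of the timestamp line. The character right after the
# match is "\n", which is not a timestamp, so the lookahead succeeds at once
# and no backtracking is needed.
first = greedy.search(log)
next_char = log[first.end()]
print(repr(next_char))  # '\n'
```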

You could, however, use positive lookahead together with the non-greedy .*?. Here the .*? will consume characters until the lookahead sees either a # at the start of a line or the end of the whole string:

re.compile(r'(^#[0-9]{2,})\s*\n(.*?)(?=^#|\Z)', re.DOTALL | re.MULTILINE)
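
For completeness, a sketch of how that pattern could be used to get the counts. The `log` string reproduces the sample from the question without the trailing "..." line, and `counts` is a hypothetical name.

```python
import re

# The sample from the question, without the trailing "..." line.
log = ("#100\n04!\n03!\n02!\n"
       "#1299\n0L\n0K\n0J\n0E\n"
       "#1335\n06!\n0X#\n0[#\nb1010 Z$\nb1x [$\n")

pat = re.compile(r'(^#[0-9]{2,})\s*\n(.*?)(?=^#|\Z)', re.DOTALL | re.MULTILINE)

# Group 1 is the timestamp, group 2 is the block of event lines after it.
counts = {t: events.count('\n') for t, events in pat.findall(log)}
print(counts)  # {'#100': 3, '#1299': 4, '#1335': 5}
```

Events such as 0X# are not a problem, because the lookahead only stops at a # that begins a line. An event line that itself begins with # would end the block early.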
