Regular expression, matching between two patterns in a multiline string

I have a multiline string, and I want a regular expression to grab some text from between two patterns. Here I am trying to match everything between the title and the date.

For example:

import re

s = """\n#here's a title\n\nhello world!!!\n\nPosted on 11-09-2014 02:32:30"""
re.findall(r'#.+\n', s)[0][1:-1]  # this grabs the title
Out: "here's a title"
re.findall(r'Posted on .+', s)[0][10:]  # this grabs the date (s has no trailing newline)
Out: "11-09-2014 02:32:30"
re.findall(r'^[#\W+]', s)  # try to grab everything after the title
Out: ['\n']  # but it only grabs until the end of line

>>> import re
>>> s = '''\n#here's a title\n\nhello world!!!\n\nPosted on 11-09-2014 02:32:30'''
>>> m1 = re.search(r'^#.+$', s, re.MULTILINE)
>>> m2 = re.search(r'^Posted on ', s, re.MULTILINE)
>>> m1.end()
16
>>> m2.start()
34
>>> s[m1.end():m2.start()]
'\n\nhello world!!!\n\n'

Don't forget to check that m1 and m2 are not None: re.search returns None when a pattern is not found.
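A minimal sketch of that check, wrapped in a helper. The function name `text_between` and its parameters are hypothetical, chosen only for illustration; it returns None when either marker is missing instead of failing with an AttributeError.

```python
import re

def text_between(s, start_pattern, end_pattern):
    """Return the text between the first match of start_pattern and the
    next match of end_pattern, or None if either pattern is not found."""
    m1 = re.search(start_pattern, s, re.MULTILINE)
    if m1 is None:
        return None
    # Look for the end marker only after the start marker.
    m2 = re.compile(end_pattern, re.MULTILINE).search(s, m1.end())
    if m2 is None:
        return None
    return s[m1.end():m2.start()]

s = "\n#here's a title\n\nhello world!!!\n\nPosted on 11-09-2014 02:32:30"
print(repr(text_between(s, r'^#.+$', r'^Posted on ')))            # '\n\nhello world!!!\n\n'
print(text_between("no markers here", r'^#.+$', r'^Posted on '))  # None
```

Call `.strip()` on the result if the surrounding blank lines are not wanted.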

>>> re.findall(r'\n([^#].*)Posted', s, re.S)
['\nhello world!!!\n\n']

If you want to avoid the newlines:

>>> re.findall(r'^([^#\n].*?)\n+Posted', s, re.S | re.M)
['hello world!!!']
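The two flags in that call do different jobs and are easy to mix up: re.S (re.DOTALL) lets `.` match newlines, while re.M (re.MULTILINE) lets `^` and `$` match at every line boundary. A small sketch on the same string (the variable names are only illustrative):

```python
import re

s = "\n#here's a title\n\nhello world!!!\n\nPosted on 11-09-2014 02:32:30"

# Without re.S, '.' stops at a newline, so the match cannot span lines.
no_dotall = re.findall(r'title.*Posted', s)
# With re.S (DOTALL), '.' also matches '\n'.
with_dotall = re.findall(r'title.*Posted', s, re.S)

# Without re.M, '^' only matches at the very start of the string.
no_multiline = re.findall(r'^#.+', s)
# With re.M (MULTILINE), '^' matches at the start of every line.
with_multiline = re.findall(r'^#.+', s, re.M)

print(no_dotall)       # []
print(with_dotall)     # ['title\n\nhello world!!!\n\nPosted']
print(no_multiline)    # []
print(with_multiline)  # ["#here's a title"]
```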

You could match everything using a single regular expression.

>>> s = '''\n#here's a title\n\nhello world!!!\n\nPosted on 11-09-2014 02:32:30'''
>>> re.search(r'#([^\n]+)\s+([^\n]+)\s+\D+([^\n]+)', s).groups()
("here's a title", 'hello world!!!', '11-09-2014 02:32:30')

You should use a group match using parentheses:

    # The lazy (.*?) stops at the first 'Posted on'; re.DOTALL lets '.' cross newlines.
    result = re.search(r'#[^\n]+\n+(.*?)\n+Posted on ', s, re.MULTILINE | re.DOTALL)
    result.group(1)  # 'hello world!!!'

Here I've used search, but you can still use findall if the same string may contain multiple matches.
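When the string holds several posts, the capture has to be lazy (`.*?`); a greedy `.*` combined with re.DOTALL runs on to the last `Posted on` and merges everything into one match. A sketch with a hypothetical two-post string:

```python
import re

# Hypothetical input: two posts in one string.
posts = (
    "#first title\n\nhello world!!!\n\nPosted on 11-09-2014 02:32:30\n"
    "#second title\n\ngoodbye\nfor now\n\nPosted on 12-09-2014 08:00:00\n"
)

# Lazy capture: each post's content is matched separately.
contents = re.findall(r'#[^\n]+\n+(.*?)\n+Posted on ', posts, re.DOTALL)
print(contents)     # ['hello world!!!', 'goodbye\nfor now']

# Greedy capture: a single match that swallows both posts.
greedy = re.findall(r'#[^\n]+\n+(.*)\n+Posted on ', posts, re.DOTALL)
print(len(greedy))  # 1
```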

If you want to capture the title, the content and the date, you can use multiple groups:

    result = re.search(r'#([^\n]+)\n+(.*?)\n+Posted on ([^\n]*)', s, re.MULTILINE | re.DOTALL)
    result.group(1)  # The title
    result.group(2)  # The contents
    result.group(3)  # The date

Capturing all three in the same regex is much better than using one regex for each part, especially if your multiline string may contain multiple matches: 'syncing' the individual findall results together could easily lead to wrong title-content-date combinations.
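To make the syncing problem concrete, here is a sketch with a hypothetical input in which the second post has no title line. The separate findall calls return lists of different lengths, so pairing them by position goes wrong, while a single regex keeps each title with its own content and date (and simply skips the untitled post).

```python
import re

# Hypothetical input: the second post has no title line.
posts = (
    "#first title\n\nhello\n\nPosted on 11-09-2014 02:32:30\n"
    "\nuntitled text\n\nPosted on 12-09-2014 08:00:00\n"
    "#third title\n\nbye\n\nPosted on 13-09-2014 09:00:00\n"
)

# Separate findall calls: two titles but three dates.
titles = re.findall(r'^#(.+)$', posts, re.MULTILINE)
dates = re.findall(r'^Posted on (.+)$', posts, re.MULTILINE)
print(len(titles), len(dates))      # 2 3
# Zipping pairs 'third title' with the second post's date.
print(list(zip(titles, dates))[1])  # ('third title', '12-09-2014 08:00:00')

# One regex with three groups keeps the parts of each post together.
combined = re.compile(r'^#([^\n]+)\n+(.*?)\n+Posted on ([^\n]*)',
                      re.MULTILINE | re.DOTALL)
records = [m.groups() for m in combined.finditer(posts)]
print(records[1])                   # ('third title', 'bye', '13-09-2014 09:00:00')
```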

If you are going to use this regex a lot, consider compiling it once for performance:

    regex = re.compile(r'#([^\n]+)\n+(.*?)\n+Posted on ([^\n]*)', re.MULTILINE | re.DOTALL)
    # ...
    result = regex.search(s)
    result = regex.search('another multiline string, ...')

Use a group match with a non-greedy search (.*?), and give the group a name for easier lookup.

>>> s = '\n#here\'s a title\n\nhello world!!!\n\nPosted on 11-09-2014 02:32:30'
>>> pattern = r'\s*#[\w \']+\n+(?P<content>.*?)\n+Posted on'
>>> a = re.match(pattern, s, re.M)
>>> a.group('content')
'hello world!!!'
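The same idea extends to naming all three parts. The sketch below is an illustration, not the answer's exact pattern: the group names `title` and `date` and the extra content line in `s` are assumptions added here, and re.DOTALL is used so that the lazy content group can span more than one line.

```python
import re

# Same post as above, with a hypothetical second content line added.
s = "\n#here's a title\n\nhello world!!!\nsecond line\n\nPosted on 11-09-2014 02:32:30"

pattern = re.compile(
    r"\s*#(?P<title>[^\n]+)\n+"      # title line
    r"(?P<content>.*?)\n+"           # lazy body, may span several lines
    r"Posted on (?P<date>[^\n]+)",   # date line
    re.DOTALL,
)

m = pattern.match(s)
if m is not None:
    post = m.groupdict()  # maps each group name to its captured text
    print(post['title'])    # here's a title
    print(post['date'])     # 11-09-2014 02:32:30
```

Here `post['content']` holds both body lines, joined by a newline.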
