
Regular expression, matching between two patterns in a multiline string

I have a multiline string, and I want a regular expression that grabs the text between two patterns. Here, I am trying to match everything between the title and the date:

import re
s = """\n#here's a title\n\nhello world!!!\n\nPosted on 11-09-2014 02:32:30"""
re.findall(r'#.+\n', s)[0][1:-1]  # this grabs the title
Out: "here's a title"
re.findall(r'Posted on .+', s)[0][10:]  # this grabs the date
Out: "11-09-2014 02:32:30"
re.findall(r'^[#\W+]', s)  # try to grab everything after the title
Out: ['\n']  # but it only grabs until the end of line

You can locate both delimiters with re.search and slice the string between them:

>>> import re
>>> s = '''\n#here's a title\n\nhello world!!!\n\nPosted on 11-09-2014 02:32:30'''
>>> m1 = re.search(r'^#.+$', s, re.MULTILINE)
>>> m2 = re.search(r'^Posted on ', s, re.MULTILINE)
>>> m1.end()
16
>>> m2.start()
34
>>> s[m1.end():m2.start()]
'\n\nhello world!!!\n\n'

Don't forget to check that m1 and m2 are not None before calling end() or start() on them; re.search returns None when the pattern does not match.
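
The None check can be folded into a small helper. The following is a minimal sketch, not part of the original answer: the function name `text_between` and its parameters are hypothetical, and it assumes the end delimiter should only be searched for after the start delimiter.

```python
import re

def text_between(text, start_pattern, end_pattern):
    """Return the text between the first match of each pattern, or None."""
    start = re.search(start_pattern, text, re.MULTILINE)
    if start is None:
        return None
    # Search for the end delimiter only after the start delimiter.
    end = re.compile(end_pattern, re.MULTILINE).search(text, start.end())
    if end is None:
        return None
    return text[start.end():end.start()]

s = "\n#here's a title\n\nhello world!!!\n\nPosted on 11-09-2014 02:32:30"
body = text_between(s, r'^#.+$', r'^Posted on ')
print(repr(body))  # '\n\nhello world!!!\n\n'

# A string without either delimiter yields None instead of raising.
missing = text_between("no markers here", r'^#.+$', r'^Posted on ')
print(missing)  # None
```

Calling `.strip()` on the returned string removes the surrounding blank lines if they are not wanted.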

>>> re.findall(r'\n([^#].*)Posted', s, re.S)
['\nhello world!!!\n\n']

If you want to avoid the newlines:

>>> re.findall(r'^([^#\n].*?)\n+Posted', s, re.S | re.M)
['hello world!!!']
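
To see what the re.S (DOTALL) flag contributes, it may help to try a body that spans several paragraphs. This is a sketch with a made-up input string `s2`; without re.S the `.` cannot cross a newline, so only the last paragraph before "Posted" is captured.

```python
import re

# Hypothetical input: the body has two paragraphs.
s2 = ("\n#title\n\nfirst paragraph\n\nsecond paragraph\n\n"
      "Posted on 11-09-2014 02:32:30")

pattern = r'^([^#\n].*?)\n+Posted'

# re.M alone: '.' stops at newlines, so the match must fit on one line.
without_dotall = re.findall(pattern, s2, re.M)
print(without_dotall)  # ['second paragraph']

# re.S | re.M: '.' also matches newlines, so the whole body is captured.
with_dotall = re.findall(pattern, s2, re.S | re.M)
print(with_dotall)     # ['first paragraph\n\nsecond paragraph']
```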

You could match all three parts with a single regular expression.

>>> s = '''\n#here's a title\n\nhello world!!!\n\nPosted on 11-09-2014 02:32:30'''
>>> re.search(r'#([^\n]+)\s+([^\n]+)\s+\D+([^\n]+)', s).groups()
("here's a title", 'hello world!!!', '11-09-2014 02:32:30')
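
The same pattern can be written with re.VERBOSE so that each piece carries a comment. This is only an annotated sketch of the expression above; the name `POST_PATTERN` is made up here. Note that in verbose mode a literal `#` has to be escaped, and that `([^\n]+)` captures a single line, so this form assumes the content does not span multiple lines.

```python
import re

POST_PATTERN = re.compile(r"""
    \#([^\n]+)   # group 1: title, the text after a literal '#'
    \s+          # newline(s) between title and content
    ([^\n]+)     # group 2: content, limited to one line
    \s+          # newline(s) between content and the date line
    \D+          # skip the non-digit prefix, here 'Posted on '
    ([^\n]+)     # group 3: date, from the first digit to end of line
""", re.VERBOSE)

s = "\n#here's a title\n\nhello world!!!\n\nPosted on 11-09-2014 02:32:30"
match = POST_PATTERN.search(s)
parts = match.groups() if match else None
print(parts)  # ("here's a title", 'hello world!!!', '11-09-2014 02:32:30')
```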

Use a capturing group (parentheses) to extract the part you want:

    result = re.search(r'#[^\n]+\n+(.*?)\n+Posted on ', s, re.DOTALL)
    result.group(1)

Here I've used search, but you can use findall instead if the same string may contain multiple matches.
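
With several posts in one string, the quantifier inside the group matters. The sketch below uses a made-up two-post input and compares a non-greedy `(.*?)` with a greedy `(.*)`; under re.DOTALL the greedy form runs on to the last "Posted on" and merges both posts into one match.

```python
import re

# Hypothetical input with two posts in one string.
posts = (
    "#first title\n\nfirst body\n\nPosted on 11-09-2014 02:32:30\n"
    "#second title\n\nsecond body\n\nPosted on 12-09-2014 08:00:00\n"
)

# Non-greedy: each match stops at the nearest 'Posted on'.
lazy = re.findall(r'#[^\n]+\n+(.*?)\n+Posted on ', posts, re.DOTALL)
print(lazy)  # ['first body', 'second body']

# Greedy: a single match that swallows the second title as well.
greedy = re.findall(r'#[^\n]+\n+(.*)\n+Posted on ', posts, re.DOTALL)
print(len(greedy))                  # 1
print('#second title' in greedy[0])  # True
```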

If you want to capture the title, the content and the date, you can use multiple groups:

    result = re.search(r'#([^\n]+)\n+(.*?)\n+Posted on ([^\n]*)', s, re.DOTALL)
    result.group(1) # The title
    result.group(2) # The contents
    result.group(3) # The date

Capturing all three in the same regex is much better than using one regex for each part, especially if your multiline string may contain multiple matches: lining up the results of separate findall calls can easily produce wrong title-content-date combinations.

If you are going to use this regex a lot, consider compiling it once for performance:

    regex = re.compile(r'#([^\n]+)\n+(.*?)\n+Posted on ([^\n]*)', re.DOTALL)
    # ...
    result = regex.search(s)
    result = regex.search('another multiline string, ...')
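
A compiled pattern also combines well with finditer when each string can hold several posts. This is a minimal sketch under assumptions: `POST_RE`, `parse_posts`, and the sample `documents` are made up for illustration, and the content group is non-greedy so each match ends at the nearest "Posted on" line.

```python
import re

# Compiled once at import time and reused for every document.
POST_RE = re.compile(r'#([^\n]+)\n+(.*?)\n+Posted on ([^\n]*)', re.DOTALL)

def parse_posts(text):
    """Return a list of (title, content, date) tuples found in text."""
    return [m.groups() for m in POST_RE.finditer(text)]

# Hypothetical inputs: one post, then two posts in a single string.
documents = [
    "#first title\n\nfirst body\n\nPosted on 11-09-2014 02:32:30\n",
    "#second title\n\nline one\nline two\n\nPosted on 12-09-2014 08:00:00\n"
    "#third title\n\nthird body\n\nPosted on 13-09-2014 09:15:00\n",
]

for doc in documents:
    for title, content, date in parse_posts(doc):
        print(title, '|', repr(content), '|', date)
```

Because every tuple comes from a single match, the title, content, and date in it always belong to the same post.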

Use a capturing group with a non-greedy quantifier (.*?), and give the group a name for easier lookup.

>>> s = '\n#here\'s a title\n\nhello world!!!\n\nPosted on 11-09-2014 02:32:30'
>>> pattern = r'\s*#[\w \']+\n+(?P<content>.*?)\n+Posted on'
>>> a = re.match(pattern, s, re.S)  # re.S lets the content span several lines
>>> a.group('content')
'hello world!!!'
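
Named groups extend naturally to all three fields, and groupdict() then returns them as a dictionary. The group names `title`, `content`, and `date` below are chosen for illustration, and the pattern is a sketch rather than the exact one from the answer above.

```python
import re

s = "\n#here's a title\n\nhello world!!!\n\nPosted on 11-09-2014 02:32:30"

pattern = re.compile(
    r'#(?P<title>[^\n]+)\n+'       # title line after '#'
    r'(?P<content>.*?)\n+'         # non-greedy body, may span lines
    r'Posted on (?P<date>[^\n]*)',  # rest of the 'Posted on' line
    re.DOTALL,
)

m = pattern.search(s)
# Guard against a failed match before reading the groups.
fields = m.groupdict() if m is not None else {}
print(fields['title'])    # here's a title
print(fields['content'])  # hello world!!!
print(fields['date'])     # 11-09-2014 02:32:30
```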
