
Regex to extract between two strings (which are variables)

I want to use a regex to extract the text that occurs between two strings. I know how to do this when the two strings are the same every time (countless questions already cover that, e.g. "Regex matching between two strings?"), but I want to do it with variables that change and that may themselves contain regex special characters. Any special characters, e.g. *, should be treated as literal text.

For example, if I had:

text = "<b*>Test</b>"
left_identifier = "<b*>"
right_identifier = "</b>"

I would want to build the regex dynamically so that the following call ends up being run:

re.findall(r'<b\*>(.*)<\/b>', text)

It is the <b\*>(.*)<\/b> part that I don't know how to create dynamically.

You can do something like this:

import re
pattern_string = re.escape(left_identifier) + "(.*?)" + re.escape(right_identifier)
pattern = re.compile(pattern_string)

The re.escape function automatically escapes the special characters. For example:

>>> import re
>>> print(re.escape("<b*>"))  # Python 3.7+; older versions also escape < and >
<b\*>
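As a minimal sketch of how the pieces above might be wrapped up, the helper below builds the pattern from two arbitrary delimiters. The name extract_between and the choice of re.DOTALL (so that the captured text may span several lines) are assumptions made for illustration, not part of the answer above.

```python
import re

def extract_between(text, left, right):
    """Return every substring of text found between left and right.

    Hypothetical helper: both delimiters are treated as literal text.
    """
    # re.escape neutralises metacharacters such as * in the delimiters.
    pattern = re.compile(re.escape(left) + "(.*?)" + re.escape(right), re.DOTALL)
    return pattern.findall(text)

print(extract_between("<b*>Test</b>", "<b*>", "</b>"))  # ['Test']
```

Because the delimiters are escaped inside the function, callers can pass any strings without worrying about their content.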

A regex starts its life as an ordinary string, so you can build the pattern by concatenation, left_identifier + "(.*)" + right_identifier, and pass the result to re.compile.

Or:

re.findall('{}(.*){}'.format(left_identifier, right_identifier), text)

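works too, provided the identifiers contain no regex metacharacters; if they might, pass each one through re.escape first.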
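To make the difference concrete, here is a small sketch that runs the format-based pattern on the question's own data, once with the raw identifiers and once with escaped ones. The variable names unescaped and escaped are only illustrative.

```python
import re

text = "<b*>Test</b>"
left_identifier = "<b*>"
right_identifier = "</b>"

# Unescaped: b* is read as "zero or more b", so the literal * never matches.
unescaped = re.findall("{}(.*){}".format(left_identifier, right_identifier), text)

# Escaped: every character of the identifiers is matched literally.
escaped = re.findall(
    "{}(.*){}".format(re.escape(left_identifier), re.escape(right_identifier)),
    text,
)

print(unescaped)  # []
print(escaped)    # ['Test']
```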

If the strings in the variables contain regex metacharacters and you do not want them interpreted as such, escape them with re.escape:

>>> import re
>>> text = "<b*>Test</b>"
>>> left_identifier = "<b*>"
>>> right_identifier = "</b>"
>>> s='{}(.*?){}'.format(*map(re.escape, (left_identifier, right_identifier)))
>>> s  # Python 3.7+ output; older versions also escape <, > and /
'<b\\*>(.*?)</b>'
>>> re.findall(s, text)
['Test']

On a side note, str.partition(var) is an alternative way to do this without a regex:

>>> text.partition(left_identifier)[2].partition(right_identifier)[0]
'Test'
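The one-liner above quietly returns an empty or misleading string when either delimiter is missing. A slightly more defensive sketch, assuming non-empty delimiters and using the hypothetical name between_partition, could look like this:

```python
def between_partition(text, left, right):
    """Hypothetical helper: text between the first left and the next right.

    Returns None when either delimiter is absent. Assumes both delimiters
    are non-empty, since str.partition rejects an empty separator.
    """
    # partition returns (before, separator, after); separator is '' if not found.
    _, found_left, rest = text.partition(left)
    if not found_left:
        return None
    middle, found_right, _ = rest.partition(right)
    if not found_right:
        return None
    return middle

print(between_partition("<b*>Test</b>", "<b*>", "</b>"))  # Test
```

Unlike re.findall, this only looks at the first occurrence, which is often all that is needed.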

You need to re.escape the identifiers:

>>> regex = re.compile('{}(.*){}'.format(re.escape('<b*>'), re.escape('</b>')))
>>> regex.findall('<b*>Text</b>')
['Text']
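Note that this answer uses a greedy (.*) while others in the thread use the lazy (.*?). The two agree on the question's example but differ once the text contains more than one delimited section, as this small sketch with made-up sample text shows:

```python
import re

# Sample text invented for illustration: two delimited sections.
text = "<b*>one</b> and <b*>two</b>"
left, right = re.escape("<b*>"), re.escape("</b>")

# Greedy: .* runs to the last right delimiter in the text.
greedy = re.findall("{}(.*){}".format(left, right), text)

# Lazy: .*? stops at the first right delimiter after each left one.
lazy = re.findall("{}(.*?){}".format(left, right), text)

print(greedy)  # ['one</b> and <b*>two']
print(lazy)    # ['one', 'two']
```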

I know you actually wanted a regex solution, but I really wonder whether regex is the right tool here, considering we have all taken an oath not to parse HTML with regex. When parsing HTML strings, I will always recommend falling back to BeautifulSoup:

>>> import bs4
>>> bs4.BeautifulSoup('<b*>Text</b>', 'html.parser').text
'Text'
