Searching for characters between two delimiters in a string

I'm trying to parse a string to find all of the characters between two delimiters, <code> and </code>.

I have attempted using regular expressions, but I can't seem to understand what is going on.

My attempt:

import re
re.findall('<code>(.*?)</code>', processed_df['question'][2])

where processed_df['question'][2] is the string (this string is continuous, I typed it into multiple lines for readability):

 '<code>for x in finallist:\n    matchinfo = 
 requests.get("https://api.opendota.com/api/matches/{}".format(x)).json() 
 ["match_id"]\n    print(matchinfo)\n</code>'

I have tested with this test_string:

 test_string = '<code> this is a test </code>'

and it seems to work.

I have a feeling it has to do with special characters in the text between <code> and </code>, but I don't know how to fix it. Thank you for the help!

You might be better off with an HTML parser than a regex:

import lxml.html

html_snippet = """
 ...
 <p>Some stuff</p>
 ...
 <code>for x in finallist:\n    matchinfo = 
 requests.get("https://api.opendota.com/api/matches/{}".format(x)).json() 
 ["match_id"]\n    print(matchinfo)\n</code>
 ...
 And some Stuff
 ...
 another code block <br />
 <code>
    print('Hello world')
 </code>
 """

dom = lxml.html.fromstring(html_snippet)
codes = dom.xpath('//code')


for code in codes:
    print(code.text)

 >>>> for x in finallist:
 >>>>     matchinfo = 
 >>>> requests.get("https://api.opendota.com/api/matches/{}".format(x)).json() 
 >>>> ["match_id"]
 >>>>    print(matchinfo)

 >>>> print('Hello world')
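
If lxml isn't installed, roughly the same idea can be sketched with the standard library's html.parser. This is only a minimal sketch: the CodeExtractor class and the extract_code helper are names made up for illustration, and it assumes you only want the text content of each <code> element.

```python
from html.parser import HTMLParser


class CodeExtractor(HTMLParser):
    """Collects the text inside every <code>...</code> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # finished code blocks, in document order
        self._depth = 0    # greater than 0 while inside a <code> element
        self._parts = []   # text pieces of the block currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "code":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "code" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.blocks.append("".join(self._parts))
                self._parts = []

    def handle_data(self, data):
        # Only keep text that appears inside a <code> element.
        if self._depth:
            self._parts.append(data)


def extract_code(html):
    """Return a list with the text of each <code> block in html."""
    parser = CodeExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

Calling extract_code(html_snippet) should give one string per <code> block, newlines included, so the multi-line case needs no special handling.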

I think the issue is the newline character (\n): by default, . does not match a newline, so the pattern fails on multi-line strings. Make sure to match using the re.DOTALL flag, for example:

import re
regex = r"<code>(.*?)</code>"  # lazy match, so several <code> blocks are not merged into one

test_str = ("<code>for x in finallist:\\n    matchinfo = \n"
    " requests.get(\"https://api.opendota.com/api/matches/{}\".format(x)).json() \n"
    " [\"match_id\"]\\n    print(matchinfo)\\n</code>\n")

re.findall(regex, test_str, re.DOTALL)

['for x in finallist:\\n    matchinfo = \n requests.get("https://api.opendota.com/api/matches/{}".format(x)).json() \n ["match_id"]\\n    print(matchinfo)\\n']
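
To make the difference visible, here is a small sketch that runs the same lazy pattern with and without re.DOTALL. The sample string is made up for illustration (two <code> blocks, the first one spanning several lines); it is not the data from the question.

```python
import re

# Hypothetical sample: two <code> blocks, the first containing real newlines.
sample = (
    "<code>for x in finallist:\n    print(x)\n</code>"
    "<p>text</p>"
    "<code>print('Hello world')</code>"
)

pattern = r"<code>(.*?)</code>"

# Without DOTALL, '.' stops at a newline, so the multi-line block is skipped.
without_flag = re.findall(pattern, sample)

# With DOTALL, '.' also matches newlines; the lazy '.*?' keeps the blocks separate.
with_flag = re.findall(pattern, sample, re.DOTALL)

print(without_flag)
print(with_flag)
```

The first call only finds the single-line block, which matches what the question describes: the one-line test string works and the real multi-line string does not.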

The question doesn't explicitly say it needs regular expressions. With that said, I would say not using them is best. For example:

test_str = '''
<code>asldkfj
asdlkfjas
asdlkf
for i in range(asdlkf):
    print("Hey")
    if i == 8:
        print(i)
</code>
'''

start = len('<code>')

end = len('</code>')

# Only valid when the stripped string starts with <code> and ends with </code>
new_str = test_str.strip()[start:-end]  # everything between <code> and </code>
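
The slicing above handles a string that is exactly one <code> block. If the string may contain other text or several blocks, a rough regex-free sketch using str.find could look like this. The between function and its parameter names are hypothetical, and it assumes the tags are not nested.

```python
def between(text, start_tag="<code>", end_tag="</code>"):
    """Return every substring found between start_tag and end_tag."""
    results = []
    pos = 0
    while True:
        start = text.find(start_tag, pos)
        if start == -1:
            break  # no more opening tags
        start += len(start_tag)
        end = text.find(end_tag, start)
        if end == -1:
            break  # unclosed tag: stop rather than guess
        results.append(text[start:end])
        pos = end + len(end_tag)
    return results
```

For example, between(test_str) returns a list with one entry per block, and newlines inside a block are kept as-is.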
