
Python: Need to extract tag content from html page using regex, but not BeautifulSoup

I need to extract the content inside <raw> tags. For example, I need to extract abcd and efgh from this HTML snippet:
<html><body><raw somestuff>abcd</raw><raw somesuff>efgh</raw></body></html>

I tried this in my Python code:
re.match(r'.*raw.*(.*)/raw.*', DATA)

But this does not return any substring. I'm not good with regex, so a correction to this or a new solution would help me a great deal. I am not allowed to use external libraries (due to a restriction at my company).

Your company really needs to rethink its policy. Rewriting an XML parser is a complete waste of time when several already exist for Python. Some are included in the stdlib, so if you can import re, you should also be allowed to import xml.etree.ElementTree or anything else listed at http://docs.python.org/library/markup.html .

You really should be using one of those. No sense duplicating all of that work.
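As a rough sketch of the stdlib route, the example below uses html.parser.HTMLParser (Python 3) rather than xml.etree.ElementTree, because the snippet in the question has bare attributes such as somestuff, which a strict XML parser rejects as not well-formed. The class name RawExtractor and its attributes are made up for illustration; only HTMLParser and its handle_* hooks come from the standard library.

```python
from html.parser import HTMLParser


class RawExtractor(HTMLParser):
    """Collects the text found inside each <raw>...</raw> element.

    Hypothetical helper for illustration; it does not handle nested <raw> tags.
    """

    def __init__(self):
        super().__init__()
        self._in_raw = False   # True while we are between <raw> and </raw>
        self._buf = []         # text chunks of the current <raw> element
        self.results = []      # one string per completed <raw> element

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attributes are ignored here.
        if tag == "raw":
            self._in_raw = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "raw" and self._in_raw:
            self.results.append("".join(self._buf))
            self._in_raw = False

    def handle_data(self, data):
        if self._in_raw:
            self._buf.append(data)


DATA = "<html><body><raw somestuff>abcd</raw><raw somesuff>efgh</raw></body></html>"

parser = RawExtractor()
parser.feed(DATA)
parser.close()
print(parser.results)  # ['abcd', 'efgh']
```

This needs no external packages, so it should fit the same restriction that permits import re.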

Non-greedy matching (*?) can do this easily, at least for your example:

re.findall(r'<raw[^>]*?>(.*?)</raw>', DATA)
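To make that concrete, here is a small runnable version using the snippet from the question as DATA. Two details are my additions, not part of the answer above: \b after raw so that a tag such as <rawdata> is not picked up, and re.DOTALL so that content spanning several lines is still captured. It also shows why the original attempt appears to return nothing: the pattern does match, but the greedy .* before the group swallows the text, leaving the group empty.

```python
import re

DATA = "<html><body><raw somestuff>abcd</raw><raw somesuff>efgh</raw></body></html>"

# Non-greedy (.*?) stops at the first </raw> instead of the last one.
# \b and re.DOTALL are assumptions added for robustness.
RAW_TAG = re.compile(r'<raw\b[^>]*>(.*?)</raw>', re.DOTALL)

matches = RAW_TAG.findall(DATA)
print(matches)  # ['abcd', 'efgh']

# The pattern from the question matches, but the greedy .* in front of
# the group consumes everything up to /raw, so the captured group is empty.
original = re.match(r'.*raw.*(.*)/raw.*', DATA)
print(repr(original.group(1)))  # ''
```

Like any regex over markup, this is only reliable for simple, non-nested tags like the ones in the example.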


 