
Python: Need to extract tag content from html page using regex, but not BeautifulSoup

Original post: 2011-04-28 06:25:14 0 2 python/ html/ regex/ tags/ substring

I need to extract the content inside each <raw> tag. For example, I need to extract abcd and efgh from this HTML snippet:
<html><body><raw somestuff>abcd</raw><raw somesuff>efgh</raw></body></html>

I used this code in my Python script:
re.match(r'.*raw.*(.*)/raw.*', DATA)

But this does not return the substrings I want. I'm not good at regex, so a correction to this or a new solution would help me a great deal. I am not allowed to use external libraries (due to a restriction at my company).

2 answers

Your company really needs to rethink its policy. Rewriting an XML parser is a complete waste of time; there are already several for Python. Some are included in the stdlib, so if you can import re you should also be allowed to import xml.etree.ElementTree or anything else listed at http://docs.python.org/library/markup.html.

You really should be using one of those. No sense duplicating all of that work.
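
As a rough illustration of the stdlib route, here is a minimal sketch using html.parser.HTMLParser (the Python 3 module name; in Python 2 the module was called HTMLParser). The class name RawTagExtractor and its attributes are made up for this example. html.parser is used here rather than xml.etree.ElementTree because the sample's <raw somestuff> has an attribute without a value, which is not well-formed XML. The sketch assumes <raw> tags are not nested.

```python
from html.parser import HTMLParser


class RawTagExtractor(HTMLParser):
    """Collects the text found inside each <raw>...</raw> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._inside_raw = False  # True while between <raw ...> and </raw>
        self._buffer = []         # text pieces of the current <raw> element
        self.contents = []        # one string per completed <raw> element

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attributes are ignored here.
        if tag == "raw":
            self._inside_raw = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "raw" and self._inside_raw:
            self.contents.append("".join(self._buffer))
            self._inside_raw = False

    def handle_data(self, data):
        if self._inside_raw:
            self._buffer.append(data)


DATA = "<html><body><raw somestuff>abcd</raw><raw somesuff>efgh</raw></body></html>"

parser = RawTagExtractor()
parser.feed(DATA)
parser.close()
print(parser.contents)  # ['abcd', 'efgh']
```

Only the standard library is imported, so this should stay within a "no external libs" restriction.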

Non-greedy matching (*?) can do this easily, at least for your example.

re.findall(r'<raw[^>]*?>(.*?)</raw>', DATA)
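
To make the difference concrete, the sketch below runs both the pattern from the question and a non-greedy one on the sample input. The question's pattern does match, but its greedy .* runs swallow everything, so the capture group is left empty. The \b after raw and the re.DOTALL flag are additions of this sketch (to avoid matching longer tag names and to allow content that spans lines); the variable names are illustrative. It is still a regex, so it will not cope with nested <raw> tags or unusual markup.

```python
import re

DATA = "<html><body><raw somestuff>abcd</raw><raw somesuff>efgh</raw></body></html>"

# Pattern from the question: it matches, but the greedy ".*" before the
# group consumes the text, leaving the capture group empty.
greedy_match = re.match(r'.*raw.*(.*)/raw.*', DATA)
greedy_group = greedy_match.group(1)

# Non-greedy version: "[^>]*" skips the tag's attributes and "(.*?)" stops
# at the first closing </raw> instead of the last one.
RAW_CONTENT = re.compile(r'<raw\b[^>]*>(.*?)</raw>', re.DOTALL)
contents = RAW_CONTENT.findall(DATA)

print(repr(greedy_group))  # ''
print(contents)            # ['abcd', 'efgh']
```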


The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address. For any questions, please contact yoyou2525@163.com.
