Python regular expression for multiple tags

Question

I would like to know how to retrieve all results from each <p> tag.

import re
htmlText = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'
print(re.match('<p[^>]*size="[0-9]">(.*?)</p>', htmlText).groups())

Result:

('item1',)

What I need:

('item1', 'item2', 'item3')

Answer 1

For this type of problem, it is recommended to use a DOM parser, not regex.

I've seen Beautiful Soup frequently recommended for Python.
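As a rough illustration of the parser approach without any third-party install, here is a minimal sketch using the standard library's html.parser. Note that this is an event-based parser rather than a true DOM parser, and the names ParagraphCollector and paragraph_texts are hypothetical helpers made up for this example.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text found inside each <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.items = []       # finished paragraph texts
        self._in_p = False    # True while we are inside a <p>
        self._chunks = []     # text pieces of the current paragraph

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self.items.append("".join(self._chunks))
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._chunks.append(data)


def paragraph_texts(html_text):
    """Return the text of every <p> element as a tuple."""
    parser = ParagraphCollector()
    parser.feed(html_text)
    parser.close()
    return tuple(parser.items)


html_text = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'
print(paragraph_texts(html_text))  # ('item1', 'item2', 'item3')
```

Because the parser looks at tag names rather than attribute text, the result should not depend on attribute order, spacing, or the number of digits in size.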

Answer 2

Beautiful Soup is definitely the way to go with a problem like this. The code is cleaner and easier to read. Once you have it installed, getting all the tags looks something like this:

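from urllib.request import urlopen

from bs4 import BeautifulSoup  # Beautiful Soup 4: pip install beautifulsoup4


def getTags(tag):
    # Download the page and parse it with the standard-library HTML parser.
    with urlopen("http://cnn.com") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    return soup.find_all(tag)


if __name__ == '__main__':
    tags = getTags('p')
    for tag in tags:
        print(tag.contents)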

This prints the contents of every <p> tag on the page.

Answer 3

The regex answer is extremely fragile. Here's proof (and a working BeautifulSoup example).

from bs4 import BeautifulSoup

# Here's your HTML
html = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'

# Here's some simple HTML that breaks your accepted
# answer, but doesn't break BeautifulSoup.
# For each example, the regex will ignore the first <p> tag.
html2 = '<p size="4" data="5">item1</p><p size="4">item2</p><p size="4">item3</p>'
html3 = '<p data="5" size="4" >item1</p><p size="4">item2</p><p size="4">item3</p>'
html4 = '<p data="5" size="12">item1</p><p size="4">item2</p><p size="4">item3</p>'

# This BeautifulSoup code works for all the examples.
paragraphs = BeautifulSoup(html, 'html.parser').find_all('p')
# Join every text node inside each <p>, including text in nested tags.
items = [''.join(p.find_all(string=True)) for p in paragraphs]

Use BeautifulSoup.
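To make the fragility concrete, the following sketch runs the regex from the question against the original string and the three variants described in this answer. The dictionary name samples and the variable results are just illustrative; the pattern and the HTML strings are copied from the thread.

```python
import re

# The pattern used in the question and in the regex answer.
pattern = re.compile(r'<p[^>]*size="[0-9]">(.*?)</p>')

samples = {
    # Original input: all three paragraphs match.
    "html": '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>',
    # size is no longer the last attribute of the first tag.
    "html2": '<p size="4" data="5">item1</p><p size="4">item2</p><p size="4">item3</p>',
    # A space sits between the size attribute and the closing '>'.
    "html3": '<p data="5" size="4" >item1</p><p size="4">item2</p><p size="4">item3</p>',
    # size has two digits, but [0-9] accepts exactly one.
    "html4": '<p data="5" size="12">item1</p><p size="4">item2</p><p size="4">item3</p>',
}

results = {name: pattern.findall(text) for name, text in samples.items()}
for name, found in results.items():
    print(name, found)
# html ['item1', 'item2', 'item3']
# html2 ['item2', 'item3']
# html3 ['item2', 'item3']
# html4 ['item2', 'item3']
```

In each variant the first <p> is silently dropped rather than raising an error, which is what makes this kind of bug easy to miss.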

Answer 4

You can use re.findall like this:

import re
html = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'
print(re.findall('<p[^>]*size="[0-9]">(.*?)</p>', html))
# This prints: ['item1', 'item2', 'item3']

Edit: ...but as many commenters have pointed out, using regular expressions to parse HTML is usually a bad idea.
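For context on why the question's code returned only one item: re.match tries the pattern once, anchored at the start of the string, and returns a single match object, whereas re.findall scans the whole string for every non-overlapping match. A small side-by-side sketch (the variable names first and all_items are just for illustration):

```python
import re

html = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'
pattern = re.compile(r'<p[^>]*size="[0-9]">(.*?)</p>')

# match(): one attempt at position 0, so at most one result.
first = pattern.match(html)
print(first.groups())   # ('item1',)

# findall(): every non-overlapping match; with one group it returns
# a list of that group's text.
all_items = tuple(pattern.findall(html))
print(all_items)        # ('item1', 'item2', 'item3')
```

Wrapping the list in tuple() simply reproduces the exact output format the question asked for.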

Answer 5

Alternatively, xml.dom.minidom will parse your HTML if:

...it is well-formed
...you embed it in a single root element.

E.g.,

>>> import xml.dom.minidom
>>> htmlText = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'
>>> d = xml.dom.minidom.parseString('<not_p>%s</not_p>' % htmlText)
>>> tuple(map(lambda e: e.firstChild.wholeText, d.firstChild.childNodes))
('item1', 'item2', 'item3')
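The well-formedness condition matters in practice, since a lot of real-world HTML is not valid XML. The sketch below, which assumes a hypothetical helper named paragraph_texts and made-up sample strings, shows the same minidom approach succeeding on a well-formed fragment and raising ExpatError (caught here) on a fragment with an unclosed <br>.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError


def paragraph_texts(fragment):
    """Return the <p> texts, or None if the fragment is not well-formed XML."""
    try:
        # Wrap the fragment in a single root element, as in the answer.
        doc = xml.dom.minidom.parseString('<not_p>%s</not_p>' % fragment)
    except ExpatError:
        return None
    return tuple(e.firstChild.wholeText for e in doc.getElementsByTagName('p'))


well_formed = '<p size="4">item1</p><p size="4">item2</p>'
unclosed = '<p size="4">item1<br></p>'  # <br> is never closed: fine in HTML, invalid XML

print(paragraph_texts(well_formed))  # ('item1', 'item2')
print(paragraph_texts(unclosed))     # None
```

This sketch also assumes every <p> starts with a text node; an empty <p></p> would have no firstChild and would need an extra check.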

5 answers

solution1
11 2009-06-09 22:14:02

solution2
5 2009-06-09 23:00:36

solution3
4 ACCEPTED 2009-06-10 03:19:07

solution4
2 2009-06-09 22:12:46

solution5
2 2009-06-09 22:38:25
