Python regular expression for multiple tags

Question

I would like to know how to retrieve all results from each <p> tag.

import re
htmlText = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'
print(re.match('<p[^>]*size="[0-9]">(.*?)</p>', htmlText).groups())

Result:

('item1',)

What I need:

('item1', 'item2', 'item3')

Answer 1

For this type of problem, it is recommended to use a DOM parser, not regex.

I've seen Beautiful Soup frequently recommended for Python.
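If installing a third-party library isn't an option, the standard library's html.parser module can handle the question's input as well. Note that it is an event-driven parser rather than a true DOM parser, so this is only a rough sketch of the parser-based approach. The names ParagraphCollector and paragraph_texts are made up for this illustration.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text inside every <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.items = []      # text of each finished paragraph
        self._in_p = False   # True while the parser is inside a <p> element
        self._chunks = []    # text pieces of the paragraph being read

    def handle_starttag(self, tag, attrs):
        # Attributes are ignored, so their order and spacing do not matter.
        if tag == "p":
            self._in_p = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self.items.append("".join(self._chunks))
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._chunks.append(data)


def paragraph_texts(html_text):
    """Return the text of every <p> element as a tuple."""
    collector = ParagraphCollector()
    collector.feed(html_text)
    collector.close()
    return tuple(collector.items)


html_text = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'
print(paragraph_texts(html_text))  # ('item1', 'item2', 'item3')
```

Because the tag name is the only thing checked, this does not care whether size comes first, has two digits, or is followed by a space.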

Answer 2

Beautiful Soup is definitely the way to go with a problem like this. The code is cleaner and easier to read. Once you have it installed, getting all the tags looks something like this.

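# Python 3 with Beautiful Soup 4 (pip install beautifulsoup4);
# the original answer used Python 2, urllib2 and Beautiful Soup 3.
from urllib.request import urlopen
from bs4 import BeautifulSoup

def getTags(tag):
    # Download the page and parse it with Python's built-in HTML parser
    with urlopen("http://cnn.com") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    # Return every element with the given tag name
    return soup.find_all(tag)


if __name__ == '__main__':
    tags = getTags('p')
    for tag in tags:
        print(tag.contents)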

This will print out the contents of every p tag.

Answer 3

The regex answer is extremely fragile. Here's proof (and a working BeautifulSoup example).

# Beautiful Soup 4; the original answer used version 3
from bs4 import BeautifulSoup

# Here's your HTML
html = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'

# Here's some simple HTML that breaks your accepted
# answer, but doesn't break BeautifulSoup.
# For each example, the regex will ignore the first <p> tag.
html2 = '<p size="4" data="5">item1</p><p size="4">item2</p><p size="4">item3</p>'
html3 = '<p data="5" size="4" >item1</p><p size="4">item2</p><p size="4">item3</p>'
html4 = '<p data="5" size="12">item1</p><p size="4">item2</p><p size="4">item3</p>'

# This BeautifulSoup code works for all the examples.
for document in (html, html2, html3, html4):
    paragraphs = BeautifulSoup(document, 'html.parser').find_all('p')
    items = [p.get_text() for p in paragraphs]
    print(items)

Use BeautifulSoup.
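To see the breakage for yourself, here is a small standard-library check that runs the question's pattern with re.findall over the four strings from this answer. The dictionary keys are just labels chosen for this sketch.

```python
import re

# The pattern from the question, compiled once
PATTERN = re.compile(r'<p[^>]*size="[0-9]">(.*?)</p>')

samples = {
    "original (html)": '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>',
    "attributes reordered (html2)": '<p size="4" data="5">item1</p><p size="4">item2</p><p size="4">item3</p>',
    "space before > (html3)": '<p data="5" size="4" >item1</p><p size="4">item2</p><p size="4">item3</p>',
    "two-digit size (html4)": '<p data="5" size="12">item1</p><p size="4">item2</p><p size="4">item3</p>',
}

# Only the first sample yields all three items; in the others the
# first <p> tag no longer fits the pattern and is silently skipped.
results = {label: PATTERN.findall(text) for label, text in samples.items()}
for label, found in results.items():
    print(label, found)
```

The failure is silent: nothing is raised, the first item is simply missing from the result.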

Answer 4

You can use re.findall like this:

import re
html = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'
print(re.findall('<p[^>]*size="[0-9]">(.*?)</p>', html))
# This prints: ['item1', 'item2', 'item3']

Edit: ...but as the many commenters have pointed out, using regular expressions to parse HTML is usually a bad idea.
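For anyone wondering why the question's code returned only one item: re.match tries the pattern once, at the start of the string, and returns a single match object, whereas re.findall and re.finditer keep scanning for further matches. A minimal Python 3 sketch of the difference, using the question's pattern and input (the variable names are only for illustration):

```python
import re

html = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'
pattern = re.compile(r'<p[^>]*size="[0-9]">(.*?)</p>')

# match(): one attempt at position 0, so only the first <p> is captured
first_only = pattern.match(html).groups()

# findall(): every non-overlapping match; with one group it returns the group texts
all_items = tuple(pattern.findall(html))

# finditer(): the same matches, as match objects produced one at a time
lazy_items = tuple(m.group(1) for m in pattern.finditer(html))

print(first_only)  # ('item1',)
print(all_items)   # ('item1', 'item2', 'item3')
print(lazy_items)  # ('item1', 'item2', 'item3')
```

Wrapping the findall result in tuple() gives exactly the output the question asked for.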

Answer 5

Alternatively, xml.dom.minidom will parse your HTML if,

...it is well-formed
...you embed it in a single root element.

E.g.,

>>> import xml.dom.minidom
>>> htmlText = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'
>>> d = xml.dom.minidom.parseString('<not_p>%s</not_p>' % htmlText)
>>> tuple(map(lambda e: e.firstChild.wholeText, d.firstChild.childNodes))
('item1', 'item2', 'item3')
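Both conditions matter in practice, because minidom is an XML parser and raises xml.parsers.expat.ExpatError as soon as either one is violated. The sketch below shows this; parses_as_xml and paragraph_texts are hypothetical helper names, and returning None on failure is just one possible way to handle the error.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError


def parses_as_xml(text):
    """Return True if minidom accepts the text as a complete XML document."""
    try:
        xml.dom.minidom.parseString(text)
        return True
    except ExpatError:
        return False


def paragraph_texts(fragment):
    """Wrap the fragment in a dummy root element and return the text of each
    child element, or None if the fragment is not well-formed XML."""
    try:
        document = xml.dom.minidom.parseString('<not_p>%s</not_p>' % fragment)
    except ExpatError:
        return None
    return tuple(e.firstChild.wholeText for e in document.firstChild.childNodes)


html_text = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'

print(parses_as_xml(html_text))               # False: three top-level elements, no single root
print(paragraph_texts(html_text))             # ('item1', 'item2', 'item3')
print(paragraph_texts('<p>item1<br></p>'))    # None: the unclosed <br> is not well-formed XML
```

Real-world HTML often contains unclosed tags like <br>, which is why the HTML-aware parsers in the other answers are usually the safer choice.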

Answer summary (5 answers)

Answer 1: score 11, posted 2009-06-09 22:14:02
Answer 2: score 5, posted 2009-06-09 23:00:36
Answer 3: score 4, accepted, posted 2009-06-10 03:19:07
Answer 4: score 2, posted 2009-06-09 22:12:46
Answer 5: score 2, posted 2009-06-09 22:38:25
