
Python regular expression for multiple tags

I would like to know how to retrieve the contents of every <p> tag.

import re
htmlText = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'
print(re.match(r'<p[^>]*size="[0-9]">(.*?)</p>', htmlText).groups())

Result:

('item1',)

What I need:

('item1', 'item2', 'item3')

For this type of problem, it is recommended to use a DOM parser, not regex.

I've seen Beautiful Soup frequently recommended for Python.

Beautiful Soup is definitely the way to go for a problem like this. The code is cleaner and easier to read. Once you have it installed, getting all the tags looks something like this:

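from urllib.request import urlopen

from bs4 import BeautifulSoup  # Beautiful Soup 4: pip install beautifulsoup4


def get_tags(tag):
    """Fetch the CNN home page and return every element named `tag`."""
    with urlopen("http://cnn.com") as response:
        soup = BeautifulSoup(response.read(), "html.parser")
    return soup.find_all(tag)


if __name__ == '__main__':
    for tag in get_tags('p'):
        print(tag.contents)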

This prints the contents of every p tag on the page.
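If installing Beautiful Soup is not an option, the same parser-based idea can be sketched with the standard library's html.parser module, applied directly to the string from the question. This is a stand-in rather than the Beautiful Soup API, and the class name ParagraphText is made up for illustration.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text inside each <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.items = []     # one string per <p> element seen so far
        self._in_p = False  # True while the parser is inside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        # Text outside <p> elements is ignored.
        if self._in_p:
            self.items[-1] += data


html_text = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'
parser = ParagraphText()
parser.feed(html_text)
parser.close()
print(parser.items)  # ['item1', 'item2', 'item3']
```

Because the parser only looks at tag names, attribute order, extra whitespace, and attribute values do not affect the result.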

The regex answer is extremely fragile. Here's proof (and a working BeautifulSoup example).

from bs4 import BeautifulSoup  # Beautiful Soup 4: pip install beautifulsoup4

# Here's your HTML
html = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'

# Here's some simple HTML that breaks your accepted
# answer, but doesn't break BeautifulSoup.
# For each example, the regex will ignore the first <p> tag.
html2 = '<p size="4" data="5">item1</p><p size="4">item2</p><p size="4">item3</p>'
html3 = '<p data="5" size="4" >item1</p><p size="4">item2</p><p size="4">item3</p>'
html4 = '<p data="5" size="12">item1</p><p size="4">item2</p><p size="4">item3</p>'

# This BeautifulSoup code works for all the examples.
paragraphs = BeautifulSoup(html, 'html.parser').find_all('p')
items = [p.get_text() for p in paragraphs]
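The claim in the comments above can be checked with the standard library alone. The sketch below runs the regex from the accepted answer over all four strings; the dictionary names samples and results are only for illustration.

```python
import re

# The pattern from the accepted answer.
pattern = re.compile(r'<p[^>]*size="[0-9]">(.*?)</p>')

samples = {
    "html":  '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>',
    # size is no longer the last attribute of the first tag
    "html2": '<p size="4" data="5">item1</p><p size="4">item2</p><p size="4">item3</p>',
    # a space precedes the closing > of the first tag
    "html3": '<p data="5" size="4" >item1</p><p size="4">item2</p><p size="4">item3</p>',
    # the size value of the first tag has two digits
    "html4": '<p data="5" size="12">item1</p><p size="4">item2</p><p size="4">item3</p>',
}

results = {name: pattern.findall(text) for name, text in samples.items()}
for name, found in results.items():
    print(name, found)
# html ['item1', 'item2', 'item3']
# html2 ['item2', 'item3']
# html3 ['item2', 'item3']
# html4 ['item2', 'item3']
```

In each variant the first tag fails for a different reason: the pattern requires a single-digit size attribute immediately followed by the closing >.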

Use BeautifulSoup.

You can use re.findall like this:

import re
html = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'
print(re.findall(r'<p[^>]*size="[0-9]">(.*?)</p>', html))
# This prints: ['item1', 'item2', 'item3']

Edit: ...but as many commenters have pointed out, using regular expressions to parse HTML is usually a bad idea.
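For anyone wondering why the code in the question returned only one item: re.match tries the pattern once, at the start of the string, and returns a single match object, while re.findall and re.finditer scan the whole string. A small sketch with the same pattern and input (the variable names are illustrative):

```python
import re

html = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'
pattern = re.compile(r'<p[^>]*size="[0-9]">(.*?)</p>')

# match() anchors at position 0 and stops after the first match.
first_only = pattern.match(html).groups()

# findall() returns the captured group of every non-overlapping match.
all_items = tuple(pattern.findall(html))

# finditer() yields match objects, useful when positions are needed too.
with_positions = [(m.group(1), m.start()) for m in pattern.finditer(html)]

print(first_only)      # ('item1',)
print(all_items)       # ('item1', 'item2', 'item3')
print(with_positions)  # [('item1', 0), ('item2', 30), ('item3', 51)]
```

Wrapping the findall result in tuple() gives exactly the shape asked for in the question.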

Alternatively, xml.dom.minidom will parse your HTML if,

  • ...it is well-formed
  • ...you embed it in a single root element.

E.g.,

>>> import xml.dom.minidom
>>> htmlText = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'
>>> d = xml.dom.minidom.parseString('<not_p>%s</not_p>' % htmlText)
>>> tuple(map(lambda e: e.firstChild.wholeText, d.firstChild.childNodes))
('item1', 'item2', 'item3')
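The well-formedness condition matters in practice. The sketch below, assuming a hypothetical helper named paragraph_texts and a wrapper element named root, shows the same approach succeeding on the question's markup and failing on an unclosed <br>, which is common in real-world HTML.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError


def paragraph_texts(fragment):
    """Return the first text node of each <p> in a well-formed fragment."""
    # Wrap the fragment in a single root element so it is a valid XML document.
    doc = xml.dom.minidom.parseString("<root>%s</root>" % fragment)
    return tuple(p.firstChild.data for p in doc.getElementsByTagName("p"))


well_formed = '<p data="5" size="4">item1</p><p size="4">item2</p><p size="4">item3</p>'
print(paragraph_texts(well_formed))  # ('item1', 'item2', 'item3')

# An unclosed <br> is fine for browsers but is not well-formed XML.
not_well_formed = '<p size="4">item1<br></p>'
try:
    paragraph_texts(not_well_formed)
    rejected = False
except ExpatError:
    rejected = True
print(rejected)  # True
```

The helper assumes every <p> starts with a text node; an empty <p></p> would need an extra check.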


 