Python Regexp Question

Question

I'm trying to match a line on a web page with a regular expression. The line looks like this:

<tr><td width=60 bgcolor='#ffffcc'><b>random Value</b></td><td align=center width=80>

This is what I tried, but it doesn't seem to work. Can anyone help me? 'htmlbody' contains the HTML page, and no, I didn't forget to import 're'.

reg = re.compile("<tr><td width=60 bgcolor='#ffffcc'><b>([^<]*)</b></td><td align=center width=80>")
value = reg.search(htmlbody)
print 'Value is', value

Answer 1

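There is no foolproof way to do this with a regular expression. See "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for the reasons. You need an HTML parser, such as HTMLParser: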

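#!/usr/bin/python

# The module is named HTMLParser in Python 2 and html.parser in Python 3.
try:
    from html.parser import HTMLParser
except ImportError:
    from HTMLParser import HTMLParser

class FindTDs(HTMLParser):
    """Print every piece of text that appears inside a <td> element."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.level = 0  # nesting depth of currently open <td> tags

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.level += 1

    def handle_endtag(self, tag):
        if tag == 'td':
            self.level -= 1

    def handle_data(self, data):
        # Only report text while at least one <td> is open.
        if self.level > 0:
            print(data)

find = FindTDs()

# Build a small 3x5 sample table whose cells contain "row.column".
html = "<table>\n"
for i in range(3):
    html += "\t<tr>"
    for j in range(5):
        html += "<td>%s.%s</td>" % (i, j)
    html += "</tr>\n"
html += "</table>"

find.feed(html)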
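
To connect this back to the row in the question, here is a minimal sketch (written for Python 3, where the module is html.parser) that collects only the text of a <b> element inside a <td>. The class name BoldCellText and its attributes are made up for illustration, and the sketch assumes the value you want is always wrapped in <b> within a table cell.

```python
from html.parser import HTMLParser


class BoldCellText(HTMLParser):
    """Collect the text of every <b> element that sits inside a <td>."""

    def __init__(self):
        super().__init__()
        self.in_td = False   # True while a <td> is open
        self.in_b = False    # True while a <b> inside a <td> is open
        self.values = []     # collected text fragments

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True
        elif tag == "b" and self.in_td:
            self.in_b = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False
            self.in_b = False
        elif tag == "b":
            self.in_b = False

    def handle_data(self, data):
        if self.in_b:
            self.values.append(data)


# The row from the question, including its unclosed second <td>.
htmlbody = ("<tr><td width=60 bgcolor='#ffffcc'><b>random Value</b></td>"
            "<td align=center width=80>")

parser = BoldCellText()
parser.feed(htmlbody)
parser.close()
print(parser.values)
```

Unlike the regular expression, this does not depend on the exact attribute order, quoting, or whitespace in the tags.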

Answer 2

This:

import re

htmlbody = "<tr><td width=60 bgcolor='#ffffcc'><b>random Value</b></td><td align=center width=80>"

reg = re.compile("<tr><td width=60 bgcolor='#ffffcc'><b>([^<]*)</b></td><td align=center width=80>")
value = reg.search(htmlbody).group(1)
print 'Value is', value

prints:

Value is random Value

Is that what you wanted?
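
One caveat with calling .group(1) directly: search returns None when the pattern does not match, so the call raises AttributeError on pages that lack the row. The following is a small sketch for Python 3 that guards against that case. The helper name extract_value is hypothetical, not something from the question.

```python
import re

# Same pattern as above, split across two raw strings for readability.
PATTERN = re.compile(
    r"<tr><td width=60 bgcolor='#ffffcc'><b>([^<]*)</b></td>"
    r"<td align=center width=80>"
)


def extract_value(htmlbody):
    """Return the captured cell text, or None when the row is absent."""
    match = PATTERN.search(htmlbody)
    return match.group(1) if match else None


row = ("<tr><td width=60 bgcolor='#ffffcc'><b>random Value</b></td>"
       "<td align=center width=80>")

print("Value is", extract_value(row))
print("Value is", extract_value("<p>no table here</p>"))
```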

Answer 3

It sounds like you may want to use findall instead of search:

reg = re.compile("<tr><td width=60 bgcolor='#ffffcc'><b>([^<]*)</b></td><td align=center width=80>")
value = reg.findall(htmlbody)
print 'Found %i match(es)' % len(value)

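I would caution you, though, that regular expressions are notoriously poor at handling HTML. You are better off using a proper parser, such as Python's built-in HTMLParser module.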
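
To illustrate the findall suggestion: when the pattern contains exactly one capturing group, findall returns a list of the captured strings instead of match objects. This sketch targets Python 3, and the two-row table is invented sample data, not the asker's page.

```python
import re

PATTERN = re.compile(
    r"<tr><td width=60 bgcolor='#ffffcc'><b>([^<]*)</b></td>"
    r"<td align=center width=80>"
)

# Invented sample input with two rows in the same shape as the question's row.
htmlbody = (
    "<table>"
    "<tr><td width=60 bgcolor='#ffffcc'><b>first</b></td>"
    "<td align=center width=80>1</td></tr>"
    "<tr><td width=60 bgcolor='#ffffcc'><b>second</b></td>"
    "<td align=center width=80>2</td></tr>"
    "</table>"
)

# One capturing group, so each list item is that group's text.
values = PATTERN.findall(htmlbody)
print("Found %i match(es): %s" % (len(values), values))
```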


3 answers

Answer 1
4 2009-04-17 23:22:47

Answer 2
1 2009-04-17 22:56:45

Answer 3
1 accepted 2009-04-17 23:26:50
