Python regular expressions - parsing HTML

Question

I have this small piece of code, and it raises AttributeError: 'NoneType' object has no attribute 'group'.

import sys
import re

#def extract_names(filename):

f = open('name.html', 'r')
text = f.read()

match = re.search (r'<hgroup><h1>(\w+)</h1>', text)
second = re.search (r'<li class="hover">Employees: <b>(\d+,\d+)</b></li>', text)  

outf = open('details.txt', 'a')
outf.write(match)
outf.close()

My intention is to read an .html file, look for the <h1> tag value and the number of employees, and append them to a file. But for some reason I can't seem to get it right. Your help is greatly appreciated.

Answer 1

You are using a regular expression, but matching HTML (or XML) with such expressions gets too complicated, too fast. Don't do that.

Use an HTML parser instead; Python has several to choose from:

ElementTree is part of the standard library
BeautifulSoup is a popular third-party library
lxml is a fast and feature-rich C-based library.

The latter two also handle malformed HTML quite gracefully, making decent sense of many a botched website.

ElementTree example:

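from xml.etree import ElementTree

# ElementTree needs well-formed markup (XHTML); it raises ParseError otherwise.
tree = ElementTree.parse('filename.html')
# './/h1' searches the whole tree; a bare 'h1' only matches direct children of the root.
for elem in tree.findall('.//h1'):
    print(ElementTree.tostring(elem, encoding='unicode'))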
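
To tie the ElementTree suggestion back to the question, here is a minimal sketch that pulls out both the <h1> text and the employee count. The SAMPLE markup and the extract_details helper are made up for illustration, modeled on the patterns in the question; the real name.html may be structured differently, and ElementTree will only accept it if it is well-formed.

```python
from xml.etree import ElementTree

# Hypothetical, well-formed sample modeled on the patterns in the question.
SAMPLE = """<html><body>
<hgroup><h1>Acme</h1></hgroup>
<ul><li class="hover">Employees: <b>12,345</b></li></ul>
</body></html>"""


def extract_details(markup):
    """Return (heading, employees); either value is None if not found."""
    root = ElementTree.fromstring(markup)

    # './/h1' looks for the first <h1> anywhere below the root.
    h1 = root.find('.//h1')
    name = h1.text if h1 is not None else None

    employees = None
    for li in root.iter('li'):
        # li.text is the text before the first child element, if any.
        if (li.text or '').strip().startswith('Employees:'):
            b = li.find('b')
            if b is not None:
                employees = b.text
    return name, employees


print(extract_details(SAMPLE))
```

Both values come back as plain strings (or None), so they can be passed straight to outf.write() once you have checked that they were found.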

Answer 2

Just for completeness: your error message simply indicates that your regular expression failed to match and returned nothing (None)...
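
A small sketch of what that means in practice, assuming the same <h1> pattern as in the question: re.search returns None when nothing matches, so the result has to be checked before calling .group(). The first_heading helper and the two input strings are hypothetical.

```python
import re

# Same pattern as in the question; \w+ will not match spaces or newlines.
PATTERN = re.compile(r'<hgroup><h1>(\w+)</h1>')


def first_heading(text):
    """Return the captured heading, or None if the pattern does not match."""
    match = PATTERN.search(text)
    if match is None:
        # Calling match.group(1) here would raise the AttributeError
        # from the question, because match is None.
        return None
    return match.group(1)


# Matches: the tags are adjacent and the heading is a single word.
print(first_heading('<hgroup><h1>Acme</h1></hgroup>'))

# Does not match: a newline between the tags and a space in the heading.
print(first_heading('<hgroup>\n<h1>Acme Corp</h1></hgroup>'))
```

The second call shows how easily such a pattern breaks on harmless whitespace, which is the reason a parser is the more robust choice.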

2 answers

Answer 1: 6, 2012-09-20 13:15:09
Answer 2: 1, accepted, 2012-09-20 15:35:34
