Unable to open an HTML file in Python

Question

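I am trying to count how many hyperlinks an HTML file contains. To do that, I want to read the HTML file in Python and search for every </a> closing anchor tag. However, when I pass the HTML file to my Python code, I get the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1819: ordinal not in range(128)

However, if I copy and paste the same text into a .txt file, my code works. My code is as follows: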

def links(filename):
    infile = open(filename)
    content = infile.read()
    infile.close()
    anchorTagEnd = content.count("</a>")
    return anchorTagEnd

print(links("DePaul CDM - College of Computing and Digital Media.html"))

Answer 1

Why not use an HTML parser to count the links in the HTML file?

Using BeautifulSoup:

from bs4 import BeautifulSoup

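def links(filename):
    # Open in binary mode so BeautifulSoup detects the encoding itself,
    # and name the parser explicitly to avoid the "no parser specified" warning.
    with open(filename, "rb") as infile:
        soup = BeautifulSoup(infile, "html.parser")
    return len(soup.find_all('a'))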

print(links("DePaul CDM - College of Computing and Digital Media.html"))

Using lxml.html:

import lxml.html

def links(filename):
    tree = lxml.html.parse(filename)
    # XPath count() returns a single float, not a list, so convert it instead of indexing it.
    return int(tree.xpath('count(//a)'))

print(links("DePaul CDM - College of Computing and Digital Media.html"))
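If you would rather keep the original string-counting approach, the error can usually be avoided by telling Python which encoding to use instead of relying on the ASCII default. Byte 0xe2 is not valid ASCII; it often turns out to be the first byte of a UTF-8 encoded character such as a typographic quote. The sketch below assumes the file is UTF-8, which may not be true for your page. The name count_links and the errors="replace" choice are illustrative, not part of the original code.

```python
import io


def count_links(filename, encoding="utf-8"):
    # io.open accepts an explicit encoding on both Python 2 and Python 3.
    # errors="replace" swaps undecodable bytes for U+FFFD instead of raising,
    # which is acceptable here because we only count ASCII "</a>" substrings.
    with io.open(filename, encoding=encoding, errors="replace") as infile:
        content = infile.read()
    return content.count("</a>")
```

Calling count_links("DePaul CDM - College of Computing and Digital Media.html") should then return the number of closing anchor tags. If the page uses a different encoding, pass it as the second argument.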

Answer 1
0 2015-02-03 04:56:29