Cannot open html file in Python

I am trying to count how many hyperlinks are in an HTML file. To do that, I want to read the HTML file in Python and search for all of the closing </a> anchor tags. However, when I try to read the HTML file in Python, I get this error:

"UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1819: ordinal not in range(128)"

However, if I copy and paste that same text into a txt file, then my code works. My code is as follows:

def links(filename):
    infile = open(filename)
    content = infile.read()
    infile.close()
    anchorTagEnd = content.count("</a>")
    return anchorTagEnd

print(links("DePaul CDM - College of Computing and Digital Media.html"))

Why not use an HTML parser to count the links inside the HTML file?

Using BeautifulSoup:

from bs4 import BeautifulSoup

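def links(filename):
    # Open in binary mode so BeautifulSoup detects the encoding itself
    with open(filename, "rb") as infile:
        soup = BeautifulSoup(infile, "html.parser")
    return len(soup.find_all('a'))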

print(links("DePaul CDM - College of Computing and Digital Media.html"))

Using lxml.html:

import lxml.html

def links(filename):
    tree = lxml.html.parse(filename)
    # XPath count() returns a float, not a list, so convert it directly
    return int(tree.xpath('count(//a)'))

print(links("DePaul CDM - College of Computing and Digital Media.html"))
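
If you would rather keep the original string-counting approach, the error itself can probably be avoided by telling open() which encoding to use. On Python 3, open() decodes text with the platform's default encoding, which appears to be ASCII in your environment, and byte 0xe2 is often the first byte of a UTF-8 encoded character such as a curly quote or a dash. The sketch below assumes the file is UTF-8; the encoding parameter, the errors="replace" setting and the case-insensitive count are my additions, not part of the original code.

```python
def links(filename, encoding="utf-8"):
    # Assumption: the file is UTF-8 encoded; pass another encoding if it is not.
    # errors="replace" substitutes U+FFFD for undecodable bytes instead of raising.
    with open(filename, encoding=encoding, errors="replace") as infile:
        content = infile.read()
    # Count closing anchor tags, ignoring case (so </A> is counted too).
    return content.lower().count("</a>")
```

Calling links("DePaul CDM - College of Computing and Digital Media.html") should then return a count instead of raising UnicodeDecodeError. Note that counting "</a>" substrings also matches text inside comments or scripts, which is one reason a parser is the more robust choice.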
