I am trying to count how many hyperlinks are in an HTML file. To do that, I want to read the HTML file in Python and search for all of the closing </a>
anchor tags. However, when I try to read the HTML file in Python, I get an error that reads:
"UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1819: ordinal not in range(128)"
If I copy and paste the same text into a .txt file, my code works. My code is as follows:
def links(filename):
    infile = open(filename)
    content = infile.read()
    infile.close()
    anchorTagEnd = content.count("</a>")
    return anchorTagEnd

print(links("DePaul CDM - College of Computing and Digital Media.html"))
Why not use an HTML parser to count the links in the HTML file?
Using BeautifulSoup:
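from bs4 import BeautifulSoup

def links(filename):
    # Open in binary mode so BeautifulSoup detects the encoding itself
    with open(filename, "rb") as infile:
        soup = BeautifulSoup(infile, "html.parser")
    return len(soup.find_all('a'))

print(links("DePaul CDM - College of Computing and Digital Media.html"))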
Using lxml.html:
import lxml.html

def links(filename):
    tree = lxml.html.parse(filename)
    # XPath count() returns a float, so convert it to an int
    return int(tree.xpath('count(//a)'))

print(links("DePaul CDM - College of Computing and Digital Media.html"))
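As for the error itself: byte 0xe2 is the lead byte of many three-byte UTF-8 sequences (curly quotes, for example), so the page most likely contains non-ASCII characters that the default 'ascii' codec cannot decode. If you would rather not install a third-party library, here is a minimal sketch using only the standard library (Python 3). It assumes the file is UTF-8, which is a guess on my part, and the names AnchorCounter and count_links are just ones I made up for illustration:

```python
from html.parser import HTMLParser


class AnchorCounter(HTMLParser):
    """Counts the opening <a> tags seen while parsing."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this hook
        if tag == "a":
            self.count += 1


def count_links(html_text):
    """Return the number of <a> tags in a string of HTML."""
    parser = AnchorCounter()
    parser.feed(html_text)
    parser.close()
    return parser.count


def links(filename, encoding="utf-8"):
    # Assumption: the file is UTF-8. errors="replace" keeps a stray bad
    # byte from raising UnicodeDecodeError; it does not affect the tag count.
    with open(filename, encoding=encoding, errors="replace") as infile:
        return count_links(infile.read())
```

You would call it the same way as before, e.g. links("DePaul CDM - College of Computing and Digital Media.html"). If the page turns out to use a different encoding, pass it as the second argument.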