Cannot open html file in Python

Question

I am trying to count how many hyperlinks are in an HTML file. To do that, I want to read the file in Python and count the closing </a> tags. However, when I try to read the HTML file in Python, I get this error:

"UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1819: ordinal not in range(128)"

However, if I copy and paste that same text into a txt file, then my code works. My code is as follows:

def links(filename):
    infile = open(filename)
    content = infile.read()
    infile.close()
    anchorTagEnd = content.count("</a>")
    return anchorTagEnd

print(links("DePaul CDM - College of Computing and Digital Media.html"))

Answer 1

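Why not use an HTML parser to count the links in the HTML file? A parser also counts anchors reliably, whereas a text search for "</a>" misses variations such as "</A>".

Using BeautifulSoup: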

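from bs4 import BeautifulSoup

def links(filename):
    # Open in binary mode so BeautifulSoup detects the encoding itself
    # instead of Python decoding the file with its default codec.
    with open(filename, "rb") as infile:
        soup = BeautifulSoup(infile, "html.parser")
    return len(soup.find_all("a"))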

print(links("DePaul CDM - College of Computing and Digital Media.html"))

Using lxml.html:

import lxml.html

def links(filename):
    tree = lxml.html.parse(filename)
    # XPath count() returns a single float, so convert it instead of indexing it.
    return int(tree.xpath("count(//a)"))

print(links("DePaul CDM - College of Computing and Digital Media.html"))
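
If you would rather keep the plain text count from the question, here is a minimal sketch. It assumes the file is UTF-8 encoded: byte 0xe2 is commonly the first byte of a multi-byte UTF-8 character such as a curly quote, but that is only a guess about this particular file. The encoding parameter and the lowercasing step are additions that are not in the original code.

```python
def links(filename, encoding="utf-8"):
    # Assumption: the file is UTF-8; pass another encoding if it is not.
    # errors="replace" substitutes U+FFFD for undecodable bytes instead of raising.
    with open(filename, encoding=encoding, errors="replace") as infile:
        content = infile.read()
    # Lowercase first so that "</A>" is counted as well as "</a>".
    return content.lower().count("</a>")
```

Called as links("DePaul CDM - College of Computing and Digital Media.html"), this should avoid the UnicodeDecodeError because the file is no longer decoded with the default ascii codec.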

solution1
0 2015-02-03 04:56:29