
BeautifulSoup incorrectly parses page and doesn't find links

Here is some simple code in Python 2.7.2 that fetches a site and gets all of the links on it:

import urllib2
from bs4 import BeautifulSoup

def getAllLinks(url):
    response = urllib2.urlopen(url)
    content = response.read()
    soup = BeautifulSoup(content, "html5lib")
    return soup.find_all("a")

links1 = getAllLinks('http://www.stanford.edu')
links2 = getAllLinks('http://med.stanford.edu/')

print len(links1)
print len(links2)

The problem is that it doesn't work in the second case: it prints 102 and 0, even though there are clearly links on the second site. BeautifulSoup doesn't throw any parsing errors and it pretty-prints the markup fine. I suspect the cause may be the first line of the med.stanford.edu source, which claims the document is XML (even though the Content-Type is text/html):

<?xml version="1.0" encoding="iso-8859-1"?>

I can't figure out how to make BeautifulSoup disregard it, or how to work around it. I'm using html5lib as the parser because I had problems with the default one (incorrect markup).

When a document claims to be XML, I find the lxml parser gives the best results. Trying your code but using the lxml parser instead of html5lib finds the 300 links.
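Assuming lxml is installed, the only change needed is the parser argument. A minimal sketch of the question's function with that swap:

import urllib2
from bs4 import BeautifulSoup

def getAllLinks(url):
    response = urllib2.urlopen(url)
    content = response.read()
    # "lxml" instead of "html5lib"; it copes with the leading <?xml ...?> declaration
    soup = BeautifulSoup(content, "lxml")
    return soup.find_all("a")

print len(getAllLinks('http://med.stanford.edu/'))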

You are precisely right that the problem is the <?xml... line. Disregarding it is very simple: just skip the first line of content, by replacing

    content = response.read()

with something like

    content = "\n".join(response.readlines()[1:])

With this change, len(links2) becomes 300.

ETA: You probably want to do this conditionally, so you don't always skip the first line of content. An example would be something like:

content = response.read()
if content.startswith("<?xml"):
    content = "\n".join(content.split("\n")[1:])
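Integrated into the question's function (still using html5lib, as in the original code), the workaround would look roughly like this:

def getAllLinks(url):
    response = urllib2.urlopen(url)
    content = response.read()
    # strip a leading <?xml ...?> declaration so the content is treated as HTML
    if content.startswith("<?xml"):
        content = "\n".join(content.split("\n")[1:])
    soup = BeautifulSoup(content, "html5lib")
    return soup.find_all("a")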
