
Beautiful Soup doesn't 'get' full webpage

I am using BeautifulSoup to parse a bunch of links from this page, but it isn't extracting all the links I want it to. To figure out why, I downloaded the HTML to "web_page.html" and ran:

from bs4 import BeautifulSoup

soup = BeautifulSoup(open("web_page.html"))
print(soup.get_text())

I noticed that it doesn't print the whole web page; it stops at Brackley. I looked at the HTML to see if something weird was happening around 'Brackley', but I couldn't find anything. Also, if I move another link into Brackley's place, it prints that one instead and still stops there. It seems like it will only read an HTML file up to a certain size?

I'm not sure how you got the page and links, but here is what I did, and it got all the links, starting from "Canada" and ending with "Taloyoak, HAM":

from bs4 import BeautifulSoup
import requests

url = 'http://www12.statcan.gc.ca/census-recensement/2006/dp-pd/tbt/Geo-index-eng.cfm?TABID=5&LANG=E&APATH=3&DETAIL=0&DIM=0&FL=A&FREE=0&GC=0&GID=0&GK=0&GRP=1&PID=99015&PRID=0&PTYPE=88971,97154&S=0&SHOWALL=0&SUB=0&Temporal=2006&THEME=70&VID=0&VNAMEE=&VNAMEF=&D1=0&D2=0&D3=0&D4=0&D5=0&D6=0'
response = requests.get(url)

soup = BeautifulSoup(response.content)
print([a.text for a in soup.select('div.span-8 ol li a')])

Prints:

[
    u'Canada', 
    u'Newfoundland and Labrador / Terre-Neuve-et-Labrador',
    ...
    u'Gjoa Haven, HAM', 
    u'Taloyoak, HAM'
]

FYI, div.span-8 ol li a is a CSS selector.
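
Not part of the original answer, but if you also need the link targets and not just the text, a minimal sketch (assuming the page keeps the same div.span-8 ol li a structure) could look like this:

from bs4 import BeautifulSoup
import requests

url = 'http://www12.statcan.gc.ca/census-recensement/2006/dp-pd/tbt/Geo-index-eng.cfm?TABID=5&LANG=E&APATH=3&DETAIL=0&DIM=0&FL=A&FREE=0&GC=0&GID=0&GK=0&GRP=1&PID=99015&PRID=0&PTYPE=88971,97154&S=0&SHOWALL=0&SUB=0&Temporal=2006&THEME=70&VID=0&VNAMEE=&VNAMEF=&D1=0&D2=0&D3=0&D4=0&D5=0&D6=0'

# fetch the page and parse it; select() takes a CSS selector and returns the matching tags
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for a in soup.select('div.span-8 ol li a'):
    print((a.text, a.get('href')))  # link text and its target, as a tuple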

Try using a different parser. You are not specifying one, so you are probably using the default html.parser. Try lxml or html5lib instead: parsers differ in how they handle broken markup, and a more lenient parser like html5lib will often recover content that the default parser silently drops.

For more info: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
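
As a rough illustration (not from the original answer), this is what passing a parser explicitly looks like; lxml and html5lib are separate installs (pip install lxml html5lib), and comparing how much text each one extracts from the saved file is a quick way to confirm that the parser is what's truncating the output:

from bs4 import BeautifulSoup

html = open("web_page.html").read()

# the second argument names the parser BeautifulSoup should use
soup_builtin = BeautifulSoup(html, "html.parser")   # Python's built-in parser
soup_lxml = BeautifulSoup(html, "lxml")             # fast, reasonably lenient
soup_html5lib = BeautifulSoup(html, "html5lib")     # parses like a browser, most forgiving

# if the built-in parser is choking on broken markup, its text will be much shorter
print(len(soup_builtin.get_text()))
print(len(soup_lxml.get_text()))
print(len(soup_html5lib.get_text()))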
