Beautiful Soup doesn't 'get' full webpage
I am using BeautifulSoup to parse a bunch of links from this page, but it wasn't extracting all the links I wanted. To figure out why, I downloaded the HTML to "web_page.html" and ran:
soup = BeautifulSoup(open("web_page.html"))
print soup.get_text()
I noticed that it doesn't print the whole web page: the output ends at Brackley. I looked at the HTML to see if something weird was happening at 'Brackley', but I couldn't find anything. Also, if I move another link to Brackley's place, it prints that link instead of Brackley. Does it only read HTML files up to a certain size?
I'm not sure how you fetched the page and extracted the links. Here is what I did, and I got all the links, starting with "Canada" and ending with "Taloyoak, HAM":
from bs4 import BeautifulSoup
import requests
url = 'http://www12.statcan.gc.ca/census-recensement/2006/dp-pd/tbt/Geo-index-eng.cfm?TABID=5&LANG=E&APATH=3&DETAIL=0&DIM=0&FL=A&FREE=0&GC=0&GID=0&GK=0&GRP=1&PID=99015&PRID=0&PTYPE=88971,97154&S=0&SHOWALL=0&SUB=0&Temporal=2006&THEME=70&VID=0&VNAMEE=&VNAMEF=&D1=0&D2=0&D3=0&D4=0&D5=0&D6=0'
response = requests.get(url)
soup = BeautifulSoup(response.content)
print [a.text for a in soup.select('div.span-8 ol li a')]
Prints:
[
u'Canada',
u'Newfoundland and Labrador / Terre-Neuve-et-Labrador',
...
u'Gjoa Haven, HAM',
u'Taloyoak, HAM'
]
FYI, div.span-8 ol li a is a CSS selector.
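To make the selector concrete, here is a minimal sketch of what div.span-8 ol li a matches. The markup below is made up for illustration (it is not the real StatCan page), and the code is written for Python 3 with an explicit parser:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, only loosely modelled on the page in the question.
html = """
<div class="span-8">
  <ol>
    <li><a href="#1">Canada</a></li>
    <li><a href="#2">Taloyoak, HAM</a></li>
  </ol>
</div>
<div class="other">
  <ol><li><a href="#3">Not matched</a></li></ol>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Matches every <a> inside an <li> inside an <ol> inside <div class="span-8">.
names = [a.get_text() for a in soup.select("div.span-8 ol li a")]
print(names)
```

The link in the second div is skipped because its class is not span-8.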
Try using a different parser. You are not specifying one, so Beautiful Soup chooses for you, and you are probably getting the built-in html.parser. Try using lxml or html5lib.
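One way to check whether the parser is the culprit is to parse the same HTML with each parser you have installed and compare how many links each one finds. This is only a sketch for Python 3: count_links_per_parser is a helper name I made up, and the sample string stands in for the contents of your "web_page.html".

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_links_per_parser(html, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: number of <a> tags found}.

    Parsers that are not installed are skipped instead of raising.
    """
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # lxml and html5lib are optional third-party packages.
            continue
        counts[name] = len(soup.find_all("a"))
    return counts


# Stand-in for the downloaded page; replace with open("web_page.html").read().
sample = "<ol><li><a href='/a'>Canada</a></li><li><a href='/b'>Brackley</a></li></ol>"
counts = count_links_per_parser(sample)
print(counts)
```

If the counts differ on the real page, the parser reporting fewer links is most likely giving up on some malformed markup, and passing one of the others explicitly, e.g. BeautifulSoup(html, "lxml"), should help.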
For more info: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser