
Python: urlopen not downloading the entire site

Greetings,

I have done:

import urllib

site = urllib.urlopen('http://www.weather.com/weather/today/Temple+TX+76504')
site_data = site.read()
site.close()

but the result doesn't match what I see when I view the page source in Firefox.

I suspected the user agent and did this:

class AppURLopener(urllib.FancyURLopener):
    version = "Mozilla/5.0 (X11; U; Linux i686; zh-CN; rv:1.9.2.8) Gecko/20100722 Ubuntu/10.04 (lucid) Firefox/3.6.8"

urllib._urlopener = AppURLopener()

and downloaded it again, but it still doesn't download the whole page.

Can someone please help me do user agent switching, if that is the likely culprit?

Thanks, Narnie

It's more likely that there is an iframe in the code or that JavaScript is modifying the DOM. If there's an iframe, you'll have to parse the page to get the URL for the iframe, or just do it manually if it's a one-off. If it's JavaScript, I hear that selenium-rc is good, but I have no first-hand experience with it.
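To illustrate the iframe case, here is a minimal sketch that parses a downloaded page and lists the URLs its iframes point to. It assumes Python 3 and the standard-library `html.parser` (the question's code is Python 2); the names `IframeCollector` and `find_iframe_urls` are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeCollector(HTMLParser):
    """Collect the src attribute of every <iframe> tag (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.iframe_urls = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag names, so "iframe" also matches <IFRAME>.
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative src values against the page's own URL.
                self.iframe_urls.append(urljoin(self.base_url, src))


def find_iframe_urls(page_html, base_url):
    """Return absolute URLs for all iframes found in page_html."""
    collector = IframeCollector(base_url)
    collector.feed(page_html)
    return collector.iframe_urls
```

Each returned URL could then be fetched separately, the same way as the main page. This will not help if the missing content is produced by JavaScript, since nothing here executes scripts.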

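A downloaded page displayed locally may look different for several reasons, such as relative links (which can be fixed by adding, for example, <base href="http://www.weather.com/today/"> to the page's head element) or non-functional Ajax requests (see ways to circumvent the same-origin policy).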
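The base-element fix can be sketched roughly as follows, assuming Python 3 and a saved page that has an opening head tag. The function name `add_base_href` is hypothetical, and a regular expression is used only to keep the example short; a real HTML parser would be more robust.

```python
import re
from html import escape


def add_base_href(page_html, base_url):
    """Insert <base href=...> right after the first opening <head> tag.

    If the page has no <head> tag, it is returned unchanged.
    """
    base_tag = '<base href="%s">' % escape(base_url, quote=True)
    # \b keeps the pattern from matching <header>; count=1 touches only
    # the first match. A function is used as the replacement so that
    # backslashes in base_url are never treated as regex escapes.
    return re.sub(
        r"<head\b[^>]*>",
        lambda match: match.group(0) + base_tag,
        page_html,
        count=1,
        flags=re.IGNORECASE,
    )
```

With the tag in place, a browser opening the saved file resolves relative links (images, stylesheets, scripts) against the original site instead of the local directory. It does not fix Ajax requests that the same-origin policy blocks.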
