Python: urllib urlopen stuck, timeout error
As the title says, urlopen gets stuck opening the URL.
Code:
from bs4 import BeautifulSoup as soup # HTML data structure
from urllib.request import urlopen as uReq # Web client
page_url = "https://store.hp.com/us/en/pdp/hp-laserjet-pro-m404n?jumpid=ma_weekly-deals_product-tile_printers_3_w1a52a_hp-laserjet-pro-m404"
uClient = uReq(page_url)
# parses html into a soup data structure to traverse html
# as if it were a json data type.
page_soup = soup(uClient.read(), "html.parser")
uClient.close()
print(page_soup)
Problem: it hangs at uReq. However, if you replace page_url with the following link, everything works fine.
page_url= "http://www.newegg.com/Product/ProductList.aspx?Submit=ENE&N=-1&IsNodeId=1&Description=GTX&bop=And&Page=1&PageSize=36&order=BESTMATCH"
Error: timeout error
How can I open the given URL for web-scraping purposes?
Answer
Some websites require a User-Agent header in order to serve a successful response. Import Request from urllib.request and modify your code as follows:
from bs4 import BeautifulSoup as soup # HTML data structure
from urllib.request import urlopen as uReq, Request # Web client
page_url = "https://store.hp.com/us/en/pdp/hp-laserjet-pro-m404n?jumpid=ma_weekly-deals_product-tile_printers_3_w1a52a_hp-laserjet-pro-m404"
uClient = uReq(Request(page_url, headers={
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0'
}))
# parses html into a soup data structure to traverse html
# as if it were a json data type.
page_soup = soup(uClient.read(), "html.parser")
uClient.close()
print(page_soup)
and you should be fine.
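If a server still stalls even with the header set, you can also pass an explicit timeout to urlopen so the call raises an error instead of blocking indefinitely. A minimal sketch, assuming an illustrative 10-second timeout (the URL and User-Agent string are taken from the question; the network call is left commented out):

```python
from urllib.request import Request, urlopen

# Build the request with the User-Agent header attached up front.
req = Request(
    "https://store.hp.com/us/en/pdp/hp-laserjet-pro-m404n",
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:73.0) "
                      "Gecko/20100101 Firefox/73.0"
    },
)

# urllib.request.Request stores header keys in capitalized form,
# so we look it up as "User-agent" to confirm it was set.
print(req.get_header("User-agent"))

# Network call (commented out here): urlopen accepts a timeout in
# seconds and raises URLError / socket.timeout instead of hanging.
# page_html = urlopen(req, timeout=10).read()
```

The timeout value is a judgment call: too short and slow-but-working servers fail, too long and a hung request still wastes time.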