Python Scrape urllib2 HTTP Errors

I am trying to scrape a site, but my code only works if I have the site open and then refresh it. I have tried multiple things and keep running into the following two errors. The first: `HTTPError: HTTP Error 416: Requested Range Not Satisfiable`

urlslist = open("list_urls.txt").read()
urlslist = urlslist.split("\n")
for urlslist in urlslist:
    htmltext = urllib2.urlopen("www..." + urlslist)
    data = json.load(htmltext)

I have also tried using some headers and such, but then I get the error `ValueError: No JSON object could be decoded`:

req = urllib2.Request('https://www...')
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1)AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36')

htmltext = urllib2.urlopen(req)
data = json.load(htmltext)

I am stumped, any help?

When you request a URL, you need to include the "http(s)://" part as well. Assuming that your text file contains only the "name.com" part of each URL (e.g. google.com instead of https://www.google.com), this is the code you need:

htmltext = urllib2.urlopen("https://www." + urlslist)

If the URL is stubhub.com (as you mentioned in your comment), you don't need the "s". It would be this instead:

htmltext = urllib2.urlopen("http://www." + urlslist)
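To avoid hard-coding the scheme for each site, you could normalize every entry from the text file before opening it. Here's a minimal sketch; the `with_scheme` helper name is my own, and it assumes the file contains bare hosts like `stubhub.com` (entries that already carry a scheme are passed through unchanged):

```python
def with_scheme(url, scheme="http"):
    """Prepend a scheme and 'www.' to a bare host like 'stubhub.com',
    unless the entry already starts with http:// or https://."""
    if url.startswith(("http://", "https://")):
        return url
    return "%s://www.%s" % (scheme, url)
```

You can then write `urllib2.urlopen(with_scheme(urlslist))` inside the loop, and the same list file works whether or not its entries include a scheme.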

The JSON error may simply be due to the fact that there is no JSON to load: a server that returns an HTML error page (or anything other than JSON) will make `json.load` raise that `ValueError`. You'll need to take a look at the developer panel and make sure that JSON-format responses are actually being returned.
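One way to make that failure mode explicit is to attempt the parse and fall back gracefully when the body is not JSON. This is a sketch of my own (not from the original answer); it catches `ValueError`, which is what `json.load`/`json.loads` raise on non-JSON input in Python 2 (and which `json.JSONDecodeError` subclasses in Python 3):

```python
import json

def parse_json_or_none(body):
    """Return the parsed JSON object, or None if the body is not
    valid JSON (e.g. an HTML error page from the server)."""
    try:
        return json.loads(body)
    except ValueError:
        return None
```

With this helper you can log and skip the URLs that came back as HTML instead of crashing the whole loop.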
