Using urllib.request.urlopen returns blank, but in Chrome I can see there is data in the Response
This is my first question, and I would appreciate any hints you could provide.
I am developing a spider in Python to crawl odds entries from a website. On that site, an onclick event pops up a window showing the change of odds. I checked in Chrome that it requests the URL "http://odds.500.com/fenxi1/inc/yazhiajax.php?fid=554629&id=3&t=" + str(t) + "&r=1", where t is JavaScript's (new Date()).getTime(). I can see the odds change in Chrome's Preview and Response tabs.
However, when I run the code below to fetch the data, it returns blank. And when I navigate to the URL directly in Chrome, it also shows blank.
import time
import urllib.request

# getTime simulates JavaScript's (new Date()).getTime(): milliseconds since the epoch
def getTime():
    return int(time.time() * 1000)

user_agent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36"
referer = "http://odds.500.com/fenxi/yazhi-554629.shtml"
headers = {'User-Agent': user_agent, 'Referer': referer}

t = getTime()
url_str = "http://odds.500.com/fenxi1/inc/yazhiajax.php?fid=554629&id=3&t=" + str(t) + "&r=1"
print(url_str)

req = urllib.request.Request(url_str, headers=headers)
response = urllib.request.urlopen(req).read()
print(response)
An HTTP request does not consist only of the URL. In Chrome's Developer Tools (Ctrl+Shift+I) you can see all requests and responses in the Network tab.
I opened your referer URL (http://odds.500.com/fenxi/yazhi-554629.shtml) in Chrome and clicked one of the items in the "盘" column, which I believe is what you are trying to mimic. It sent a request with many cookies. That is probably your problem.
You should probably make your crawler open the first URL, gather all the cookies, and then make the second request with those cookies. That might be a challenge, though, depending on what you have to do to gather the cookies.
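A minimal sketch of that two-step flow using the standard library's http.cookiejar (the URLs are the ones from the question; the actual network calls are shown only as comments, and note that cookies set by JavaScript on the page would not be captured this way):

```python
import http.cookiejar
import urllib.request

# An opener built with HTTPCookieProcessor stores every Set-Cookie it
# receives in the jar and sends the cookies back on later requests.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

referer = "http://odds.500.com/fenxi/yazhi-554629.shtml"
ajax_url = "http://odds.500.com/fenxi1/inc/yazhiajax.php?fid=554629&id=3&t=1449930953112&r=1"

# Step 1 (network): opener.open(referer) would fetch the page and fill `jar`.
# Step 2 (network): opener.open(ajax_req) would then send the gathered
# cookies automatically.
ajax_req = urllib.request.Request(ajax_url, headers={"Referer": referer})

print(len(jar))  # no requests made yet, so the jar is still empty
```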
Note also that the response is not a complete HTML document: it is a JSON list of HTML segments.
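Since the payload is a JSON array of HTML fragments, the decoded text can be split into segments with the standard json module (the fragment contents below are made up for illustration):

```python
import json

# A stand-in for the decoded response body: a JSON list of HTML segments.
text = '["<table><tr><td>0.95</td></tr></table>", "<div>odds history</div>"]'

segments = json.loads(text)  # a Python list of HTML strings
for seg in segments:
    print(seg)
```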
EDIT: Found the answer
I had to check the headers again. If you just add this header, you will get a response:
X-Requested-With: XMLHttpRequest
So:
url = 'http://odds.500.com/fenxi1/inc/yazhiajax.php?fid=554629&id=3&t=1449930953112&r=1'
headers = {'X-Requested-With': 'XMLHttpRequest'}
urllib.request.urlopen(urllib.request.Request(url, headers=headers)).read()
It returns some binary data... I'll leave decoding that to you. Hint: the response headers say
Content-Encoding: gzip
, so it's compressed...
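Decoding is straightforward with the standard gzip module. Since the site may no longer serve this data, the sketch below compresses a sample payload locally and then reverses it, exactly as you would with the bytes from urlopen(req).read():

```python
import gzip

# Stand-in for the raw response bytes when Content-Encoding is gzip.
raw = gzip.compress('["<div>sample odds markup</div>"]'.encode("utf-8"))

# Reverse the encoding to recover the text payload.
body = gzip.decompress(raw).decode("utf-8")
print(body)  # → ["<div>sample odds markup</div>"]
```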
Using the proper headers should resolve the issue. Try the headers below:
headers = {
'Host': "odds.500.com",
'User-Agent': "Mozilla/5.0 (Windows NT 6.1; Win64; x64)",
'Accept': "application/json, text/javascript, */*; q=0.01",
'Accept-Language': "de,en-US;q=0.7,en;q=0.3",
'X-Requested-With': "XMLHttpRequest",
'Referer': url,
'Connection': "keep-alive"
}
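A sketch of attaching such a header dict to a urllib request (the User-Agent is shortened for brevity; urllib normalizes stored header names with str.capitalize(), which the last line demonstrates):

```python
import urllib.request

url = "http://odds.500.com/fenxi1/inc/yazhiajax.php?fid=554629&id=3&t=1449930953112&r=1"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64)",
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "X-Requested-With": "XMLHttpRequest",
    "Referer": "http://odds.500.com/fenxi/yazhi-554629.shtml",
}
req = urllib.request.Request(url, headers=headers)

# urllib stores header names via str.capitalize(), so the key becomes
# 'X-requested-with' internally; urlopen(req) would then send them all.
print(dict(req.header_items())["X-requested-with"])  # → XMLHttpRequest
```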