Obtaining HTML source code with Python cookie
import urllib
# my URL is stored in the variable url
htmlfile = urllib.urlopen(url)
htmltext = htmlfile.read()
print(htmltext)
I'm trying to get the source code from a URL.
I do get source code back, but it comes from a different page that says two things: "please enable cookies" and "this domain has banned your access based on your browser's signature".
Does anyone know of a way to get the source code when the site can tell that you're not actually viewing the page in a browser?
You may have to set up a URL opener:
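import cookielib
import urllib2

def createOpener(self):
    handlers = []
    # Cookie jar so that cookies set by the server are sent back on later requests
    cj = MyCookieJar()
    cj.set_policy(cookielib.DefaultCookiePolicy(rfc2965=True))
    cjhdr = urllib2.HTTPCookieProcessor(cj)
    handlers.append(cjhdr)
    opener = urllib2.build_opener(*handlers)
    # Present a browser-like User-Agent; the Host value should match the site being requested
    opener.addheaders = [('User-Agent', self.getUserAgent()),
                         ('Host', 'google.com')]
    return opener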
where the cookie jar is:
class MyCookieJar(cookielib.CookieJar):
    def _cookie_from_cookie_tuple(self, tup, request):
        name, value, standard, rest = tup
        version = standard.get('version', None)
        if version is not None:
            # Strip the quotes some servers put around the version attribute
            version = version.replace('"', '')
            standard["version"] = version
        return cookielib.CookieJar._cookie_from_cookie_tuple(self, tup, request)
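To see what the cookie jar contributes, here is a small offline sketch. It is written for Python 3, where `cookielib` became `http.cookiejar` and `urllib2` became `urllib.request`. The `FakeResponse` class, the `example.com` URLs, and the `sid=abc123` cookie are all made up for illustration; a real server would supply the `Set-Cookie` header itself.

```python
import email.message
import http.cookiejar
import urllib.request


class FakeResponse:
    """Hypothetical stand-in for a server response carrying one Set-Cookie header."""

    def __init__(self, set_cookie):
        self._headers = email.message.Message()
        self._headers["Set-Cookie"] = set_cookie

    def info(self):
        # CookieJar.extract_cookies() reads the headers through info()
        return self._headers


jar = http.cookiejar.CookieJar()

# First request: the (fake) server hands out a session cookie
first = urllib.request.Request("http://example.com/")
jar.extract_cookies(FakeResponse("sid=abc123; Path=/"), first)

# Second request: the jar attaches the stored cookie automatically
second = urllib.request.Request("http://example.com/page")
jar.add_cookie_header(second)
cookie_header = second.get_header("Cookie")
print(cookie_header)
```

When the jar is wrapped in an `HTTPCookieProcessor`, these two steps happen behind the scenes on every request made through the opener, which is what a site asking you to "enable cookies" expects.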
At this point you create the opener and fetch the data by reading from the URL handle, like this:
def fetchURL(self, url, data=None, headers=None):
    # Avoid a mutable default argument; fall back to no extra headers
    request = urllib2.Request(url, data, headers or {})
    self.opener = self.createOpener()
    urlHandle = self.opener.open(request)
    return urlHandle.read()
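For readers on Python 3, the same idea can be sketched with the standard library's renamed modules. This is only a rough equivalent under stated assumptions, not the answer's exact code: it uses a plain `CookieJar` instead of `MyCookieJar`, and the names `create_opener`, `fetch_url`, and `DEFAULT_USER_AGENT` (including its value) are hypothetical.

```python
import http.cookiejar
import urllib.request

# Hypothetical placeholder; in practice use a real browser's User-Agent string
DEFAULT_USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) ExampleFetcher/0.1"


def create_opener(user_agent=DEFAULT_USER_AGENT):
    """Build an opener that stores cookies and sends a User-Agent header."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


def fetch_url(opener, url, data=None, headers=None):
    """Fetch url through the opener and return the raw response body as bytes."""
    request = urllib.request.Request(url, data, headers or {})
    with opener.open(request) as handle:
        return handle.read()
```

Creating the opener once and reusing it for every call to `fetch_url` lets cookies set by an earlier response be sent on later requests.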
It's a good idea to keep a list of User-Agent strings and read from it:
# ffpath is the path to a text file with one User-Agent string per line
with open(ffpath) as f:
    USER_AGENTS_LIST = f.read().splitlines()
and pick a random one from it:
import random

# random.choice picks one entry uniformly, same as indexing with randint
uA = random.choice(USER_AGENTS_LIST)
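The two steps above can be combined into a small self-contained sketch. The helper names `load_user_agents` and `pick_user_agent` are hypothetical, and the demo writes a throwaway file with two made-up agent strings so it runs without any existing list; blank lines in the file are skipped.

```python
import os
import random
import tempfile


def load_user_agents(path):
    """Read one User-Agent string per line, ignoring blank lines."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


def pick_user_agent(user_agents):
    """Return one User-Agent string chosen uniformly at random."""
    return random.choice(user_agents)


# Demo with a temporary file; the two strings are made-up placeholders
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("ExampleAgent/1.0\n\nExampleAgent/2.0\n")

agents = load_user_agents(tmp.name)
os.remove(tmp.name)

chosen = pick_user_agent(agents)
print(chosen)
```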
For a list of user agents, take a look here.
This is just to give an idea of how to do this without any external framework.