Obtaining HTML source code with Python and cookies

    import urllib

    # my URL is stored here in the variable url

    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    print(htmltext)

I'm trying to get the source code from a URL.

I do get source code, but it is from a different page that says two things: "please enable cookies" and "this domain has banned your access based on your browser's signature".

Does anyone know of a way to get the source code when the site can tell that you're not actually on the page in a browser?

You may have to set up a URL opener (the code below is Python 2, using urllib2 and cookielib):

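    import cookielib
    import urllib2

    def createOpener(self):
        handlers = []
        # Cookie jar that stores cookies received through this opener.
        cj = MyCookieJar()
        cj.set_policy(cookielib.DefaultCookiePolicy(rfc2965=True))
        cjhdr = urllib2.HTTPCookieProcessor(cj)
        handlers.append(cjhdr)
        opener = urllib2.build_opener(*handlers)
        # Headers sent with every request; the Host value should match the target site.
        opener.addheaders = [('User-Agent', self.getUserAgent()),
                             ('Host', 'google.com')]
        return opener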

where the cookie jar is:

    class MyCookieJar(cookielib.CookieJar):
        def _cookie_from_cookie_tuple(self, tup, request):
            name, value, standard, rest = tup
            version = standard.get('version', None)
            if version is not None:
                # Strip double quotes from the version attribute before the
                # base class parses it.
                version = version.replace('"', '')
                standard["version"] = version
            return cookielib.CookieJar._cookie_from_cookie_tuple(self, tup, request)

At this point you create the opener and fetch the data by reading from the URL handle, like this:

    def fetchURL(self, url, data=None, headers=None):
        # Avoid a mutable default argument; fall back to no extra headers.
        request = urllib2.Request(url, data, headers or {})
        self.opener = self.createOpener()
        urlHandle = self.opener.open(request)
        return urlHandle.read()
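
For readers on Python 3, the same idea can be sketched with `urllib.request` and `http.cookiejar`, which replace Python 2's `urllib2` and `cookielib`. This is only a minimal sketch under assumptions: the names `create_opener`, `fetch_url`, and `EXAMPLE_USER_AGENT` are hypothetical and not from the answer, it uses the standard `CookieJar` rather than the `MyCookieJar` subclass, and it omits the `Host` header. It may still not be enough for a site that fingerprints clients in other ways.

```python
import urllib.request
from http.cookiejar import CookieJar

# Hypothetical placeholder value; substitute a User-Agent string of your own.
EXAMPLE_USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0"


def create_opener(user_agent=EXAMPLE_USER_AGENT):
    """Return (opener, cookie_jar); the opener stores and resends cookies."""
    cookie_jar = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar)
    )
    # Replace the default "Python-urllib/x.y" User-Agent header.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, cookie_jar


def fetch_url(opener, url, data=None, headers=None, timeout=10):
    """Fetch url through the opener and return the response body as bytes."""
    request = urllib.request.Request(url, data=data, headers=headers or {})
    with opener.open(request, timeout=timeout) as response:
        return response.read()
```

A typical call would be `opener, jar = create_opener()` followed by `fetch_url(opener, url)`, where `url` is the page you want. Unlike `fetchURL` above, which builds a new opener (and so a new, empty cookie jar) on every call, this sketch reuses one opener, so cookies set by an earlier response are sent with later requests.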

It's a good idea to keep a list of User-Agent strings and read from it:

    # ffpath: path to a text file with one User-Agent string per line
    with open(ffpath) as f:
        USER_AGENTS_LIST = f.read().splitlines()

and pick a random one from it:

    import random

    # random.choice picks one entry uniformly, same as indexing with randint
    uA = random.choice(USER_AGENTS_LIST)
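
The two steps above could be wrapped in small helpers, roughly as follows. This is a sketch in Python 3; the function names `load_user_agents` and `pick_user_agent` are hypothetical, and it assumes the file holds one User-Agent string per line, as the `splitlines()` call suggests.

```python
import random


def load_user_agents(path):
    """Read one User-Agent string per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def pick_user_agent(user_agents):
    """Return a random entry; random.choice raises IndexError on an empty list."""
    return random.choice(user_agents)
```

Skipping blank lines avoids occasionally sending an empty User-Agent header when the file ends with a stray newline or contains spacing between entries.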

For a list of user agents, take a look here.

This is just to give an idea of how to do this without any external framework.
