
How can I stay logged in to a website and send a cookie or HTTP header or something for my Python program?

OK. This is what my program (in Python) does. First it downloads the HTML of a page on the website locationary.com, then it takes all of the business pages from that page and finds all of the yellowpages.com links for those businesses. After it finds them, it inserts them into the website with the webbrowser module. (If you read through the code, this will probably make a lot more sense.)

First of all, I want a way to submit the yellowpages.com links without the webbrowser module, because to use it I have to keep Firefox permanently logged in to locationary.com, and I then had to figure out a way to close Firefox so that it doesn't get overloaded with tabs. I tried urllib2 and urlopen, but nothing happened when I ran it. Now I'm starting to think that I need to send some kind of cookie or HTTP header with my request. How do you do this?
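
Here is roughly what I think the cookie-handling version would look like, using urllib2 together with cookielib. The login URL and the inUserName / inUserPass field names are only guesses on my part; I would still have to check the site's actual login form for the real ones:

import cookielib
import urllib
import urllib2

# Keep cookies in memory and attach them to every request that goes
# through this opener.
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

# Hypothetical login request -- the URL and field names are guesses
# and need to be read from the site's real login form.
login_data = urllib.urlencode({'inUserName': 'you@example.com',
                               'inUserPass': 'secret'})
opener.open('http://www.locationary.com/loginPage.jsp', login_data)

# The session cookie set during login is now sent automatically.
page = opener.open('http://www.locationary.com/place/en/US/Arizona/'
                   'Phoenix-page7/?ACTION_TOKEN=NumericAction').read()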

If something doesn't make sense, please ask me to clarify!

Here is my code:

from urllib import urlopen
from gzip import GzipFile
from cStringIO import StringIO
import re
import urllib
import urllib2
import webbrowser
import mechanize
import time
from difflib import SequenceMatcher
import os

def download(url):
    s = urlopen(url).read()
    if s[:2] == '\x1f\x8b': # assume it's gzipped data
        with GzipFile(mode='rb', fileobj=StringIO(s)) as ifh:
            s = ifh.read()
    return s

for t in range(0, 1):  # runs only once as written; widen the range to repeat the scrape
    s = download('http://www.locationary.com/place/en/US/Arizona/Phoenix-page7/?ACTION_TOKEN=NumericAction')
    findTitle = re.compile('<title>(.*)</title>')
    getTitle = re.findall(findTitle,s)    
    findLoc = re.compile('http://www\.locationary\.com/place/en/US/.{1,50}/.{1,50}/.{1,100}\.jsp')
    findLocL = re.findall(findLoc,s)

    W, X, XA, Y, YA, Z, ZA = [], [], [], [], [], [], []

    for i in range(2, 25):
        print i

        b = download(findLocL[i])
        findYP = re.compile('http://www\.yellowpages\.com/')
        findYPL = re.findall(findYP,b)
        findTitle = re.compile('<title>(.*) \(\d{1,10}.{1,100}\)</title>')
        getTitle = re.findall(findTitle,b)
        findAddress = re.compile('<title>.{1,100}\((.*), .{4,14}, United States\)</title>')
        getAddress = re.findall(findAddress,b)
        if not findYPL:
            # Only keep listings with a parseable title; the original code
            # re-downloaded the same page twice more here without re-parsing
            # it, so those downloads have been dropped.
            if not getTitle:
                print ""
            else:
                W.append(findLocL[i])
                X.append(getAddress)
                Y.append(getTitle)

    sizeWXY = len(W)

    def XReplace(text, d):
        for (k, v) in d.iteritems():
            text = text.replace(k, v)
        XA.append(text)

    def YReplace(text, d):  # was mistakenly also named XReplace, which shadowed the first definition
        for (k, v) in d.iteritems():
            text = text.replace(k, v)
        YA.append(text)

    for d in range(0, sizeWXY):
        old = str(X[d])
        reps = {' ':'-', ',':'', '\'':'', '[':'', ']':''}
        XReplace(old, reps)
        old2 = str(Y[d])
        YReplace(old2, reps)

    count = 0

    for e in range(0, sizeWXY):
        newYPL = "http://www.yellowpages.com/" + XA[e] + "/" + YA[e] + "?order=distance"
        v = download(newYPL)
        abc = str('<h3 class="business-name fn org">\n<a href="')
        dfe = str('" class="no-tracks url "')
        findFinal = re.compile(abc + '(.*)' + dfe)
        getFinal = re.findall(findFinal, v)

        if not getFinal:
            # Delete by index; remove() drops the first equal value, which
            # misbehaves when the list contains duplicates.
            del W[e - count]
            del X[e - count]
            count = count + 1
        else:
            Z.append(getFinal[0])  # keep only the first match

    XA = []
    for c in range(0,(len(X))):
        aGd = re.compile('(.*), .{1,50}')
        bGd = re.findall(aGd, str(X[c]))
        XA.append(bGd)

    LenZ = len(Z)

    V = []
    countTwo = 0  # only needs initialising once, so hoist it out of the loop
    for i in range(0, len(W)):

        gda = download(Z[i-(countTwo)])
        ab = str('"street-address">\n')
        cd = str('\n</span>')
        ZAddress = re.compile(ab + '(.*)' + cd)
        ZAddress2 = re.findall(ZAddress, gda)

        for b in range(0,(len(ZAddress2))):
            if not ZAddress2[b]:
                print ""
            else:
                V.append(str(ZAddress2[b]))
                a = str(W[i-(countTwo)])
                n = str(Z[i-(countTwo)])
                c = str(XA[i])
                d = str(V[i])
                m = SequenceMatcher(None, c, d)

                if m.ratio() < 0.50:
                    Z.remove(Z[i-(countTwo)])
                    W.remove(W[i-(countTwo)])
                    countTwo = (countTwo+1)

    def ZReplace(text3, dic3):
        for p, q in dic3.iteritems():
            text3 = text3.replace(p, q)  
        ZA.append(text3)

    for y in range(0,len(Z)):
        old3 = str(Z[y])
        reps2 = {':':'%3A', '/':'%2F', '?':'%3F', '=':'%3D'}
        ZReplace(old3, reps2)

    for z in range(0,len(ZA)):
        findPID = re.compile('\d{5,20}')
        getPID = re.findall(findPID,str(W[z]))
        newPID = re.sub("\D", "", str(getPID))
        finalURL = "http://www.locationary.com/access/proxy.jsp?ACTION_TOKEN=proxy_jsp$JspView$SaveAction&inPlaceID=" + str(newPID) + "&xxx_c_1_f_987=" + str(ZA[z])
        webbrowser.open(finalURL)
        time.sleep(5)

    os.system("taskkill /F /IM firefox.exe")

You might try pycurl, which will allow you to interact with web pages without using a browser.

pycurl allows cookie storage (and reuse), form submission (including login), and page navigation from Python. Your code above already retrieves the resources you want via urllib. You can easily adapt it by following the tutorials on the pycurl site.
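
For example, a minimal pycurl session that logs in once, lets libcurl keep the session cookie in a jar file, and then reuses it for later requests could look like the sketch below. The login URL and the inUserName / inUserPass field names are the same guesses as in the question and must be verified against the site's real login form.

import pycurl
import urllib
from cStringIO import StringIO

def fetch(curl, url):
    # GET a URL with the shared handle and return the body as a string.
    buf = StringIO()
    curl.setopt(pycurl.HTTPGET, 1)   # switch back to GET after any POST
    curl.setopt(pycurl.URL, url)
    curl.setopt(pycurl.WRITEFUNCTION, buf.write)
    curl.perform()
    return buf.getvalue()

curl = pycurl.Curl()
# Read cookies from, and write them back to, the same jar file so the
# session survives between runs of the script.
curl.setopt(pycurl.COOKIEFILE, 'cookies.txt')
curl.setopt(pycurl.COOKIEJAR, 'cookies.txt')
curl.setopt(pycurl.FOLLOWLOCATION, 1)

# Log in once; the server's Set-Cookie response lands in the jar.
# The URL and field names below are assumptions, not the site's
# documented API.
buf = StringIO()
curl.setopt(pycurl.URL, 'http://www.locationary.com/loginPage.jsp')
curl.setopt(pycurl.POSTFIELDS,
            urllib.urlencode({'inUserName': 'you@example.com',
                              'inUserPass': 'secret'}))
curl.setopt(pycurl.WRITEFUNCTION, buf.write)
curl.perform()

# Later requests made through the same handle reuse the cookie
# automatically, so pages that require a login can be fetched directly.
page = fetch(curl, 'http://www.locationary.com/place/en/US/Arizona/'
                   'Phoenix-page7/?ACTION_TOKEN=NumericAction')
curl.close()

Because every request goes through the same handle and the same cookie jar, the webbrowser.open / taskkill workaround at the end of your script could be replaced with a plain fetch(curl, finalURL) call.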

Regarding your comment about needing to handle JavaScript, please elaborate on the context of that.

A different but related post here on Stack Overflow addresses a JavaScript question with pycurl and also shows code for initiating a pycurl connection.
