
How to implement login handover from mechanize to pycurl

I need to log in to a website using mechanize in Python and then continue traversing that website using pycurl. What I need to know is how to transfer the logged-in state established via mechanize to pycurl. I assume it's not just about copying the cookie over. Or is it? Code examples are valued ;)

Why I'm not willing to use pycurl alone: I have time constraints and my mechanize code worked after 5 minutes of modifying this example as follows:

import mechanize
import cookielib

# Browser
br = mechanize.Browser()

# Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)

# Browser options
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)

# Follows refresh 0 but does not hang on refresh > 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
# debugging messages?
#br.set_debug_http(True)
#br.set_debug_redirects(True)
#br.set_debug_responses(True)

# User-Agent (this is cheating, ok?)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

# Open the site
r = br.open('https://thewebsite.com')
html = r.read()

# Show the source
print html
# or
print br.response().read()

# Show the html title
print br.title()

# Show the response headers
print r.info()
# or
print br.response().info()

# Show the available forms
for f in br.forms():
    print f

# Select the first (index zero) form
br.select_form(nr=0)

# Enter credentials and submit the login form
br.form['username']='someusername'
br.form['password']='somepwd'
br.submit()

print br.response().read()

# Looking at some results in link format
for l in br.links(url_regex=r'\.com'):
    print l

Now if I could only transfer the right information from br object to pycurl I would be done.

Why I'm not willing to use mechanize alone: mechanize is based on urllib, and urllib is a nightmare. I have had too many traumatizing issues with it. I can swallow one or two calls in order to log in, but please no more. In contrast, pycurl has proven to be stable, customizable and fast for me. In my experience, pycurl is to urllib what Star Trek is to The Flintstones.

PS: In case anyone wonders, I use BeautifulSoup once I have the HTML.

Solved it. Apparently it WAS all about the cookie. Here is my code to get the cookie:

import cookielib
import mechanize

def getNewLoginCookieFromSomeWebsite(username='someusername', pwd='somepwd'):
    """
    Log in to somewebsite.com with mechanize and return the resulting
    cookies as a single string suitable for pycurl.COOKIE.
    """
    # Browser
    br = mechanize.Browser()

    # Cookie Jar
    cj = cookielib.LWPCookieJar()
    br.set_cookiejar(cj)

    # Browser options
    br.set_handle_equiv(True)
    br.set_handle_gzip(True)
    br.set_handle_redirect(True)
    br.set_handle_referer(True)
    br.set_handle_robots(False)

    # Follows refresh 0 but does not hang on refresh > 0
    br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)

    # User-Agent
    br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:26.0) Gecko/20100101 Firefox/26.0')]

    # Open login site
    response = br.open('https://www.somewebsite.com')

    # Select the first (index zero) form
    br.select_form(nr=0)

    # Enter credentials
    br.form['user']=username
    br.form['password']=pwd
    br.submit()

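    # Build a Cookie header value ("name=value; name2=value2") from the jar.
    # cj is the jar passed to br.set_cookiejar() above, so there is no need
    # to reach into mechanize's private _ua_handlers attribute.
    cookiestr = "; ".join(c.name + '=' + c.value for c in cj)

    return cookiestr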

To make pycurl send that cookie, set it on the handle (here c, a pycurl.Curl object) before c.perform() is called:

c.setopt(pycurl.COOKIE, getNewLoginCookieFromSomeWebsite("username", "pwd"))
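
For context, here is a minimal sketch of what a complete pycurl request carrying the cookie string could look like. The names build_cookie_header and fetch_with_cookie are made up for this illustration and are not part of the original code; the sketch assumes pycurl is installed and imports it only inside the function that needs it.

```python
from io import BytesIO


def build_cookie_header(pairs):
    """Join (name, value) pairs into a Cookie header value."""
    return "; ".join("%s=%s" % (name, value) for name, value in pairs)


def fetch_with_cookie(url, cookie_header):
    """Fetch url with pycurl, sending cookie_header as the Cookie header.

    Returns a (status_code, body_bytes) tuple.
    """
    import pycurl  # imported lazily so the helper above works without pycurl

    buf = BytesIO()
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.COOKIE, cookie_header)     # must be set before perform()
    c.setopt(pycurl.FOLLOWLOCATION, 1)         # follow redirects after login
    c.setopt(pycurl.WRITEFUNCTION, buf.write)  # collect the response body
    try:
        c.perform()
        status = c.getinfo(pycurl.RESPONSE_CODE)
    finally:
        c.close()
    return status, buf.getvalue()
```

With the function from the answer, a call would look roughly like fetch_with_cookie('https://www.somewebsite.com', getNewLoginCookieFromSomeWebsite('username', 'pwd')). If you already hold the cookie jar, build_cookie_header([(c.name, c.value) for c in cj]) should produce an equivalent string.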

Keep in mind: some websites keep updating the cookie via Set-Cookie response headers, and pycurl (unlike mechanize) does not act on those automatically when the cookie is supplied this way. Pycurl simply sends the string it was given and leaves any further cookie handling to the user.
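
If the site does rotate cookies, one option is to let libcurl track them itself. The sketch below is an assumption-laden illustration, not the original poster's method: it turns on libcurl's cookie engine by setting pycurl.COOKIEFILE to an empty string and seeds it through pycurl.COOKIELIST. The helper names to_set_cookie_line and make_session are hypothetical, and cookies is assumed to be an iterable of objects with name, value, domain and path attributes, such as the entries of the cookielib jar used with mechanize.

```python
def to_set_cookie_line(name, value, domain, path="/"):
    """Format one cookie as a 'Set-Cookie:' line accepted by COOKIELIST."""
    return "Set-Cookie: %s=%s; domain=%s; path=%s" % (name, value, domain, path)


def make_session(cookies):
    """Return a pycurl handle whose cookie engine is seeded with cookies."""
    import pycurl  # imported lazily; assumes pycurl is installed

    c = pycurl.Curl()
    # An empty file name enables the cookie engine without loading a file.
    c.setopt(pycurl.COOKIEFILE, "")
    for ck in cookies:
        line = to_set_cookie_line(ck.name, ck.value, ck.domain, ck.path)
        c.setopt(pycurl.COOKIELIST, line)
    return c
```

As long as the same handle is reused for subsequent requests, libcurl should then store cookies from Set-Cookie responses and send them back on later requests, so the string does not have to be rebuilt by hand. Cookie attributes such as secure or expires are not carried over in this sketch.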
