
How to download a file from a website that requires login information using Python?

I am trying to download some data from a website using Python. If you simply open the URL in a browser, it shows nothing unless you fill in the login information. I have the login name and password; how should I include these in my Python code?

My current code is:

import urllib, urllib2, cookielib

username = 'my_user_name'
password = 'my_pwd'

link = 'http://www.google.com'  # just for instance; urllib2 needs the scheme
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
login_data = urllib.urlencode({'username': username, 'j_password': password})

opener.open(link, login_data)         # first request submits the login form
resp = opener.open(link, login_data)  # second request re-fetches the page
print resp.read()

No error pops out; however, resp.read() returns a bunch of CSS, and it only contains messages like "you have to login before reading news here."

So how can I retrieve the page as it appears after logging in?

I just noticed that the website's login form actually requires three entries:

Company: 

Username: 

Password:

I have all of them, but how can I put all three into the login_data variable?
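
A minimal sketch of what the payload might look like, assuming the extra field is named 'company' (the real field name has to be read from the login form's HTML):

login_data = urllib.urlencode({'company': 'my_company',  # assumed field name
                               'username': username,
                               'password': password})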

If I run it without the login data, like this:

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

opener.open(dd)        # dd holds the page URL
resp = opener.open(dd)

print resp.read()

here is the relevant part of the print-out:

<DIV id=header>
<DIV id=strapline><!-- login_display -->
<P><FONT color=#000000>All third party users of this website and/or data produced by the Baltic do so at their own risk. The Baltic owes no duty of care or any other obligation to any party other than the contractual obligations which it owes to its direct contractual partners. </FONT></P><IMG src="images/top-strap.gif"> <!-- template [strapline]--></DIV><!-- end strapline -->
<DIV id=memberNav>
<FORM class=members id=form1 name=form1 action=client_login/client_authorise.asp?action=login method=post onsubmits="return check()">
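
The form's action attribute above reveals the real login endpoint. Here is a minimal sketch of posting all three fields there with the same urllib2/cookielib setup; the field names (companyName, username, password, status) are assumptions taken from the requests-based answer further down, and the ... stands for the site's domain:

import urllib, urllib2, cookielib

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

# POST the credentials to the form's action URL, not to the page itself
login_url = 'http://.../client_login/client_authorise.asp?action=login'
login_data = urllib.urlencode({'companyName': 'some_company',  # assumed field names
                               'username': 'some_user',
                               'password': 'some_password',
                               'status': ''})
opener.open(login_url, login_data)  # the cookie jar now holds the session cookie

resp = opener.open('http://.../')   # later requests are sent with that cookie
print resp.read()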

Use Scrapy for crawling that data.

Then you can do something like this:

from scrapy.spider import Spider
from scrapy.http import FormRequest
from scrapy import log

class LoginSpider(Spider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        # fill in and submit the login form found on the page
        return [FormRequest.from_response(response,
                    formdata={'username': 'john', 'password': 'secret'},
                    callback=self.after_login)]

    def after_login(self, response):
        # check that the login succeeded before going on
        if "authentication failed" in response.body:
            self.log("Login failed", level=log.ERROR)
            return
        # logged in; continue crawling the authenticated pages from here
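
Assuming the spider is saved as login_spider.py, it can be run without a full Scrapy project via:

scrapy runspider login_spider.py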

This code should work, using Python Requests. Just replace the ... with the actual domain and, of course, the login data.

from requests import Session

s = Session() # this session will hold the cookies

# here we first log in and obtain our session cookie
s.post("http://.../client_login/client_authorise.asp?action=login",
       {"companyName": "some_company", "password": "some_password",
        "username": "some_user", "status": ""})

# now we're logged in and can request any page
resp = s.get("http://.../").text

print(resp)
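
Since the end goal is downloading a file, the same session can stream the file to disk once logged in. A minimal sketch; the file URL and output name below are placeholders:

# stream the download so large files are not held in memory at once
dl = s.get("http://.../path/to/the/file", stream=True)
with open("downloaded_file", "wb") as f:
    for chunk in dl.iter_content(chunk_size=8192):
        f.write(chunk)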

Try using another user agent in the headers. It looks like the site has some kind of scraper detection; you didn't provide the URL, so I can't check for that. Some sites also run JavaScript tests to see whether the request looks automated; in that case, go for Playwright or Selenium.
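
For example, with Requests a browser-like User-Agent can be set on the session before logging in (the UA string below is just an illustrative example):

from requests import Session

s = Session()
# send a browser-like User-Agent with every request from this session
s.headers.update({"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"})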
