
Python web scraping with requests - status code 200, but no successful login

This is my first question and I'm a novice programmer, so please bear with me: I'm trying to do some fun web scraping with Python and the requests library. However, I cannot log in to the website successfully. Here is the HTML of the login form on the site:

<body class="login"><div class="wrapper">
    
        <div class="wrapper_forms">
    <!-- Login Form --> 
            
                    <form class="login" action="?" method="post">
                        <fieldset>
                            <legend>Login</legend>
                            <label for="login_username">Benutzername:</label>
                            <input id="login_username" type="text" name="username" autofocus="autofocus" /><em>*</em><br>
                            <label for="login_password">Passwort:</label>
                            <input id="login_password" type="password" name="password" /><em>*</em><br>
                            <em>* required</em><br>
                            <input type="hidden" name="form" value="login" />
                            <input id="screen_width" type="hidden" name="screen_width" value="" />
                            <input type="submit" value="Login" />
                            <br><a class="pw_reset" href="?action=pw_reset">Passwort vergessen</a>
                            <br>
                        </fieldset>
                    </form>
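Since the payload is being assembled by hand from this form, one common technique (a sketch, not part of the original post) is to parse the form's `<input>` elements with BeautifulSoup so that hidden fields like `form` and `screen_width` are picked up automatically and can't be mistyped:

```python
from bs4 import BeautifulSoup

def build_payload(html, username, password):
    """Collect every named <input> of the login form into a payload dict."""
    soup = BeautifulSoup(html, "html.parser")
    form = soup.find("form", class_="login")
    payload = {}
    for inp in form.find_all("input"):
        name = inp.get("name")
        if name:  # the submit button here has no name and is skipped
            payload[name] = inp.get("value", "")
    # Fill in the credentials and the screen width a browser would report
    payload["username"] = username
    payload["password"] = password
    payload["screen_width"] = "1920"
    return payload
```

Applied to the form above, this yields the same four keys (`username`, `password`, `form`, `screen_width`) as the hand-written dictionary in the question.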

Here is an abbreviated version of my code. I also ran some extra checks on the response objects of my GET and POST requests, which I have excluded for readability. The site currently doesn't have a valid SSL certificate, hence the verify=False argument.

from bs4 import BeautifulSoup
import requests


# Start session
session = requests.Session()

# payload
payload = {'username':'****',
           'password':'****',
           'form':'login',
           'screen_width':'1920'}

# variables
login = "https://altklausurendb.de/login.php"
dest = "https://altklausurendb.de/index.php"

# initial get request to retrieve cookies and headers
s = session.get(login, verify=False)

cookies = s.cookies.get_dict()
headers = s.headers

# distribute payload
r1 = session.post(login, data=payload, cookies=cookies, headers=headers, verify=False)

# download content of destination site
r2 = session.get(dest, cookies=cookies, verify=False)

soup = BeautifulSoup(r2.text, 'html.parser')

print(soup)

The output of the print call is the HTML login form shown above, not the content of https://altklausurendb.de/index.php, which the dest variable points to in the second GET request.

Things I have tried on my own after a lot of reading here, on Reddit, and elsewhere on the web:

  • running everything without explicitly passing cookies and headers in the POST request (the obvious first step)
  • logging in manually and inspecting the POST request the browser generates, using the Network tab of the Chrome DevTools. I even copied those headers by hand into a separate version of the script, in case the requests headers were somehow wrong.
  • using HTTPBasicAuth for the login procedure
  • running all the checks on the response objects created by the initial GET and POST requests. Interestingly, the status code is always 200, no matter what username and password I put in the dictionary. This leads me to suspect that the POST request is not being submitted properly, since the website doesn't even respond with 403 for wrong credentials.
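Since the status code is 200 either way, a more reliable check (a sketch, assuming the site simply re-serves the login form when credentials are rejected) is to look at the response body rather than the status code:

```python
from bs4 import BeautifulSoup

def login_succeeded(html):
    """Heuristic: a successful login should no longer serve the login form."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find("form", class_="login") is None
```

Used as `login_succeeded(r1.text)` right after the POST, this distinguishes "logged in" from "login page served again" even when both come back as 200. Inspecting `r1.history` can also reveal whether the server issued a redirect after the POST.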

I have seen that some websites also put a name attribute on the <input type="submit"> element that you have to include in the dictionary. However, this website does not seem to have one, and this is basically where I am stuck right now.

I'm willing to DM somebody the credentials for a test account I created on the website if they want to reproduce the behavior of the script themselves.

Thanks for your patience. I know web scraping questions are not everyone's favorite, and I appreciate your help!

Toss all of that out and get the Selenium module. BS4 is good for parsing HTML, but if you're going to scrape dynamic pages, you need Selenium.
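For reference, a minimal Selenium sketch of the same login. The element ids come from the form in the question; the headless option, the certificate flag, and the function name are assumptions, not tested against the site:

```python
def login_with_selenium(url, username, password):
    """Log in by driving a real browser; returns the post-login page source."""
    # Imported inside the function so the sketch only needs selenium when run.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")               # no visible window
    options.add_argument("--ignore-certificate-errors")  # site lacks a valid cert
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        driver.find_element(By.ID, "login_username").send_keys(username)
        driver.find_element(By.ID, "login_password").send_keys(password)
        driver.find_element(By.CSS_SELECTOR, "form.login input[type=submit]").click()
        return driver.page_source
    finally:
        driver.quit()
```

Because the browser executes any JavaScript the page relies on (for example, a script that fills in the empty screen_width hidden field), this sidesteps the problem of reconstructing the POST request by hand.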
