
Python web scraping with requests - status code 200, but no successful login

Ready to ask my first question, and a novice in programming, so please bear with me: I'm trying to do some fun web scraping with Python and the requests library. However, I cannot log in to the website successfully. Here is the HTML code of the form on the site I'm trying to log into:

<body class="login"><div class="wrapper">
    
        <div class="wrapper_forms">
    <!-- Login Form --> 
            
                    <form class="login" action="?" method="post">
                        <fieldset>
                            <legend>Login</legend>
                            <label for="login_username">Benutzername:</label>
                            <input id="login_username" type="text" name="username" autofocus="autofocus" /><em>*</em><br>
                            <label for="login_password">Passwort:</label>
                            <input id="login_password" type="password" name="password" /><em>*</em><br>
                            <em>* required</em><br>
                            <input type="hidden" name="form" value="login" />
                            <input id="screen_width" type="hidden" name="screen_width" value="" />
                            <input type="submit" value="Login" />
                            <br><a class="pw_reset" href="?action=pw_reset">Passwort vergessen</a>
                            <br>
                        </fieldset>
                    </form>

Here is an abbreviated version of my code. I also ran some extra checks on the response objects of my GET and POST requests, which I have excluded for readability. The site currently doesn't have a valid SSL certificate, hence the verify=False argument.

from bs4 import BeautifulSoup
import requests


# Start session
session = requests.Session()

# Payload mirroring the named input fields of the form
payload = {'username': '****',
           'password': '****',
           'form': 'login',
           'screen_width': '1920'}

# URLs
login = "https://altklausurendb.de/login.php"
dest = "https://altklausurendb.de/index.php"

# Initial GET request to retrieve cookies and headers
s = session.get(login, verify=False)

cookies = s.cookies.get_dict()
headers = s.headers

# Submit the payload
r1 = session.post(login, data=payload, cookies=cookies, headers=headers, verify=False)

# Download the content of the destination site
r2 = session.get(dest, cookies=cookies, verify=False)

soup = BeautifulSoup(r2.text, 'html.parser')

print(soup)

The output of the print function gives me back the HTML form above, not the content of https://altklausurendb.de/index.php, which is the dest variable for the second GET request.
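Among the extra checks I excluded for readability, this is roughly how I inspect r2 (a sketch; the assumption that a successful login would return a page without the login form is mine):

# Sketch: checks on r2 (assumption: a logged-in page no longer serves the login form)
print(r2.url)       # final URL after any redirects
print(r2.history)   # list of redirect responses, if any
if soup.find('form', class_='login'):
    print("still being served the login form")
else:
    print("login form is gone")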

Things I have tried on my own, after a lot of reading on here, reddit, and the rest of the WWW:

  • running it all without explicitly passing cookies and headers in the POST request (obvious)
  • logging in manually and checking the POST request that is generated when I log into the website, using the Chrome DevTools under the Network tab. I even manually copied the headers into a separate version of the script, in case the headers attribute of requests somehow provides the wrong headers.
  • using HTTPBasicAuth for the login procedure
  • running all the checks on the response objects created from the initial GET and POST requests (see the sketch after this list); interestingly, the status code is always 200, no matter what username and password I provide in the dictionary. This leads me to suspect that the POST request is not properly submitted, since the website doesn't respond with 403 for wrong credentials.
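To see what actually goes out on the wire, I also look at the prepared request behind r1, since requests keeps the sent request on the response object (a sketch of those checks):

# Sketch: inspect the POST as it was actually submitted
print(r1.status_code)      # 200 for me, no matter which credentials I use
print(r1.url)              # final URL after any redirects
print(r1.request.headers)  # request headers as they were sent
print(r1.request.body)     # urlencoded form payload as it was sent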

I see that some websites also have a name attribute on the <input type="submit"> element that you have to pass in the dictionary. However, this website does not seem to have one, and basically this is where I am stuck right now.
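One idea I have not fully ruled out: building the payload directly from the form instead of hard-coding it, so that every hidden field (and a submit name, if one existed) is picked up automatically. A sketch of what I mean, reusing the response s from the initial GET:

# Sketch: collect every named <input> of the login form from the initial GET
form_soup = BeautifulSoup(s.text, 'html.parser')
form = form_soup.find('form', class_='login')

payload = {}
for inp in form.find_all('input'):
    name = inp.get('name')
    if name:
        payload[name] = inp.get('value', '')

payload['username'] = '****'
payload['password'] = '****'
payload['screen_width'] = '1920'  # empty in the HTML, presumably filled in by JavaScript
print(payload)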

I'm willing to DM somebody the credentials for a test account I created on the website, if they want to reproduce the behavior of the script themselves.

Thanks for your patience. I know web scraping questions are not everyone's favorite, and I appreciate your help!

Toss all of that out and get the selenium module. BS4 is good for scraping HTML, but if you're going to scrape dynamic pages you need selenium.
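A minimal sketch of that approach, assuming Chrome and reusing the element IDs from the posted HTML (the --ignore-certificate-errors flag stands in for the question's verify=False):

from selenium import webdriver
from selenium.webdriver.common.by import By

# Ignore the invalid SSL certificate mentioned in the question
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')

driver = webdriver.Chrome(options=options)
driver.get('https://altklausurendb.de/login.php')

# Fill the form using the IDs from the posted HTML and submit it
driver.find_element(By.ID, 'login_username').send_keys('****')
driver.find_element(By.ID, 'login_password').send_keys('****')
driver.find_element(By.CSS_SELECTOR, 'input[type="submit"]').click()

# The logged-in page can then be handed to BS4 for parsing
print(driver.page_source)
driver.quit()

A real browser also executes the page's JavaScript, so fields like screen_width presumably get filled in the way the site expects.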
