
Python web scraping with requests - after login

I have the Python requests / Beautiful Soup code below, which lets me log in to a URL successfully. However, after logging in, to get the data I need I would normally have to manually:

1) click on 'statement' in the first row:

[screenshot]

2) select dates, click 'run statement':

[screenshot]

3) view data:

[screenshot]

This is the code I have used to log in and reach step 1 above:

import requests
from bs4 import BeautifulSoup

logurl = "https://login.flash.co.za/apex/f?p=pwfone:login"
posturl = 'https://login.flash.co.za/apex/wwv_flow.accept'

with requests.Session() as s:
    s.headers = {"User-Agent": "Mozilla/5.0"}
    # Fetch the login page so the hidden APEX form fields can be scraped
    res = s.get(logurl)
    soup = BeautifulSoup(res.text, "html.parser")

    # There can be several p_arg_names inputs; collect them all
    arg_names = []
    for name in soup.select("[name='p_arg_names']"):
        arg_names.append(name['value'])

    # Rebuild the form payload the login page would submit
    values = {
        'p_flow_id': soup.select_one("[name='p_flow_id']")['value'],
        'p_flow_step_id': soup.select_one("[name='p_flow_step_id']")['value'],
        'p_instance': soup.select_one("[name='p_instance']")['value'],
        'p_page_submission_id': soup.select_one("[name='p_page_submission_id']")['value'],
        'p_request': 'LOGIN',
        'p_t01': 'solar',
        'p_arg_names': arg_names,
        'p_t02': 'password',
        'p_md5_checksum': soup.select_one("[name='p_md5_checksum']")['value'],
        'p_page_checksum': soup.select_one("[name='p_page_checksum']")['value']
    }
    s.headers.update({'Referer': logurl})
    r = s.post(posturl, data=values)
    print(r.content)

My question is (beginner speaking): how could I skip steps 1 and 2, and simply do another headers update and post to the final URL, using the selected dates as form entries (headers and form info below)? (The referer header is step 2 above.)

[screenshot]

Edit 1: network request from the CSV file download:

[screenshot]

Use Selenium WebDriver; it has many good features for working with web services.

Selenium is going to be your best bet for automated browser interactions. It can be used not only to scrape data from websites but also to interact with forms and the like. I highly recommend it, as I have used it quite a bit in the past. If you already have pip and Python installed, go ahead and type:

pip install selenium

That will install Selenium, but you also need to install either geckodriver (for Firefox) or chromedriver (for Chrome). Then you should be up and running!
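Once the driver is installed, the login could be sketched roughly as below. This is only an outline: it assumes the login inputs keep the `p_t01`/`p_t02` names seen in the question's form payload, which you should verify in your browser's inspector before relying on them.

```python
# Minimal Selenium sketch (assumed selectors -- check them in the inspector).
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # or webdriver.Firefox() with geckodriver
driver.get("https://login.flash.co.za/apex/f?p=pwfone:login")

# p_t01 / p_t02 are assumed from the requests payload in the question
driver.find_element(By.NAME, "p_t01").send_keys("solar")
driver.find_element(By.NAME, "p_t02").send_keys("password")
driver.find_element(By.NAME, "p_t02").submit()

# After login you can click through steps 1 and 2 the same way, e.g.
# driver.find_element(By.LINK_TEXT, "statement").click(), fill in the date
# fields, then read the rendered table from driver.page_source.
driver.quit()
```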

As others have recommended, Selenium is a good tool for this sort of task. However, I'd like to suggest a way to use requests for this purpose, as that's what you asked for in the question.

The success of this approach will really depend on how the webpage is built and how data files are made available (if "Save as CSV" in the view-data step is what you're targeting).

If the login mechanism is cookie-based, you can use Sessions and Cookies in requests. When you submit a login form, a cookie is returned in the response headers. You add that cookie to the request headers in any subsequent page requests to make your login stick.
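In fact, `requests.Session` (which the question's code already uses) handles this automatically: cookies set by a response are stored on the session and resent on every later request. A small sketch of the mechanism, with a hypothetical cookie name standing in for whatever the server actually sets:

```python
# requests.Session keeps a cookie jar and replays it on later requests,
# so a successful login "sticks" for the whole session.
import requests

s = requests.Session()

# After s.post(posturl, data=values), the server's Set-Cookie headers land
# in s.cookies. The name below is a hypothetical stand-in:
s.cookies.set("ORA_WWV_APP_SESSION", "abc123", domain="login.flash.co.za")

# Any later s.get()/s.post() in this session resends the cookie, so
# protected pages see you as logged in without manual header juggling.
assert s.cookies.get("ORA_WWV_APP_SESSION") == "abc123"
```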

Also, you should inspect the network request for the "Save as CSV" action in the Developer Tools network pane. If you can see a structure to the request, you may be able to make a direct request within your authenticated session, using a statement identifier and the dates as the payload to get your results.
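For example, once the network pane shows the CSV request's URL and form fields, the same logged-in session can replay it directly. Every field name and value below is a hypothetical placeholder; copy the real ones from the captured request:

```python
# Sketch: replay the "Save as CSV" request directly. All field names here
# are hypothetical -- take the real ones from the Developer Tools capture.
def build_statement_payload(statement_id, date_from, date_to):
    """Assemble form data for a direct CSV request (assumed field names)."""
    return {
        "p_statement_id": statement_id,  # hypothetical field name
        "p_date_from": date_from,        # hypothetical field name
        "p_date_to": date_to,            # hypothetical field name
        "p_request": "CSV",              # hypothetical request flag
    }

payload = build_statement_payload("12345", "01-JAN-2020", "31-JAN-2020")

# Within the logged-in session `s` from the question:
# csv_resp = s.post("https://login.flash.co.za/apex/wwv_flow.accept", data=payload)
# with open("statement.csv", "wb") as f:
#     f.write(csv_resp.content)
```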
