Login to webpage with Python requests library: status code 200, failing to parse, and unauthorized pages still unauthorized

I'm attempting to build my first web scraper, which aims to scrape some data tables from a website and use them to populate pandas DataFrames. The website, spotrac.com, requires a login.

I'm running into a couple of issues. The first is that when I call the requests session's post method, I get the following error: "InvalidURL: Failed to parse: <Response [200]>". This error occurs when I run the following code:

import pandas as pd
import requests, lxml.html
from bs4 import BeautifulSoup

with requests.session() as s:    
    login = s.get('http://www.spotrac.com/signin/submit/')
    form = {'redirect' : '',
            'email' : 'my_email@gmail.com',
            'password' : 'my_password',
            }    

    p = s.post(login, data=form)

    r1 = s.get('http://www.spotrac.com/nfl/rankings/2019/base/quarterback/')
    soup1 = BeautifulSoup(r1.content,'lxml')
    table1 = soup1.find_all('table')[0]
    df1 = pd.read_html(str(table1))
    
    r2 = s.get('http://www.spotrac.com/nfl/rankings/2018/base/quarterback/')
    soup2 = BeautifulSoup(r2.content,'lxml')
    table2 = soup2.find_all('table')[0]
    df2 = pd.read_html(str(table2))
    
    print(r1.status_code) # 200
    print(r2.status_code) # 200

To allow my code to continue running, I put the post line in a try block, as follows:

with requests.session() as s:    
    try:
        p = s.post(login, data=form)
    except:
        pass

This does allow my code to continue running, and when I print out the status codes, they show 200. I believe this confirms that I am logged in?
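A 200 status alone does not confirm a login. In the code above, the printed codes belong to the later GET requests (r1 and r2), and a login page that a protected URL redirects to is normally served with 200 as well. A minimal sketch of a stricter check follows, assuming the site sends unauthenticated visitors to a URL whose path contains "signin". That marker and the helper name `was_redirected_to_login` are illustrative assumptions, not something confirmed for spotrac.com.

```python
from types import SimpleNamespace
from urllib.parse import urlparse


def was_redirected_to_login(response, login_marker="signin"):
    """Return True if the final URL of a requests-style response looks like a login page.

    Assumption: the login page's path contains `login_marker`.
    """
    final_path = urlparse(response.url).path
    return login_marker in final_path


# Stand-in objects exposing the same `url` and `status_code` attributes as
# requests.Response, so the example runs without network access.
data_page = SimpleNamespace(
    status_code=200,
    url="http://www.spotrac.com/nfl/rankings/2019/base/quarterback/",
)
bounced = SimpleNamespace(status_code=200, url="http://www.spotrac.com/signin/")

print(was_redirected_to_login(data_page))  # False
print(was_redirected_to_login(bounced))    # True
```

With a real session, the same function could be called on r2. The response's `url` attribute holds the final URL after any redirects, so both stand-ins above report 200 even though only one of them reached the data.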

The issue I run into after this is that when I request a page that requires authorization, the data tables aren't populated as they would be if I were signed in. In my code, you'll see r1 and r2. When I go through the process of getting the data table from r1, I have no issue. I believe that is because it is publicly available. When I try the same process for r2, I get the following error: "IndexError: list index out of range". This error occurs because the webpage does not load the data table, as the data table is only available to premium customers who have logged in. I know this is the case, because if you go to the r1 webpage ( http://www.spotrac.com/nfl/rankings/2019/base/quarterback/ ), you'll see a data table without logging in. If you attempt to go to the r2 webpage ( http://www.spotrac.com/nfl/rankings/2018/base/quarterback/ ), you will not see a data table. Instead, you get redirected to a login page.
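The IndexError comes from indexing `[0]` into an empty `find_all('table')` result. One way to make that case explicit, sketched here with a hypothetical helper `first_table_html` and made-up HTML strings, is to use `find`, which returns None when nothing matches:

```python
from bs4 import BeautifulSoup


def first_table_html(html):
    """Return the first <table> in `html` as a string, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")  # find() returns None instead of raising
    return None if table is None else str(table)


# Made-up pages standing in for a public page and a login page.
with_table = "<html><body><table><tr><td>1</td></tr></table></body></html>"
login_page = "<html><body><form action='/signin/submit/'></form></body></html>"

print(first_table_html(with_table) is not None)  # True
print(first_table_html(login_page))              # None
```

A None result can then be reported as "no table, probably not logged in" rather than surfacing as an index error. Note also that `pd.read_html` returns a list of DataFrames, so the table itself is the first element of that list. Recent pandas versions prefer the HTML string to be wrapped in `io.StringIO` before it is passed in.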

Any help you can provide would be greatly appreciated. I really don't know how to proceed at this point.

Thank you, David

This was actually a very silly mistake. The issue is that I was passing the following "login" variable, which holds the Response returned by s.get, to the post method rather than the login URL:

login = s.get('http://www.spotrac.com/signin/submit/')
p = s.post(login, data=form)

I still wanted to run the get method, so I just assigned the URL to a separate variable. Simple fix:

login = 'http://www.spotrac.com/signin/submit/'
r = s.get(login)
p = s.post(login, data=form)
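To see why the original call failed, note that requests converts whatever it receives as the URL to a string, and the string form of a Response object is its repr. The sketch below reproduces this without network access by building a Response by hand; the form value is a placeholder.

```python
import requests

# Build a Response by hand so no network access is needed.
resp = requests.Response()
resp.status_code = 200

print(repr(resp))  # <Response [200]>

error = None
try:
    # Prepare a POST with the Response object in place of the URL,
    # mirroring s.post(login, data=form) from the question.
    requests.Request("POST", resp, data={"email": "placeholder"}).prepare()
except requests.exceptions.RequestException as exc:
    error = exc

print(type(error).__name__, error)
```

In the question this surfaced as InvalidURL with the message "Failed to parse: <Response [200]>". The exact exception subclass may depend on the installed requests and urllib3 versions, which is why the sketch catches the RequestException base class.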
