How to crawl a website that requires login using BeautifulSoup in Python3

I'm trying to parse articles from 'https://financialpost.com/'; an example link is provided below. To parse them, I need to log in to their website.

I do successfully post my credentials; however, it still does not parse the entire webpage, just the beginning.

How do I crawl everything?

import requests
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

link = 'https://financialpost.com/sign-in/'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36'
    res = s.get(link)
    soup = BeautifulSoup(res.text, 'html.parser')
    # Collect every named input on the sign-in form (hidden fields
    # included), keeping any pre-filled default values.
    payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
    payload['email'] = 'email@email.com'
    payload['password'] = 'my_password'
    s.post(link, data=payload)

url = 'https://financialpost.com/pmn/business-pmn/hydrogen-is-every-u-s-gas-utilitys-favorite-hail-mary-pass'
# Note: urlopen() is separate from the session above, so the login
# cookies are NOT sent with this request.
content_url = Request(url)
article_content = urlopen(content_url).read()
article_soup = BeautifulSoup(article_content, 'html.parser')
article_table = article_soup.find_all('section', attrs={'class': 'article-content__content-group'})
for x in article_table:
    print(x.find('p').text)

Using just requests

It's a bit complicated using just requests, but possible: you would first authenticate to obtain an authentication token, then request the article with that token so the site knows you are authenticated and serves the full article. To find out which API endpoints are used for authentication and for loading the content, you can use something like Chrome DevTools or Fiddler (they record all HTTP requests, so you can manually pick out the interesting ones). A rough sketch of that flow is shown below.
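
To make the flow concrete, here is a minimal sketch of the token pattern. The endpoint URL, payload field names, response shape, and Authorization header are all placeholders assumed for illustration; the real values have to be read out of the requests recorded in DevTools/Fiddler.

import requests

# Placeholder endpoint: the real one must be discovered with
# DevTools/Fiddler; this URL is an assumption, not the site's actual API.
AUTH_URL = 'https://financialpost.com/api/auth/login'
ARTICLE_URL = 'https://financialpost.com/pmn/business-pmn/hydrogen-is-every-u-s-gas-utilitys-favorite-hail-mary-pass'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0'
    # 1. Authenticate; the field names and the JSON response shape
    #    ({'token': ...}) are assumed here.
    resp = s.post(AUTH_URL, json={'email': 'email@email.com',
                                  'password': 'my_password'})
    token = resp.json()['token']

    # 2. Re-request the article with the token attached (commonly sent
    #    as a Bearer header), so the server returns the full text.
    s.headers['Authorization'] = f'Bearer {token}'
    article = s.get(ARTICLE_URL)
    print(article.text[:500])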

Using just Selenium

An easier way would be to just use Selenium. It is a real browser driven from code, so you can open the sign-in page, authenticate, and then request the article, and the site will treat you like a human visitor. A minimal sketch is shown below.
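
A minimal sketch using the Selenium Python bindings (Selenium 4). The CSS selectors for the login form are guesses; inspect the real sign-in page to find the correct ones.

from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # needs Chrome installed; Selenium 4.6+ manages the driver
driver.implicitly_wait(10)   # give dynamically loaded elements time to appear

# Log in through the real sign-in page. The selectors below are
# assumptions -- inspect the actual form to get the right ones.
driver.get('https://financialpost.com/sign-in/')
driver.find_element(By.CSS_SELECTOR, 'input[type=email]').send_keys('email@email.com')
driver.find_element(By.CSS_SELECTOR, 'input[type=password]').send_keys('my_password')
driver.find_element(By.CSS_SELECTOR, 'button[type=submit]').click()

# The browser now holds the login cookies, so the full article is served.
driver.get('https://financialpost.com/pmn/business-pmn/hydrogen-is-every-u-s-gas-utilitys-favorite-hail-mary-pass')
soup = BeautifulSoup(driver.page_source, 'html.parser')
for section in soup.find_all('section', attrs={'class': 'article-content__content-group'}):
    print(section.find('p').text)

driver.quit()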
