How to scrape a website that requires login using BeautifulSoup in Python 3

Question

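I am trying to parse articles from "https://financialpost.com/"; an example link is given below. To parse them, I need to log in to their website.

I do successfully post my credentials, but it still doesn't parse the whole page, only the beginning.

How can I scrape all of the content?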

import requests
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

link = 'https://financialpost.com/sign-in/'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36'
    # Load the sign-in page and collect every named input field of the form
    res = s.get(link)
    soup = BeautifulSoup(res.text, 'html.parser')
    payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
    payload['email'] = 'email@email.com'
    payload['password'] = 'my_password'
    # Post the credentials to the sign-in URL
    s.post(link, data=payload)

# Fetch the article with urllib (a separate request, outside the session above)
url = 'https://financialpost.com/pmn/business-pmn/hydrogen-is-every-u-s-gas-utilitys-favorite-hail-mary-pass'
content_url = Request(url)
article_content = urlopen(content_url).read()
article_soup = BeautifulSoup(article_content, 'html.parser')
article_table = article_soup.find_all('section', attrs={'class': 'article-content__content-group'})
# Print the first paragraph of each matching section
for x in article_table:
    print(x.find('p').text)

Answer 1

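Using only `requests`

Using only `requests` is a bit more complicated, but possible. You first have to authenticate to obtain an authentication token, then request the article using that token, so that the site knows you are authenticated and shows the full article. To find out which API endpoints are used for authentication and for loading the site content, you can use a tool such as Chrome DevTools or Fiddler (they can record all HTTP requests, so you can manually find the ones of interest).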
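The token flow described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the JSON login body, the `token` field in the response, and the `Authorization: Bearer` header are all hypothetical, and the real endpoint and field names have to be read from the recorded HTTP requests. The function takes the session as a parameter, so it works with a `requests.Session()` or any object exposing the same `post`/`get` methods.

```python
def login_and_fetch(session, login_url, article_url, email, password):
    """Log in through `session`, then fetch the article with the same session.

    Assumptions (hypothetical, not taken from the site): the login endpoint
    accepts a JSON body with "email" and "password", and answers with JSON
    containing a "token" field.
    """
    # Step 1: authenticate. Cookies set by the server stay on the session.
    resp = session.post(login_url, json={"email": email, "password": password})
    resp.raise_for_status()

    # Step 2: pick up the token, if the endpoint returns one (assumed field).
    token = resp.json().get("token")
    headers = {"Authorization": f"Bearer {token}"} if token else {}

    # Step 3: request the article through the SAME session, so the cookies
    # and the token travel with the request.
    article = session.get(article_url, headers=headers)
    article.raise_for_status()
    return article.text


# Possible usage (the URLs are placeholders):
#   import requests
#   with requests.Session() as s:
#       html = login_and_fetch(s, "https://example.com/api/login",
#                              "https://example.com/some-article",
#                              "email@email.com", "my_password")
```

The important design point is that the article is requested through the same session that logged in; a request made outside that session carries neither the cookies nor the token.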

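Using only `selenium`

The easier way is to just use Selenium. It is a browser that can be driven from code, so you can open the login page, authenticate, and request the article, and the site will think you are a human.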
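A minimal sketch of the Selenium route might look like this. The three CSS selectors for the email field, the password field, and the submit button are hypothetical and need to be replaced with the ones found on the actual sign-in page; a real page may also need explicit waits before the elements are available. The driver is passed in, so any Selenium WebDriver instance can be used.

```python
# Locator strategy string; it has the same value as Selenium's By.CSS_SELECTOR.
CSS = "css selector"


def fetch_article_html(driver, login_url, article_url, email, password):
    """Log in through a real browser, then return the article's rendered HTML.

    The selectors below are assumptions for illustration only.
    """
    # Open the sign-in page and fill in the form like a user would.
    driver.get(login_url)
    driver.find_element(CSS, "input[name='email']").send_keys(email)
    driver.find_element(CSS, "input[name='password']").send_keys(password)
    driver.find_element(CSS, "button[type='submit']").click()

    # The browser keeps the login cookies, so the article loads in full.
    driver.get(article_url)
    return driver.page_source


# Possible usage (requires the selenium package and a browser driver):
#   from selenium import webdriver
#   driver = webdriver.Chrome()
#   try:
#       html = fetch_article_html(driver, "https://financialpost.com/sign-in/",
#                                 article_url, "email@email.com", "my_password")
#   finally:
#       driver.quit()
```

The returned HTML can then be handed to `BeautifulSoup(html, 'html.parser')` and parsed the same way as in the question.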

Solution 1
0 2022-05-04 14:50:08
