
How to crawl a website that requires login using BeautifulSoup in Python3

I'm trying to parse articles from 'https://financialpost.com/'; an example link is provided below. To parse them, I need to log in to the website.

I can post my credentials successfully, but the script still does not parse the entire webpage, only the beginning.

How do I crawl everything?

import requests
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

link = 'https://financialpost.com/sign-in/'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36'
    res = s.get(link)
    soup = BeautifulSoup(res.text,'html.parser')   
    payload = {i['email']:i.get('value','') for i in soup.select('input[email]')}         
    payload['email'] = 'email@email.com'
    payload['password'] = 'my_password'
    s.post(link,data=payload)

url = 'https://financialpost.com/pmn/business-pmn/hydrogen-is-every-u-s-gas-utilitys-favorite-hail-mary-pass'
content_url = Request(url)
article_content = urlopen(content_url).read()
article_soup = BeautifulSoup(article_content, 'html.parser')
article_table = article_soup.findAll('section',attrs={'class':'article-content__content-group'})
for x in article_table:
    print(x.find('p').text)

Using just requests

It is a bit more complicated with requests alone, but it is possible. You first have to authenticate to obtain an authentication token, then request the article with that token so that the site knows you are authenticated and serves the full article. To find out which API endpoints are used to authenticate and to load the page content, you can use something like Chrome DevTools or Fiddler. Both record every HTTP request, so you can pick out the interesting ones manually.
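A minimal sketch of that flow follows. It is not the site's actual API: the login endpoint, the JSON body, the `token` response field, and the `Authorization: Bearer` header are all assumptions that have to be replaced with whatever the network tab shows. Note also that the code in the question fetches the article with `urlopen`, which shares nothing with the `requests` session, so the article request has to go through the same session that logged in.

```python
def login(session, login_api, email, password, token_field="token"):
    """POST the credentials and store the returned token on the session."""
    resp = session.post(login_api, json={"email": email, "password": password})
    resp.raise_for_status()
    token = resp.json()[token_field]  # assumed name of the token field
    # Assumption: the site expects a bearer token. It may instead rely on a
    # cookie, which a requests.Session keeps automatically.
    session.headers["Authorization"] = f"Bearer {token}"
    return token


def fetch_article_html(session, article_url):
    """Fetch the article through the SAME session so the login is sent along."""
    resp = session.get(article_url)
    resp.raise_for_status()
    return resp.text


def main():
    # Imported here so the two helpers above work with any session-like object.
    import requests
    from bs4 import BeautifulSoup

    login_api = "https://example.com/api/login"  # hypothetical endpoint
    article_url = ("https://financialpost.com/pmn/business-pmn/"
                   "hydrogen-is-every-u-s-gas-utilitys-favorite-hail-mary-pass")

    with requests.Session() as s:
        s.headers["User-Agent"] = "Mozilla/5.0"
        login(s, login_api, "email@email.com", "my_password")
        soup = BeautifulSoup(fetch_article_html(s, article_url), "html.parser")

    for section in soup.find_all("section", class_="article-content__content-group"):
        # find_all("p") yields every paragraph; find("p") returns only the first.
        for p in section.find_all("p"):
            print(p.get_text(strip=True))
```

Calling `main()` runs the whole flow once the placeholder endpoint and credentials are filled in. If the full text is loaded by a separate API call after the page renders, that call would have to be requested through the same session as well.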

Using just Selenium

An easier way is to use Selenium. It lets you drive a real browser from code, so you can open the login page, authenticate, and request the article, and the site will think that you are a human.
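The Selenium route could look roughly like the sketch below. The form locators (`email`, `password`, a submit button) and the "URL changed" check used to detect a finished login are assumptions, not taken from the site; inspect the sign-in page and adjust them.

```python
def read_article_as_user(driver, login_url, article_url, email, password,
                         locators, wait_for_login):
    """Log in through the real form, then return the article page's HTML."""
    driver.get(login_url)
    driver.find_element(*locators["email"]).send_keys(email)
    driver.find_element(*locators["password"]).send_keys(password)
    driver.find_element(*locators["submit"]).click()
    wait_for_login(driver)  # block until the login has gone through
    driver.get(article_url)
    return driver.page_source


def main():
    # Imported here so the helper above can be exercised without a browser.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    login_url = "https://financialpost.com/sign-in/"
    article_url = ("https://financialpost.com/pmn/business-pmn/"
                   "hydrogen-is-every-u-s-gas-utilitys-favorite-hail-mary-pass")
    # Assumed locators: check the actual sign-in form and change as needed.
    locators = {
        "email": (By.NAME, "email"),
        "password": (By.NAME, "password"),
        "submit": (By.CSS_SELECTOR, "button[type='submit']"),
    }

    driver = webdriver.Chrome()
    try:
        html = read_article_as_user(
            driver, login_url, article_url, "email@email.com", "my_password",
            locators,
            # Assumption: a successful login navigates away from the sign-in URL.
            wait_for_login=lambda d: WebDriverWait(d, 15).until(
                EC.url_changes(login_url)),
        )
        print(len(html))
    finally:
        driver.quit()
```

The returned HTML can be handed to `BeautifulSoup(html, 'html.parser')` and parsed exactly as in the question, since the browser has already executed the page's JavaScript.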


 