How to crawl a website that requires login using BeautifulSoup in Python3

Question

I'm trying to parse articles from 'https://financialpost.com/'; an example link is provided below. To parse them, I need to log in to the website.

I do successfully post my credentials; however, it still does not parse the entire webpage, just the beginning.

How do I crawl everything?

import requests
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

link = 'https://financialpost.com/sign-in/'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36'
    res = s.get(link)
    soup = BeautifulSoup(res.text,'html.parser')   
    payload = {i['email']:i.get('value','') for i in soup.select('input[email]')}         
    payload['email'] = 'email@email.com'
    payload['password'] = 'my_password'
    s.post(link,data=payload)

url = 'https://financialpost.com/pmn/business-pmn/hydrogen-is-every-u-s-gas-utilitys-favorite-hail-mary-pass'
content_url = Request(url)
article_content = urlopen(content_url).read()
article_soup = BeautifulSoup(article_content, 'html.parser')
article_table = article_soup.findAll('section',attrs={'class':'article-content__content-group'})
for x in article_table:
    print(x.find('p').text)

Answer 1

Using just `requests`

It is possible with just `requests`, but a bit more involved. You first have to authenticate to obtain an authentication token, then request the article with that token so the site knows you are logged in and serves the full article. To find out which API endpoints are used to authenticate and to load the page content, use something like Chrome DevTools or Fiddler: both record every HTTP request, so you can manually pick out the relevant ones.
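The flow above can be sketched roughly as follows. This is a minimal sketch, not the site's confirmed login mechanism: it assumes the sign-in form posts `email` and `password` fields back to the sign-in URL (as the question's code does) and that the login state is carried by the session. The real site may instead use a separate API endpoint and token, which you would need to find in DevTools. The helper names `build_login_payload` and `fetch_article_paragraphs` are hypothetical. The main difference from the question's code is that the article is fetched through the same `requests.Session` that logged in, rather than through `urlopen`, which shares nothing with that session.

```python
import requests
from bs4 import BeautifulSoup

LOGIN_URL = "https://financialpost.com/sign-in/"


def build_login_payload(login_html, email, password):
    """Collect the form's named inputs (e.g. hidden fields) and add credentials."""
    soup = BeautifulSoup(login_html, "html.parser")
    payload = {
        field["name"]: field.get("value", "")
        for field in soup.select("input[name]")
    }
    # Assumption: the form fields are called "email" and "password".
    payload["email"] = email
    payload["password"] = password
    return payload


def fetch_article_paragraphs(article_url, email, password):
    """Log in, then fetch the article through the SAME session."""
    with requests.Session() as session:
        session.headers["User-Agent"] = "Mozilla/5.0"
        login_page = session.get(LOGIN_URL, timeout=30)
        payload = build_login_payload(login_page.text, email, password)
        session.post(LOGIN_URL, data=payload, timeout=30).raise_for_status()

        # Reusing the session sends whatever cookies the login set.
        article = session.get(article_url, timeout=30)
        article.raise_for_status()

    soup = BeautifulSoup(article.text, "html.parser")
    sections = soup.find_all("section", class_="article-content__content-group")
    return [p.get_text(strip=True) for s in sections for p in s.find_all("p")]
```

If the article body is still truncated after this, the content is probably loaded by JavaScript or the login goes through a different endpoint, and the recorded requests in DevTools are the place to look.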

Using just `selenium`

An easier way is to use Selenium. It drives a real browser from code, so you can open the login page, authenticate, and then request the article, and the site will treat you like a human visitor.
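A rough sketch of that approach is shown below. It assumes Selenium 4 with Chrome available locally, and the CSS selectors for the email field, password field, and submit button are placeholders that have not been checked against the real sign-in page, so inspect the form and adjust them. It also assumes a successful login navigates away from the sign-in URL. The function names are hypothetical. The rendered HTML is handed to BeautifulSoup using the same `section` class as the question's code.

```python
from bs4 import BeautifulSoup


def extract_paragraphs(page_html):
    """Return the text of every <p> inside the article content sections."""
    soup = BeautifulSoup(page_html, "html.parser")
    sections = soup.find_all("section", class_="article-content__content-group")
    return [p.get_text(strip=True) for s in sections for p in s.find_all("p")]


def fetch_rendered_html(login_url, article_url, email, password):
    """Log in through a real browser and return the rendered article HTML."""
    # Imported here so extract_paragraphs() works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        wait = WebDriverWait(driver, 20)
        driver.get(login_url)

        # Assumption: placeholder selectors; check them against the real form.
        email_box = wait.until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "input[type='email']"))
        )
        email_box.send_keys(email)
        driver.find_element(By.CSS_SELECTOR, "input[type='password']").send_keys(password)
        driver.find_element(By.CSS_SELECTOR, "button[type='submit']").click()

        # Assumption: a successful login navigates away from the sign-in URL.
        wait.until(EC.url_changes(login_url))

        driver.get(article_url)
        wait.until(
            EC.presence_of_element_located(
                (By.CSS_SELECTOR, "section.article-content__content-group")
            )
        )
        return driver.page_source
    finally:
        driver.quit()
```

Usage would look like `extract_paragraphs(fetch_rendered_html(login_url, article_url, email, password))`. Keeping the parsing separate from the browser step means the BeautifulSoup part can be tested against saved HTML.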
