Scrape a password-protected website with no token

(I'm sorry for my English, I'll try to do my best):

I'm a newbie in Python and I'm looking for help with some web scraping. I already have working code to get the links I want, but the website is protected by a password. With the help of many questions I read here, I managed to get working code that scrapes the website after login, but the links I want are on another page:

The login page is http://fantasy.trashtalk.co/login.php

The landing page after login (the one I scrape with this code) is http://fantasy.trashtalk.co/

And the page I want is http://fantasy.trashtalk.co/?tpl=classement&t=1

So I have this code (some imports are probably useless, they come from another script):

from bs4 import BeautifulSoup
import requests
from lxml import html
import urllib.request
import re

username = 'myusername'
password = 'mypass'
url = "http://fantasy.trashtalk.co/?tpl=classement&t=1"
log = "http://fantasy.trashtalk.co/login.php"

values = {'email': username,
          'password': password}

r = requests.post(log, data=values)

# Not sure about the code below but it works.
data = r.text

soup = BeautifulSoup(data, 'lxml')

tags = soup.find_all('a')

for link in soup.findAll('a', attrs={'href': re.compile("^https://")}):
    print(link.get('href'))

I understand that this code only lets me reach the login page and then scrape what comes next (the landing page), but I can't figure out how to "save" my login info so I can access the page I want to scrape.

I think I should add something like this after the login code, but when I do, it only scrapes the links from the login page:

s = requests.get(url)

I also read some topics here that use a "with session" approach, but I didn't manage to make it work.

Any help would be appreciated. Thank you for your time.

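The issue is that you need to post your login credentials through a session object rather than with a standalone requests.post call. A session keeps the cookies the server sets at login and sends them with every later request; separate requests.post and requests.get calls share nothing, so the second request arrives unauthenticated. I've modified your code below, and you now have access to the HTML tags on the scrape_url page. Good luck!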

import requests
from bs4 import BeautifulSoup

username = 'email'
password = 'password'
scrape_url = 'http://fantasy.trashtalk.co/?tpl=classement&t=1'

login_url = 'http://fantasy.trashtalk.co/login.php'
login_info = {'email': username, 'password': password}

# Start a session; it stores the cookies set at login and sends them on later requests.
session = requests.Session()

# Log in using your authentication information.
session.post(url=login_url, data=login_info)

# Request the page you want to scrape, using the same session.
response = session.get(url=scrape_url)

soup = BeautifulSoup(response.content, 'html.parser')

# href=True skips <a> tags without an href, which would raise a KeyError below.
for link in soup.find_all('a', href=True):
    print('\nLink href: ' + link['href'])
    print('Link text: ' + link.text)
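
If you would rather use the "with session" form you mentioned, here is a minimal sketch of the same login-then-fetch flow with a couple of sanity checks added. The names fetch_after_login and scrape are hypothetical helpers, the 10-second timeout is an arbitrary choice, and the check for being sent back to the login page is only an assumption about how this site reacts to a failed login, so adjust it to what you actually see.

```python
import requests

LOGIN_URL = 'http://fantasy.trashtalk.co/login.php'
SCRAPE_URL = 'http://fantasy.trashtalk.co/?tpl=classement&t=1'


def fetch_after_login(session, login_url, scrape_url, credentials):
    """Log in through `session`, then return the HTML of `scrape_url`."""
    login_response = session.post(login_url, data=credentials, timeout=10)
    login_response.raise_for_status()  # stop on HTTP errors (4xx/5xx)

    # Same session, so cookies stored during login are sent with this request.
    page_response = session.get(scrape_url, timeout=10)
    page_response.raise_for_status()

    # ASSUMPTION: a failed login redirects back to the login page.
    if page_response.url.startswith(login_url):
        raise RuntimeError('Still on the login page; check the credentials.')
    return page_response.text


def scrape(credentials):
    """Open a session with `with` so it is closed when the block ends."""
    with requests.Session() as session:
        return fetch_after_login(session, LOGIN_URL, SCRAPE_URL, credentials)


# Usage (performs real network requests):
# html = scrape({'email': 'email', 'password': 'password'})
```

The returned HTML can be passed to BeautifulSoup exactly as in the code above. Keeping the session as a parameter of fetch_after_login also makes it possible to try the logic with a stand-in object before pointing it at the real site.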
