
Scrape password protected website with no token

(I'm sorry for my English, I'll try my best.)

I'm a newbie in Python and I'm seeking help with some web scraping. I already have working code to get the links I want, but the website is protected by a password. With the help of a lot of questions I read, I managed to get working code to scrape the website after login, but the links I want are on another page:

The login page is http://fantasy.trashtalk.co/login.php

The landing page (the one I scrape with this code) after login is http://fantasy.trashtalk.co/

The page I want is http://fantasy.trashtalk.co/?tpl=classement&t=1

So I have this code (some imports are probably useless; they come from another piece of code):

from bs4 import BeautifulSoup
import requests
from lxml import html
import urllib.request
import re

username = 'myusername'
password = 'mypass'
url = "http://fantasy.trashtalk.co/?tpl=classement&t=1"
log = "http://fantasy.trashtalk.co/login.php"

values = {'email': username,
          'password': password}

r = requests.post(log, data=values)

# Not sure about the code below but it works.
data = r.text

soup = BeautifulSoup(data, 'lxml')

for link in soup.find_all('a', attrs={'href': re.compile("^https://")}):
    print(link.get('href'))

I understand that this code only allows me to access the login page and then scrape what comes next (the landing page), but I can't figure out how to "save" my login info to access the page I want to scrape.

I think I should add something like this after the login code, but when I do, it only scrapes my links from the login page:

s = requests.get(url)

I also read some topics here using the "with session" approach, but I didn't manage to make it work.

Any help would be appreciated. Thank you for your time.

The issue was that you needed to save your login credentials by posting them through a session object, not a plain request. A `Session` stores the cookies returned by the login POST and sends them on every later request, which is what keeps you logged in. I've modified your code below, and you now have access to the HTML tags located on the scrape_url page. Good luck!

import requests
from bs4 import BeautifulSoup

username = 'email'
password = 'password'
scrape_url = 'http://fantasy.trashtalk.co/?tpl=classement&t=1'

login_url = 'http://fantasy.trashtalk.co/login.php'
login_info = {'email': username, 'password': password}

# Start a session so login cookies persist across requests.
session = requests.Session()

# Log in using your authentication information.
session.post(url=login_url, data=login_info)

# Request the page you want to scrape; the session sends the login cookies.
response = session.get(url=scrape_url)

soup = BeautifulSoup(response.content, 'html.parser')

# href=True skips anchors without an href, avoiding a KeyError.
for link in soup.find_all('a', href=True):
    print('\nLink href: ' + link['href'])
    print('Link text: ' + link.text)
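Since you also asked about the "with session" approach: the same flow can be wrapped in a context manager so the session (and its pooled connections) is closed automatically when you're done. The sketch below is one way to structure it, not the only one; it assumes the same URLs and form field names (`email`/`password`) as the code above, and adds `raise_for_status()` so an HTTP error fails loudly instead of silently scraping an error page.

```python
import requests
from bs4 import BeautifulSoup

# URLs and form field names taken from the question; another site's
# login form may use different field names.
LOGIN_URL = 'http://fantasy.trashtalk.co/login.php'
SCRAPE_URL = 'http://fantasy.trashtalk.co/?tpl=classement&t=1'

def extract_links(html):
    """Return the href of every anchor that actually has one."""
    soup = BeautifulSoup(html, 'html.parser')
    # href=True skips <a> tags without an href, avoiding a KeyError.
    return [a['href'] for a in soup.find_all('a', href=True)]

def fetch_standings_links(email, password):
    # The "with" block closes the session when done, while still
    # sharing the login cookies between the POST and the GET.
    with requests.Session() as session:
        login = session.post(LOGIN_URL, data={'email': email, 'password': password})
        login.raise_for_status()  # surface HTTP errors from the login POST
        page = session.get(SCRAPE_URL)
        page.raise_for_status()
        return extract_links(page.text)
```

You would call it as `fetch_standings_links('you@example.com', 'secret')`. Note that `raise_for_status()` only catches HTTP-level errors; many login forms return a 200 page saying "wrong password", so it's worth also checking the response body for a logged-in marker.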
