
How to get raw data from a password-protected Pastebin paste?

I want to fetch the raw data of a password-locked Pastebin link with Python, supplying the password, but I can't figure out how.

Is it impossible to get Pastebin raw data with Python's requests module and a POST request? I tried the code below, but it raises an error.

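import requests

url = "https://pastebin.com/URL"
# headers was not defined in the original snippet; a plain User-Agent is assumed here
headers = {'User-Agent': 'Mozilla/5.0'}
pass_data = {'PostPasswordVerificationForm[password]': 'password'}
res = requests.post(url, headers=headers, data=pass_data)
text = res.text
print(text)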

It raises the following error:

raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='pastebin.com', port=443): 
Max retries exceeded with url: /URL (Caused by SSLError(SSLCertVerificationError
(1, '[SSL: CERTIFICATE_VERIFY_FAILED]certificate verify failed: 
self signed certificate in certificate chain (_ssl.c:1123)')))

Can someone please tell me what I can use instead?

Note: Consider using Pastebin's API and Pastebin's scraping API.

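Your certificate verification failed, which usually points to something sitting between you and Pastebin and presenting its own certificate: a proxy, Tor, a VPN, or a misconfigured network. If you still want to proceed, pass verify=False as an argument to requests.post(). Be aware that this disables certificate checking entirely, so whatever is in the middle can read the password you send: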

requests.post(url="...", verify=False)

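If you are using a VPN, perhaps you've been provided with a root certificate for your machine; you can make requests trust it with verify="path to CA bundle". (The cert=("path to cert", "path to key") argument is for client-side certificates, which is a different mechanism.)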
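As a rough sketch of trusting a root certificate supplied by a VPN or proxy: requests accepts a path to a CA bundle through its verify argument. The helper names choose_verify and fetch, and whatever path you pass in, are hypothetical and only serve the illustration.

```python
import os


def choose_verify(ca_bundle_path):
    """Return the value to pass as `verify=` to requests.

    If the (hypothetical) CA bundle file exists, return its path so the
    extra root certificate is trusted. Otherwise return True, which keeps
    the default certificate verification switched on.
    """
    if ca_bundle_path and os.path.isfile(ca_bundle_path):
        return ca_bundle_path
    return True


def fetch(url, ca_bundle_path=None):
    # Imported lazily so choose_verify() can be tried without requests installed.
    import requests

    return requests.get(url, verify=choose_verify(ca_bundle_path), timeout=10)
```

Calling fetch("https://pastebin.com", "/path/to/ca-bundle.pem") would then verify the connection against that bundle instead of turning verification off.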

If you are using Tor, it's better to drop that circuit and create a new one.

For a proxy it's complicated: it can be a certificate issue, or the proxy can simply be misconfigured or broken.

You can verify that no proxy is in use by checking your network settings (OS specific) and the environment variables the requests package works with:

  • http_proxy
  • HTTP_PROXY
  • https_proxy
  • HTTPS_PROXY
  • CURL_CA_BUNDLE (or REQUESTS_CA_BUNDLE)
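The environment check can be scripted. The sketch below only reports which proxy and CA-bundle variables are set to a non-empty value; the function name proxy_related_settings is made up for this example, and REQUESTS_CA_BUNDLE is included because requests reads it as well.

```python
import os

# Proxy and CA-bundle variables that requests consults.
PROXY_RELATED_VARS = (
    "http_proxy", "HTTP_PROXY", "https_proxy", "HTTPS_PROXY",
    "CURL_CA_BUNDLE", "REQUESTS_CA_BUNDLE",
)


def proxy_related_settings(environ=None):
    """Return the proxy/CA variables that are set to a non-empty value."""
    env = os.environ if environ is None else environ
    return {name: env[name] for name in PROXY_RELATED_VARS if env.get(name)}


if __name__ == "__main__":
    found = proxy_related_settings()
    if found:
        for name, value in found.items():
            print(f"{name}={value}")
    else:
        print("No proxy-related environment variables are set.")
```

An empty result rules out the environment only; a proxy configured at the OS level still has to be checked in the network settings.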

Edit: I've just re-checked Pastebin; the RAW text option is only available for unprotected pastes. However, you can get to the HTML version by inspecting the traffic in the browser's Network tab and then re-assembling it in code: keep the session and replicate the cookies and headers you see there. You should get something like this:

import requests as r
ses = r.Session()
cookie = ses.get("https://pastebin.com").cookies["_csrf-frontend"]
# The missing step here is reworking the provided CSRF by client-side
# JS which is "hidden" in the minified jquery.min.js (or at least the
# `POST` is issued by it). Once you have it, you can put it to the
# data field
print(ses.post(
    url='https://pastebin.com/<your paste>',
    headers={
        'User-Agent': "<user agent to spoof it's via Requests>",
        'Accept': (
            'text/html'
            ',application/xhtml+xml'
            ',application/xml'
            ';q=0.9,image/webp,*/*;q=0.8'
        ),
        'Accept-Language': 'en-US,en;q=0.5',
        'Content-Type': 'application/x-www-form-urlencoded'
    },
    data=(
        '_csrf-frontend=<JS-manipulated CSRF value>'
        '&is_burn=1'
        '&PostPasswordVerificationForm%5Bpassword%5D=<pass>'
    )
).text)

Afterwards just check for the tag with RAW in it and then parse it, either with some quick regex (obligatory "it's a stupid idea" post) or with a less error-prone solution such as BeautifulSoup.

Nevertheless, captchas, IP blacklisting, "clever" CSRF handling and similar measures will eventually stop this kind of scraping. Even if they don't, it's trivially easy to build an application that dynamically changes its class names, tag names, etc. in Angular just to mess with your scraping for the lulz (Google Docs love this stuff, personal experience). So if you intend to do something serious with it, just use the API.
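A minimal sketch of the "find the tag with RAW in it" step, using the standard library's html.parser instead of BeautifulSoup so it runs without extra packages. The sample markup is hypothetical and is not Pastebin's real page structure; the class and function names are made up for the example.

```python
from html.parser import HTMLParser


class RawLinkFinder(HTMLParser):
    """Collect href values of <a> tags whose visible text contains 'raw'."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            if "raw" in "".join(self._text).lower():
                self.links.append(self._href)
            self._href = None


def find_raw_links(html):
    finder = RawLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.links


# Hypothetical markup, NOT Pastebin's real page structure.
SAMPLE = (
    '<div><a href="/dl/abc">download</a>'
    '<a class="btn" href="/raw/abc">raw</a></div>'
)
print(find_raw_links(SAMPLE))
```

The substring match is deliberately loose, so inspect the real response first and tighten the condition to the actual tag and attributes you see.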

Edit2: Minor FAQ for scraping / why to use the API

  • If the website doesn't allow scraping or forbids it in its ToS, you should not be doing it. People mostly ignore this, but it's not smart to do it from a non-anonymous device/IP, especially if there's an idea of making money out of it, because then people start looking (even legally).
  • No, Tor will not work, especially because you run into captchas everywhere once you're on it.
  • Yes, anyone who is at least a bit capable of reading server logs can figure out what you're doing and block you by IP or User-Agent, or just mess with you by serving random data (did that, was quite fun to see the traffic logs later on :D).
  • Yes, even VPNs and proxies can be blocked, just like Tor; the difference is that they log your activity and make you pay.
  • Once Pastebin changes any part of the scraped flow, you get to re-invent it from scratch.


 