
Python scraping HTTPError: 403 Client Error: Forbidden for url:

My Python code used to work, but when I tried it today it did not work anymore. I assume the website owner recently started blocking non-browser requests.

code

import requests, bs4
res = requests.get('https://manga1001.com/日常-raw-free/')
res.raise_for_status()  # raises HTTPError for 4xx/5xx responses
print(res.text)

I read that passing a headers argument to requests.get may work, but I don't know exactly which header information I need to make it work.

error

---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
<ipython-input-15-ed1948d83d51> in <module>
      3 # res = requests.get('https://manga1001.com/日常-raw-free/', headers=headers_dic)
      4 res = requests.get('https://manga1001.com/日常-raw-free/')
----> 5 res.raise_for_status()
      6 print(res.text)
      7 

~/opt/anaconda3/lib/python3.8/site-packages/requests/models.py in raise_for_status(self)
    939 
    940         if http_error_msg:
--> 941             raise HTTPError(http_error_msg, response=self)
    942 
    943     def close(self):

HTTPError: 403 Client Error: Forbidden for url: https://manga1001.com/%E6%97%A5%E5%B8%B8-raw-free/

requests.get has a headers argument

res = requests.get('https://manga1001.com/日常-raw-free/', headers={})  # what belongs in this dict?

I think adding a proper value here could make it work, but I don't know what the value is. I would really appreciate it if you could tell me. And if you know any other way to make it work, that would also be quite helpful.

Btw, I have also tried the code below, but it didn't work either.

code 2

from requests_html import HTMLSession

url = "https://search.yahoo.co.jp/realtime"

session = HTMLSession()
r = session.get(url)

r.html.render()  # render() returns None, so don't reassign r

print(r.html.html)

FYI, HTMLSession's render() may not work in an interactive environment such as Jupyter Notebook, so I tried it after saving it as a Python file, but it still did not work.

When I run the first code without res.raise_for_status(), I can see HTML containing "Why do I have to complete a CAPTCHA?" and a "Cloudflare Ray ID", which shows what the problem is. The site uses Cloudflare to detect scripts/bots/hackers/spammers, and it uses a CAPTCHA to check them. But if I use the header 'User-Agent' with a value from a real browser, or even the short 'Mozilla/5.0', then it gets the expected page.

It works for me with both pages.

import requests

headers = {
    'User-Agent': 'Mozilla/5.0'
}

url = 'https://manga1001.com/日常-raw-free/'
#url = 'https://search.yahoo.co.jp/realtime'

res = requests.get(url, headers=headers)

print('status_code:', res.status_code)
print(res.text)

BTW:

If you run it often, for many links, in a short time, then it may display the CAPTCHA again, and then you may need other methods to behave more like a real human - i.e. sleep() with a random time, Session() to reuse cookies, first getting the main page (to get fresh cookies) and only later this page, and adding other headers.
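Here's a minimal sketch of those ideas (the warm-up URL, the delay range, and the overall flow are my own assumptions for illustration, not something the site is known to require):

import random
import time
import requests

headers = {'User-Agent': 'Mozilla/5.0'}

with requests.Session() as session:  # one Session keeps cookies between requests
    session.headers.update(headers)
    session.get('https://manga1001.com/')  # warm up: fetch the main page for fresh cookies
    time.sleep(random.uniform(2, 5))       # random human-like pause
    res = session.get('https://manga1001.com/日常-raw-free/')
    print('status_code:', res.status_code)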

I wanted to expand on the answer given by @Furas, because I understand his fix will not be the solution in all cases. Yes, in this instance you're getting the 403 and the Cloudflare/security CAPTCHA page when you make a request because you don't "score" high enough on the security system (your HTTP client isn't similar enough to a real browser).

This raises a bigger question: what is a real browser, and what score do I need to beat? How do I increase my browser's score and make my HTTP-request-based client look more real to the bot protection?

Firstly, it's important to understand that these 403/security blocks are based on different levels of security. Something that works on one site may not work on another due to different security configurations/versions. Two sites may use the same security system, and yet the request you make may only work on one of them.

Why would they have different configurations, rather than everyone using the highest security available? Because each additional security measure brings more false positives and more challenges to pass. At a large scale, or for an e-commerce store, this can mean lost sales due to a poor user experience, or additional bugs/downtime introduced by the security program.

What is a real browser? A real browser can perform SSL/TLS handshakes, parse and run JavaScript, and make TCP connections and HTTP requests. Along with this, the security programs analyze the patterns and timings of everything from Layer 2 upwards to see if you're a "real" human. When you use something like Python to make a bare HTTP(S) request, it's really easy for these security programs to recognize you as a bot without some heavy configuration.
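To see how plainly a bare requests call identifies itself, you can print the library's default headers (the exact values depend on your requests version; the output in the comment below is approximate):

import requests

# requests announces itself in its default User-Agent,
# which is one of the easiest bot signals to spot.
print(requests.Session().headers)
# e.g. {'User-Agent': 'python-requests/2.31.0', 'Accept-Encoding': 'gzip, deflate',
#       'Accept': '*/*', 'Connection': 'keep-alive'}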

One way that security systems combat bots is by putting a JavaScript challenge as a proxy between the bot and the site. This requires running client-side JavaScript, which bots cannot do by default. Not only do you need to run the client-side JavaScript, the result also needs to be similar to one your own browser would generate. The challenge can typically consist of a few hundred individual "browser" checks, sometimes along with a manual CAPTCHA, to fingerprint and track your browser and see if you're a human (this is the page you're seeing).
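One common way around such challenges is to let a real browser engine run the JavaScript for you. Here's a minimal sketch using Selenium (my own example, not something either answer prescribes; note that headless mode itself can be detected on stricter sites, and a manual CAPTCHA will still block you):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless=new')  # headless flag; may lower your "score" on strict sites
driver = webdriver.Chrome(options=options)
try:
    driver.get('https://manga1001.com/日常-raw-free/')
    # The JS challenge runs inside a real browser engine, so page_source
    # holds the post-challenge HTML (if the challenge passes automatically).
    print(driver.page_source[:500])
finally:
    driver.quit()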

The typical, lower-standard security systems/configurations can be beaten by using the correct headers (with correct capitalization, header order, and HTTP version). As @Furas mentioned, using consistent sessions can also help create longer-lasting sessions before getting another 403. More advanced, higher-level security configurations can track you at lower levels by looking at flags of the TCP connection (such as window size) and by JA3 fingerprinting, which analyzes the TLS handshake and looks at your cipher suites and ALPN, amongst other things. Security systems can see characteristics which differentiate between browsers, browser versions, and operating systems, and compare these all together to generate your realness score. Your IP can also be an important factor: requests can be cross-checked against other sites, intervals, and older requests you tried before, and much more. You can use proxies to divide your requests between and look less suspicious, but this can come with additional problems and can affect your requests, also causing them to be fingerprinted and blocked.
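As a rough sketch of the "correct headers" part (the values below are examples copied from a typical Chrome session, not verified against this particular site, and plain requests still cannot change its TLS/JA3 fingerprint, so this only helps against lower-tier checks):

import requests

session = requests.Session()
session.headers.clear()  # drop requests' defaults so only the headers below are sent
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/120.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate',
    'Connection': 'keep-alive',
})

res = session.get('https://manga1001.com/日常-raw-free/')
print('status_code:', res.status_code)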

To understand this better, there's a great site you can visit in your browser and also make a GET request to: check your browser's "rank" there and look at the different values which can be seen from the TLS handshake alone.

I hope this provides some insight into why a block might appear, although it's impossible to tell for certain from a single URL, since blocks can appear for such a variety of reasons.
