Access denied [403] when accessing a site with BeautifulSoup in Python

Question

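I want to scrape https://www.jdsports.it/ with BeautifulSoup, but access is denied.

I can open the site in the browser on my PC without any problem, and my Python program sends the same User-Agent, yet the program gets a different result. The output is shown below.

Edit: I think I need cookies to access the site. How can I obtain them and use them from a Python program to scrape it?

- If I use "https://www.jdsports.com", the script works. It is the same site for a different region.

Thanks!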

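import time
import requests
from bs4 import BeautifulSoup
import smtplib

# Same User-Agent string as the desktop browser
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}

url = 'https://www.jdsports.it/'

response = requests.get(url, headers=headers)
# A 403 status here means the server refused the request
print(response.status_code)

soup = BeautifulSoup(response.content, 'html.parser')

# Print the parsed document; the original soup.findAll.get_text() raises
# AttributeError because findAll is a method and was never called
print(soup)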

The output is:

<html><head>
<title>Access Denied</title>
</head><body>
<h1>Access Denied</h1>

You don't have permission to access "http://www.jdsports.it/" on this server.<p>
Reference #18.35657b5c.1589627513.36921df8
</p></body>
</html>
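The response above is the block page itself, so parsing it can only ever yield the "Access Denied" text. Below is a minimal sketch of detecting that case before any scraping logic runs. The names `TitleParser` and `looks_blocked` are hypothetical, and the check (a 401/403 status or an "Access Denied" title) is an assumption based only on the output shown here. It uses the standard-library HTML parser so that it runs without extra dependencies; with BeautifulSoup the title would come from `soup.title` instead.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def looks_blocked(status_code, html):
    """Return True if the response looks like a block page.

    Assumption: a block is signalled by a 401/403 status code or by a page
    whose title is "Access Denied", as in the output above.
    """
    parser = TitleParser()
    parser.feed(html)
    denied_title = parser.title.strip().lower() == "access denied"
    return status_code in (401, 403) or denied_title
```

With `requests`, this would be called as `looks_blocked(response.status_code, response.text)` before handing the body to the scraper.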

python beautifulsoup user-agent cookies python-requests

Answer 1

I first suspected HTTP/2, but that did not work either. Maybe you will have better luck; here is an HTTP/2 starting point:

import asyncio
import httpx
import logging

# DEBUG logging shows connection details for each request
logging.basicConfig(format='%(message)s', level=logging.DEBUG)
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
}
url = 'https://www.jdsports.it/'

async def f():
    # http2=True needs the optional extra: pip install httpx[http2]
    async with httpx.AsyncClient(http2=True) as client:
        # follow_redirects replaced allow_redirects in httpx 0.20
        r = await client.get(url, follow_redirects=True, headers=headers)
        print(r.text)

asyncio.run(f())

(Tested on both Windows and Linux.) Could this be related to TLS 1.2? That is where I would look next, since curl works.
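One way to follow up on the TLS question is to check what Python itself negotiates with the server and compare it with what `curl -v` reports. The following is a minimal sketch using only the standard library. The function names `make_context` and `negotiated_tls` are hypothetical, and whether the TLS version or ALPN protocol actually explains the 403 is an open assumption, not something the thread confirms.

```python
import socket
import ssl


def make_context(min_version=ssl.TLSVersion.TLSv1_2):
    """Build a client context that requires at least the given TLS version."""
    ctx = ssl.create_default_context()
    ctx.minimum_version = min_version
    # Offer HTTP/2 and HTTP/1.1 via ALPN, as browsers and curl usually do
    ctx.set_alpn_protocols(["h2", "http/1.1"])
    return ctx


def negotiated_tls(host, port=443, timeout=10.0):
    """Connect to host and return (TLS version, ALPN protocol) as negotiated."""
    ctx = make_context()
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            return tls.version(), tls.selected_alpn_protocol()
```

Calling `negotiated_tls("www.jdsports.it")` returns the protocol version string and the selected ALPN protocol (or `None` if the server selected none), which can be set side by side with the handshake lines in curl's verbose output.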

Solution 1
0 2020-05-17 11:34:12
