I am trying to access a site that has bot protection.
With the following script, which uses requests, I can access the site:
request = requests.get(url,headers={**HEADERS,'Cookie': cookies})
and I get the desired HTML. But when I use aiohttp:
async def get_data(session: aiohttp.ClientSession, url, cookies):
    async with session.get(url, timeout=5, headers={**HEADERS, 'Cookie': cookies}) as response:
        text = await response.text()
        print(text)
I get the bot protection page as the response.
These are the headers I use for all the requests:
HEADERS = {
    'User-Agent': 'PostmanRuntime/7.29.0',
    'Host': 'www.dnb.com',
    'Connection': 'keep-alive',
    'Accept': '/',
    'Accept-Encoding': 'gzip, deflate, br'
}
I have compared the request headers sent by requests.get and by aiohttp, and they are identical.
Is there any reason the results are different? If so, why?
EDIT: I've also checked the httpx module; the problem occurs there as well, with both httpx.Client() and httpx.AsyncClient().
response = httpx.request('GET',url,headers={**HEADERS,'Cookie':cookies})
doesn't work either (this call is not asynchronous).
I tried capturing packets with Wireshark to compare requests and aiohttp.
Server:
import http.server

# Serve the current directory on localhost:8080 so the client requests can be captured
server = http.server.HTTPServer(("localhost", 8080),
                                http.server.SimpleHTTPRequestHandler)
server.serve_forever()
With requests:
import requests
url = 'http://localhost:8080'
HEADERS = {'Content-Type': 'application/json'}
cookies = ''
request = requests.get(url,headers={**HEADERS,'Cookie': cookies})
Packet sent by requests:
GET / HTTP/1.1
Host: localhost:8080
User-Agent: python-requests/2.27.1
Accept-Encoding: gzip, deflate, br
Accept: */*
Connection: keep-alive
Content-Type: application/json
Cookie:
With aiohttp:
import aiohttp
import asyncio

url = 'http://localhost:8080'
HEADERS = {'Content-Type': 'application/json'}
cookies = ''

async def get_data(session: aiohttp.ClientSession, url, cookies):
    async with session.get(url, timeout=5, headers={**HEADERS, 'Cookie': cookies}) as response:
        text = await response.text()
        print(text)

async def main():
    async with aiohttp.ClientSession() as session:
        await get_data(session, url, cookies)

asyncio.run(main())
Packet sent by aiohttp:
GET / HTTP/1.1
Host: localhost:8080
Content-Type: application/json
Cookie:
Accept: */*
Accept-Encoding: gzip, deflate
User-Agent: Python/3.10 aiohttp/3.8.1
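To make the differences between the two captures explicit, here is a minimal sketch that compares them. The header lists are copied from the packets above; the helper name compare_headers is hypothetical and only for illustration.

```python
# Header lines copied from the two captures above, in the order they appeared on the wire.
requests_headers = [
    ("Host", "localhost:8080"),
    ("User-Agent", "python-requests/2.27.1"),
    ("Accept-Encoding", "gzip, deflate, br"),
    ("Accept", "*/*"),
    ("Connection", "keep-alive"),
    ("Content-Type", "application/json"),
    ("Cookie", ""),
]
aiohttp_headers = [
    ("Host", "localhost:8080"),
    ("Content-Type", "application/json"),
    ("Cookie", ""),
    ("Accept", "*/*"),
    ("Accept-Encoding", "gzip, deflate"),
    ("User-Agent", "Python/3.10 aiohttp/3.8.1"),
]


def compare_headers(first, second):
    """Return (value_diffs, same_order) for two ordered lists of (name, value) pairs."""
    first_map = {name.lower(): value for name, value in first}
    second_map = {name.lower(): value for name, value in second}
    # Headers whose value differs, or that are present in only one capture (None).
    value_diffs = {
        name: (first_map.get(name), second_map.get(name))
        for name in sorted(first_map.keys() | second_map.keys())
        if first_map.get(name) != second_map.get(name)
    }
    same_order = [n.lower() for n, _ in first] == [n.lower() for n, _ in second]
    return value_diffs, same_order


value_diffs, same_order = compare_headers(requests_headers, aiohttp_headers)
for name, (from_requests, from_aiohttp) in value_diffs.items():
    print(f"{name}: requests={from_requests!r} aiohttp={from_aiohttp!r}")
print("same order:", same_order)
```

For these two captures, the differing headers are Accept-Encoding, Connection (absent from the aiohttp packet) and User-Agent, and the header order differs as well.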
If the site seems to accept packets from requests, then you could try making the aiohttp packet identical by setting the headers:
HEADERS = { 'User-Agent': 'python-requests/2.27.1','Accept-Encoding': 'gzip, deflate, br', 'Accept': '*/*', 'Connection': 'keep-alive', 'Content-Type': 'application/json','Cookie': ''}
If you haven't already, I suggest capturing the request with Wireshark to make sure aiohttp isn't changing your headers.
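If Wireshark is not convenient, a small local server that records the headers it receives can serve the same purpose for plain HTTP. The following is a minimal sketch using only the standard library; the names HeaderEchoHandler, start_capture_server and captured are hypothetical, and http.client is used here only as a stand-in client.

```python
import http.client
import http.server
import threading

captured = []  # one list of (name, value) pairs per request, in the order received


class HeaderEchoHandler(http.server.BaseHTTPRequestHandler):
    """Record the request headers of every GET in the order they arrived."""

    def do_GET(self):
        captured.append(list(self.headers.items()))
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the console quiet


def start_capture_server():
    # Port 0 lets the operating system pick a free port.
    server = http.server.HTTPServer(("127.0.0.1", 0), HeaderEchoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


server = start_capture_server()
port = server.server_address[1]

# Stand-in client: replace this with a requests or aiohttp call to the same URL.
conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
conn.request("GET", "/", headers={"User-Agent": "PostmanRuntime/7.29.0", "Cookie": "a=b"})
conn.getresponse().read()
conn.close()

server.shutdown()
server.server_close()

for name, value in captured[0]:
    print(f"{name}: {value}")
```

Pointing both the requests script and the aiohttp script at http://127.0.0.1:<port> and printing captured afterwards shows what each library actually sent. This only sees the HTTP layer, so anything that differs at the TLS level would not show up here.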
You can also try other user agent strings, or try sending the headers in a different order. The order is not supposed to matter, but some sites check it anyway as part of their bot protection.
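To experiment with header order, it helps to have a client where the order is fully under your control. The sketch below uses the standard library's http.client as a stand-in (not aiohttp or requests) and sends the headers exactly in the order of the requests packet above, then checks the raw bytes a throwaway local socket received. The names send_in_order and serve_once are hypothetical.

```python
import http.client
import socket
import threading


def serve_once(listener, seen):
    """Accept one connection, store the raw request bytes, send a tiny reply."""
    client, _ = listener.accept()
    with client:
        data = b""
        while b"\r\n\r\n" not in data:
            chunk = client.recv(4096)
            if not chunk:
                break
            data += chunk
        seen.append(data)
        client.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\nConnection: close\r\n\r\nok")


def send_in_order(host, port, ordered_headers):
    """Send a GET whose headers appear on the wire exactly in the given order."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    # skip_host / skip_accept_encoding stop http.client from adding its own headers first.
    conn.putrequest("GET", "/", skip_host=True, skip_accept_encoding=True)
    for name, value in ordered_headers:
        conn.putheader(name, value)
    conn.endheaders()
    response = conn.getresponse()
    body = response.read()
    conn.close()
    return response.status, body


listener = socket.socket()
listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
listener.listen(1)
port = listener.getsockname()[1]

seen = []
thread = threading.Thread(target=serve_once, args=(listener, seen), daemon=True)
thread.start()

# Same order as the requests packet shown earlier.
ordered_headers = [
    ("Host", f"127.0.0.1:{port}"),
    ("User-Agent", "python-requests/2.27.1"),
    ("Accept-Encoding", "gzip, deflate, br"),
    ("Accept", "*/*"),
    ("Connection", "keep-alive"),
]
status, body = send_in_order("127.0.0.1", port, ordered_headers)
thread.join(timeout=5)
listener.close()

raw_lines = seen[0].decode("latin-1").split("\r\n")
received_names = [line.split(":", 1)[0] for line in raw_lines[1:] if line]
print(received_names)
```

Whether a particular site actually checks the order is something you can only find out by trying; this just gives a baseline where the order is known.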