PowerShell Invoke-WebRequest works but Python Requests does not
This is about a weird situation where PowerShell's Invoke-WebRequest works as intended and Python Requests does not.
I am trying to scrape an ecommerce site using Python. Part of the scraping is to test whether an item can be added to the cart. Using the Chrome Developer Tools (F12), I was able to extract the following PowerShell scripts.
Step 1 - Request a customer session
$session = New-Object Microsoft.PowerShell.Commands.WebRequestSession
$session.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36"
$secPasswd=ConvertTo-SecureString "password" -AsPlainText -Force
$myCreds=New-Object System.Management.Automation.PSCredential -ArgumentList "username",$secPasswd
Invoke-WebRequest -UseBasicParsing -Uri "https://bck.hermes.com/customer-session?locale=de_de" `
-Proxy 'http://proxyaddress' `
-ProxyCredential $mycreds `
-WebSession $session `
-Headers @{
"sec-ch-ua"="`" Not A;Brand`";v=`"99`", `"Chromium`";v=`"99`", `"Google Chrome`";v=`"99`""
"Accept"="application/json, text/plain, */*"
"Cache-Control"="no-cache"
"DNT"="1"
"sec-ch-ua-mobile"="?0"
"sec-ch-ua-platform"="`"Windows`""
"Origin"="https://www.hermes.com"
"Sec-Fetch-Site"="same-site"
"Sec-Fetch-Mode"="cors"
"Sec-Fetch-Dest"="empty"
"Referer"="https://www.hermes.com/"
"Accept-Encoding"="gzip, deflate, br"
"Accept-Language"="en-US,en;q=0.9,ja;q=0.8,zh-CN;q=0.7,zh-TW;q=0.6,zh;q=0.5"
} | Select-Object -Expand RawContent
The response gives me an "ECOM_SESS" cookie along with a bunch of others.
I then pass the ECOM_SESS cookie to the next step.
Step 2 - add to cart
$session = New-Object Microsoft.PowerShell.Commands.WebRequestSession
$session.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36"
$session.Cookies.Add((New-Object System.Net.Cookie("ECOM_SESS", "XXXXXXXXXXXXXXXX", "/", ".hermes.com")))
$secPasswd=ConvertTo-SecureString "password" -AsPlainText -Force
$myCreds=New-Object System.Management.Automation.PSCredential -ArgumentList "username",$secPasswd
Invoke-WebRequest -UseBasicParsing -Uri "https://bck.hermes.com/add-to-cart" `
-Proxy 'http://proxyaddress' `
-ProxyCredential $mycreds `
-Method "POST" `
-WebSession $session `
-Headers @{
"sec-ch-ua"="`" Not A;Brand`";v=`"99`", `"Chromium`";v=`"99`", `"Google Chrome`";v=`"99`""
"Accept"="application/json, text/plain, */*"
"DNT"="1"
"sec-ch-ua-mobile"="?0"
"sec-ch-ua-platform"="`"Windows`""
"Origin"="https://www.hermes.com"
"Sec-Fetch-Site"="same-site"
"Sec-Fetch-Mode"="cors"
"Sec-Fetch-Dest"="empty"
"Referer"="https://www.hermes.com/"
"Accept-Encoding"="gzip, deflate, br"
"Accept-Language"="en-US,en;q=0.9,ja;q=0.8,zh-CN;q=0.7,zh-TW;q=0.6,zh;q=0.5"
} `
-ContentType "application/json" `
-Body "{`"locale`":`"de_de`",`"items`":[{`"category`":`"direct`",`"sku`":`"H079082CCAC`"}]}"
With the PowerShell scripts above, the process works perfectly and I get a response from each of the two steps. Note this is with a rotating IP proxy which refreshes the IP on each request to prevent bot detection.
However, when I tried to integrate this into my Python code, I hit a captcha requirement at Step 2, irrespective of the proxy server used.
Here is the relevant Python code:
from __future__ import print_function
import bs4
import requests
from requests.cookies import RequestsCookieJar
import jsons

def main():
    url1 = "https://bck.hermes.com/customer-session?locale=de_de"
    url2 = "https://bck.hermes.com/add-to-cart"
    proxies1 = {
        "http": "xxxxxxxxxxxxxxxxxx"
    }
    headers1 = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36',
        'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="99", "Google Chrome";v="99"',
        'Accept': 'application/json, text/plain, */*',
        'Cache-Control': 'no-cache',
        'DNT': '1',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"Windows"',
        'Origin': 'https://www.hermes.com',
        'Sec-Fetch-Site': 'same-site',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Dest': 'document',
        'Referer': 'https://www.hermes.com/',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-US,en;q=0.9,ja;q=0.8,zh-CN;q=0.7,zh-TW;q=0.6,zh;q=0.5'
    }
    headers2 = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36',
        'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="99", "Google Chrome";v="99"',
        'Accept': 'application/json, text/plain, */*',
        'DNT': '1',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"Windows"',
        'Origin': 'https://www.hermes.com',
        'Sec-Fetch-Site': 'same-site',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Dest': 'empty',
        'Referer': 'https://www.hermes.com/',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-US,en;q=0.9,ja;q=0.8,zh-CN;q=0.7,zh-TW;q=0.6,zh;q=0.5'
    }
    body2 = {"locale": "de_de", "items": [{"category": "direct", "sku": "H079082CCAC"}]}

    # Step 1
    f = requests.get(url1, headers=headers1, proxies=proxies1)
    print(f"1Response Body: {f.text}\n")
    ECOM_SESS = f.cookies['ECOM_SESS']
    cookieJar = RequestsCookieJar()
    cookieJar.set('ECOM_SESS', ECOM_SESS, domain='.hermes.com', path='/')

    # Step 2
    g = requests.post(url2, headers=headers2, cookies=cookieJar, proxies=proxies1, json=body2)
    print(f"2Response Body: {g.text}\n")

if __name__ == '__main__':
    main()
Running the Python code here, Step 1 nicely gives the intended response with the cookies needed for Step 2. However, Step 2 always results in a captcha response.
I am just curious about the difference between PowerShell's Invoke-WebRequest and Python Requests, as there has to be something fundamentally different for the former to avoid the captcha completely and the latter to always get hit with it.
Would appreciate any thoughts and insights from you guys! Thanks!
I'm not sure specifically what it is about Requests that's triggering the bot protection on the site, but based on this you might have luck using:
requests.request("POST", url2, headers=headers2, cookies=cookieJar, proxies=proxies1, json=body2)
Alternatively you could try urllib3 instead of Requests.
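One way to narrow down what differs between the two clients is to diff the header sets they send. The sketch below is only an illustration: `diff_headers` is a hypothetical helper (not part of Requests), and the two dictionaries are trimmed to the `Sec-Fetch-*` entries from the question. It does show that the Python `headers1` sends `Sec-Fetch-Dest: document` where the PowerShell step 1 sends `empty`; whether that alone explains the captcha is not established.

```python
def diff_headers(expected, actual):
    """Return {name: (expected_value, actual_value)} for headers that differ.

    Names are compared case-insensitively; a header missing on one side
    is reported with None for that side.
    """
    exp = {name.lower(): value for name, value in expected.items()}
    act = {name.lower(): value for name, value in actual.items()}
    return {
        name: (exp.get(name), act.get(name))
        for name in sorted(exp.keys() | act.keys())
        if exp.get(name) != act.get(name)
    }


# Trimmed copies of the step 1 headers from the question.
powershell_step1 = {
    "Sec-Fetch-Site": "same-site",
    "Sec-Fetch-Mode": "cors",
    "Sec-Fetch-Dest": "empty",
}
python_headers1 = {
    "Sec-Fetch-Site": "same-site",
    "Sec-Fetch-Mode": "cors",
    "Sec-Fetch-Dest": "document",
}

print(diff_headers(powershell_step1, python_headers1))
# {'sec-fetch-dest': ('empty', 'document')}
```

Running the same comparison on the full header dictionaries, and on the headers each client actually puts on the wire, would show any remaining differences.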
Here's your PowerShell script simplified too, just as an exercise.
$secPasswd=ConvertTo-SecureString "password" -AsPlainText -Force
$myCreds=New-Object System.Management.Automation.PSCredential -ArgumentList "username",$secPasswd
$headers = @{
"sec-ch-ua"='" Not A;Brand";v="99", "Chromium";v="99", "Google Chrome";v="99"'
"DNT"="1"
"sec-ch-ua-mobile"="?0"
"sec-ch-ua-platform"="`"Windows`""
"Origin"="https://www.hermes.com"
"Sec-Fetch-Site"="same-site"
"Sec-Fetch-Mode"="cors"
"Sec-Fetch-Dest"="empty"
"Referer"="https://www.hermes.com/"
}
Invoke-WebRequest -UseBasicParsing -Uri "https://bck.hermes.com/customer-session?locale=de_de" `
-Proxy 'http://proxyaddress' `
-ProxyCredential $mycreds `
-SessionVariable session `
-Headers $headers
Invoke-WebRequest -UseBasicParsing -Uri "https://bck.hermes.com/add-to-cart" `
-Proxy 'http://proxyaddress' `
-ProxyCredential $mycreds `
-Method POST `
-WebSession $session `
-Headers $headers `
-ContentType "application/json" `
-Body "{`"locale`":`"de_de`",`"items`":[{`"category`":`"direct`",`"sku`":`"H079082CCAC`"}]}"
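The simplified PowerShell above reuses one web session for both calls via `-SessionVariable`. A rough Python counterpart, offered as a sketch rather than a confirmed fix, is a `requests.Session`, which stores the cookies from step 1 and sends them on step 2. The helper names `build_session` and `add_to_cart` are made up for illustration. The sketch also sets an `https` proxy entry: Requests picks the proxy by the scheme of the target URL, so the question's `proxies1`, which only has an `http` key, would not be applied to these `https://` URLs (unless proxy environment variables are set). Proxy credentials, which PowerShell takes through `-ProxyCredential`, would go into the proxy URL itself, for example `http://username:password@proxyaddress`.

```python
import requests
from requests.utils import select_proxy

SESSION_URL = "https://bck.hermes.com/customer-session?locale=de_de"
CART_URL = "https://bck.hermes.com/add-to-cart"


def build_session(proxy_url, headers):
    """Hypothetical helper: one Session with the proxy set for both schemes."""
    session = requests.Session()
    session.headers.update(headers)
    # Requests selects a proxy by the scheme of the target URL, so an
    # https:// target needs an "https" entry; an "http" entry alone is skipped.
    session.proxies.update({"http": proxy_url, "https": proxy_url})
    return session


def add_to_cart(session, sku, locale="de_de", timeout=30):
    """Hypothetical helper: run step 1 and step 2 on the same Session."""
    # Step 1: cookies set by the response are stored in session.cookies.
    session.get(SESSION_URL, timeout=timeout).raise_for_status()
    # Step 2: stored cookies that match the URL are sent automatically;
    # json= serialises the body and sets Content-Type: application/json.
    body = {"locale": locale, "items": [{"category": "direct", "sku": sku}]}
    return session.post(CART_URL, json=body, timeout=timeout)


# Offline check of the proxy selection (no request is sent):
only_http = {"http": "http://proxyaddress"}
print(select_proxy(CART_URL, only_http))  # None: no proxy matches the https URL
```

Usage would look like `session = build_session("http://username:password@proxyaddress", headers2)` followed by `response = add_to_cart(session, "H079082CCAC")`, with `headers2` taken from the question.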