
Selenium PhantomJS can't scrape a website (bot detection)

I can't scrape this site. Here is a screenshot of the request made with Python, Selenium and PhantomJS. I don't know how they detected that it was a bot, but the page in the screenshot says that JavaScript and a captcha are required, and maybe other things are needed too? I'm definitely not scraping at superhuman speed, because this was my first request, so that was not the cause. PS: when I paste the same URL into my browser, it goes to the page I want and works fine.

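    from selenium import webdriver

    br = webdriver.PhantomJS('bin/phantomjs')  # path to the PhantomJS binary
    br.set_window_size(1366, 200)
    br.get("website")      # placeholder for the target URL
    br.save_screenshot(x)  # x: path of the screenshot file to write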

Well, I got it working now. I'll post this for the sake of others who haven't: enable JavaScript and fake the user agent.

    from selenium import webdriver

    # Copy the defaults so the shared capabilities dict is not modified
    cap = webdriver.DesiredCapabilities.PHANTOMJS.copy()
    cap["phantomjs.page.settings.javascriptEnabled"] = True
    cap["phantomjs.page.settings.loadImages"] = True
    # Present the user agent of a common desktop browser
    cap["phantomjs.page.settings.userAgent"] = 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'
    br = webdriver.PhantomJS('bin/phantomjs', desired_capabilities=cap)
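
To confirm that the settings took effect, one option is to read back what the page's own JavaScript sees. The sketch below is only an illustration: `navigator_snapshot` is a hypothetical helper name, and `driver` is assumed to be any Selenium WebDriver instance (such as `br` above) that provides `execute_script`.

```python
def navigator_snapshot(driver):
    """Return a few navigator fields as seen from the page's JavaScript context.

    `driver` is assumed to be a Selenium WebDriver (anything that exposes
    execute_script); the helper name is made up for this example.
    """
    script = (
        "return {"
        "userAgent: navigator.userAgent, "
        "language: navigator.language, "
        "pluginCount: navigator.plugins ? navigator.plugins.length : 0"
        "};"
    )
    return driver.execute_script(script)
```

Calling `navigator_snapshot(br)` after `br.get(...)` should return a dict whose `userAgent` entry matches the string set in the capabilities, if the override was applied.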

Things that can help in general:

  • Headers should be similar to those sent by common browsers
  • Navigation:
    • If you make multiple requests, put a random delay between them
    • If you open links found in a page, set the Referer header accordingly
    • Or better, simulate mouse activity to move, click and follow links
  • Images should be enabled
  • JavaScript should be enabled
    • Check that "navigator.plugins" and "navigator.language" are set in the client JavaScript page context
    • Check that the client you use does not inject noticeable JavaScript variables (like _cdc, __nightmare...)
  • Use proxies
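
Two of the navigation points above, a random pause between requests and a Referer that matches the page a link was found on, can be sketched with the standard library alone. The helper names `random_pause` and `headers_for_link` and the default delay bounds are assumptions made for this example; how the headers are attached to a request depends on the client you use.

```python
import random
import time


def random_pause(min_seconds=2.0, max_seconds=6.0, sleep=time.sleep):
    """Wait a random time between min_seconds and max_seconds; return the delay.

    The bounds are arbitrary defaults chosen for illustration. `sleep` can be
    swapped out, which makes the helper easy to test without really waiting.
    """
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


def headers_for_link(current_url, user_agent):
    """Build headers for following a link that was found on current_url."""
    return {
        "User-Agent": user_agent,
        "Referer": current_url,  # the page the link was found on
    }
```

A typical loop would call `random_pause()` before each new request and pass `headers_for_link(previous_url, user_agent)` to whatever client issues it.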


 