
Bot on Heroku - unable to scrape sites because of captcha even though everything works on my PC

I have a simple bot on Heroku that works with Discord and scrapes sites. Normally I use the requests module to scrape sites: I fetch the page source and that's all. (Note: the bot doesn't spam the sites with requests, it only pings them once per day or week. The site I'm pinging is Epic Games, but it's not the only one with a captcha.)


Later I discovered that the page source I was getting contained a captcha protection page, so I decided to use chromedriver. After setting up chromedriver on Heroku, I still got the captcha protection on these sites. On my PC it worked completely fine, even without any of the options below; it never asked for captcha verification.
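
To tell a real page apart from a captcha page in the bot itself, a check along these lines may help. This is only a minimal sketch: the function name `looks_like_captcha` and the marker strings are assumptions made for illustration, and the markers a given site actually sends have to be read from the page source returned on Heroku.

```python
# Hypothetical marker strings; real challenge pages differ per site and provider.
CAPTCHA_MARKERS = ("captcha", "g-recaptcha", "h-captcha")


def looks_like_captcha(html: str) -> bool:
    """Return True if the page source contains any of the assumed markers."""
    lowered = html.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)


# Typical use with the page source from requests or chromedriver:
#     html = requests.get(url).text      # or driver.page_source
#     if looks_like_captcha(html):
#         ...  # skip this run instead of parsing a challenge page
```

A plain substring check like this can give false positives on pages that merely mention the word "captcha", so the marker list is worth tightening once the actual challenge page is known.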

This is what I tried. (Note: I use undetected chromedriver, an optimized version of the Selenium chromedriver.)


1. The page source asked for JavaScript to be enabled, so I added a chromedriver option:

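import undetected_chromedriver as uc

opts = uc.ChromeOptions()
# JavaScript is enabled in Chrome by default, so this flag is probably redundant.
opts.add_argument("--enable-javascript")
driver = uc.Chrome(use_subprocess=True, options=opts)

driver.get(url)  # url: address of the page to scrape
print(driver.page_source)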

It still showed the captcha verification, but now without the JavaScript error.


2. After doing some research, I discovered that the Heroku IP might be on some sort of block list, so it was suggested that I add a proxy to the chromedriver options:

import undetected_chromedriver as uc

opts = uc.ChromeOptions()
opts.add_argument("--enable-javascript")
# "hostip" and "port" are placeholders for the proxy address.
opts.add_argument("--proxy-server=socks5://hostip:port")
driver = uc.Chrome(use_subprocess=True, options=opts)

driver.get(url)
print(driver.page_source)

3. I found an approach similar to the second one that seemed to work for others, but the site still showed the captcha verification:

import undetected_chromedriver as uc
import os
import shutil
import tempfile

class ProxyExtension:
    manifest_json = """
    {
        "version": "1.0.0",
        "manifest_version": 2,
        "name": "Chrome Proxy",
        "permissions": [
            "proxy",
            "tabs",
            "unlimitedStorage",
            "storage",
            "<all_urls>",
            "webRequest",
            "webRequestBlocking"
        ],
        "background": {"scripts": ["background.js"]},
        "minimum_chrome_version": "76.0.0"
    }
    """

    background_js = """
    var config = {
        mode: "fixed_servers",
        rules: {
            singleProxy: {
                scheme: "http",
                host: "%s",
                port: %d
            },
            bypassList: ["localhost"]
        }
    };

    chrome.proxy.settings.set({value: config, scope: "regular"}, function() {});

    function callbackFn(details) {
        return {
            authCredentials: {
                username: "%s",
                password: "%s"
            }
        };
    }

    chrome.webRequest.onAuthRequired.addListener(
        callbackFn,
        { urls: ["<all_urls>"] },
        ['blocking']
    );
    """

    def __init__(self, host, port, user, password):
        self._dir = os.path.normpath(tempfile.mkdtemp())

        manifest_file = os.path.join(self._dir, "manifest.json")
        with open(manifest_file, mode="w") as f:
            f.write(self.manifest_json)

        background_js = self.background_js % (host, port, user, password)
        background_file = os.path.join(self._dir, "background.js")
        with open(background_file, mode="w") as f:
            f.write(background_js)

    @property
    def directory(self):
        return self._dir

    def __del__(self):
        shutil.rmtree(self._dir)


if __name__ == "__main__":
    # Placeholders: port must be an integer, since background_js formats it with %d.
    proxy = ("hostip", port, "username", "pass")
    proxy_extension = ProxyExtension(*proxy)

    options = uc.ChromeOptions()
    options.add_argument("--enable-javascript")
    options.add_argument(f"--load-extension={proxy_extension.directory}")
    driver = uc.Chrome(use_subprocess=True, options=options)

I've also tried other options, such as adding --headless, changing the user agent to Firefox, adding a no-GPU option, and so on.
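
For the proxy flag used in attempt 2, a small helper can catch a malformed scheme or port before Chrome starts. This is a sketch only: `proxy_argument` is a hypothetical helper, not part of undetected chromedriver, and the address in the usage line is a placeholder. As far as I know the `--proxy-server` flag carries no username or password, which is presumably why attempt 3 supplies credentials through an extension instead.

```python
# Schemes assumed here for the --proxy-server flag.
ALLOWED_SCHEMES = {"http", "https", "socks4", "socks5"}


def proxy_argument(scheme: str, host: str, port: int) -> str:
    """Build the --proxy-server argument, rejecting obviously bad input."""
    if scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"unsupported proxy scheme: {scheme!r}")
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return f"--proxy-server={scheme}://{host}:{port}"


# Usage with the options object from the attempts above (placeholder address):
#     opts.add_argument(proxy_argument("socks5", "203.0.113.5", 1080))
```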

I've been trying to fix this for a month, and I hope someone knows the answer to my problem.

You are likely receiving the captcha because Heroku uses datacenter IP addresses, which are probably flagged. You have a couple of options: you could use a residential proxy and hope it isn't flagged, so that you don't get a captcha, or you could pay for a captcha-solving service such as 2Captcha or Capmonster. I'm not sure exactly what type of captcha you are getting, but both support reCaptcha. The 2Captcha docs have a lot of good information on submitting the captcha once it is solved.
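
One way to organize the "try a different route" idea is to run several fetch routes in order and keep the first page that is not a challenge. This is a rough sketch under stated assumptions: `first_unblocked`, the route names, and the `is_blocked` callback are all hypothetical, and each fetcher stands in for whatever actually retrieves the page (plain requests, chromedriver through a residential proxy, and so on).

```python
from typing import Callable, Iterable, Optional, Tuple

Fetcher = Callable[[str], str]


def first_unblocked(
    url: str,
    fetchers: Iterable[Tuple[str, Fetcher]],
    is_blocked: Callable[[str], bool],
) -> Optional[Tuple[str, str]]:
    """Return (route name, html) for the first route that is not blocked."""
    for name, fetch in fetchers:
        try:
            html = fetch(url)
        except Exception:
            # A route that fails outright is treated like a blocked one.
            continue
        if not is_blocked(html):
            return name, html
    return None  # every route failed or returned a challenge page
```

Returning the route name alongside the page makes it easy to log which route worked, which helps confirm whether the datacenter IP really is the cause.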
