Selenium works on AWS EC2 but not on AWS Lambda

I've looked at and tried nearly every other post on this topic with no luck.

EC2

I'm using Python 3.6, so I'm using the following AMI: amzn-ami-hvm-2018.03.0.20181129-x86_64-gp2 (see here). Once I SSH into my EC2 instance, I download Chrome with:

sudo curl https://intoli.com/install-google-chrome.sh | bash
cp -r /opt/google/chrome/ /home/ec2-user/
google-chrome-stable --version
# Google Chrome 86.0.4240.198 

And download and unzip the matching Chromedriver:

sudo wget https://chromedriver.storage.googleapis.com/86.0.4240.22/chromedriver_linux64.zip
sudo unzip chromedriver_linux64.zip
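
As a quick sanity check on the "matching" part, the two version strings can be compared programmatically. The following is only a minimal sketch: the helper names parse_version and same_major are hypothetical, the ChromeDriver string is an illustrative stand-in for the real chromedriver --version output, and comparing major versions alone is a simplification that only makes sense when both numbers follow the same versioning scheme.

```python
import re


def parse_version(text):
    """Return the first dotted version number found in text as a tuple of ints."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def same_major(chrome_output, chromedriver_output):
    """True when both version strings share the same major version."""
    return parse_version(chrome_output)[0] == parse_version(chromedriver_output)[0]


if __name__ == "__main__":
    # Values taken from the commands above; the driver string is an assumed format.
    chrome = "Google Chrome 86.0.4240.198"
    driver = "ChromeDriver 86.0.4240.22"
    print(parse_version(chrome), parse_version(driver), same_major(chrome, driver))
```

For the versions used here, both strings parse to major version 86, so the check passes.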

I install python36 and selenium with:

sudo yum install python36 -y
sudo /usr/bin/pip-3.6 install selenium

Then run the script:

import os
import selenium
from selenium import webdriver

CURR_PATH = os.getcwd()
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--headless')
chrome_options.add_argument('--window-size=1280x1696')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--hide-scrollbars')
chrome_options.add_argument('--enable-logging')
chrome_options.add_argument('--log-level=0')
chrome_options.add_argument('--v=99')
chrome_options.add_argument('--single-process')
chrome_options.add_argument('--ignore-certificate-errors')
chrome_options.add_argument('--remote-debugging-port=9222')
chrome_options.binary_location = f"{CURR_PATH}/chrome/google-chrome"
driver = webdriver.Chrome(
    executable_path = f"{CURR_PATH}/chromedriver",
    chrome_options=chrome_options
)
driver.get("https://www.google.com/")
html = driver.page_source
print(html)

This works.

Lambda

I then zip my chromedriver and Chrome files:

mkdir tmp
mv chromedriver tmp
mv chrome tmp
cd tmp
zip -r9 ../chrome.zip chromedriver chrome

And copy the zipped file to an S3 bucket.

This is my Lambda function:

import os
import boto3
from botocore.exceptions import ClientError
import zipfile
import selenium
from selenium import webdriver

s3 = boto3.resource('s3')

def handler(event, context):
    chrome_bucket = os.environ.get('CHROME_S3_BUCKET')
    chrome_key = os.environ.get('CHROME_S3_KEY')
    # DOWNLOAD HEADLESS CHROME FROM S3
    try:    
        # with open('/tmp/headless_chrome.zip', 'wb') as data:
        s3.meta.client.download_file(chrome_bucket, chrome_key, '/tmp/chrome.zip')
        print(os.listdir('/tmp'))
    except ClientError as e:
        raise e
    # UNZIP HEADLESS CHROME
    try:
        with zipfile.ZipFile('/tmp/chrome.zip', 'r') as zip_ref:
            zip_ref.extractall('/tmp')
        # FREE UP SPACE
        os.remove('/tmp/chrome.zip')
        print(os.listdir('/tmp'))
    except:
        raise ValueError('Problem with unzipping Chrome executable')
    # CHANGE PERMISSION OF CHROME
    try:
        os.chmod('/tmp/chromedriver', 0o775)
        os.chmod('/tmp/chrome/chrome', 0o775)
        os.chmod('/tmp/chrome/google-chrome', 0o775)
    except:
        raise ValueError('Problem with changing permissions to Chrome executable')
    # GET LINKS
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--window-size=1280x1696')
    chrome_options.add_argument('--disable-gpu')
    chrome_options.add_argument('--disable-dev-shm-usage')
    chrome_options.add_argument('--hide-scrollbars')
    chrome_options.add_argument('--enable-logging')
    chrome_options.add_argument('--log-level=0')
    chrome_options.add_argument('--v=99')
    chrome_options.add_argument('--single-process')
    chrome_options.add_argument('--ignore-certificate-errors')
    chrome_options.add_argument('--remote-debugging-port=9222')
    chrome_options.binary_location = "/tmp/chrome/google-chrome"
    driver = webdriver.Chrome(
        executable_path = "/tmp/chromedriver",
        chrome_options=chrome_options
    )
    driver.get("https://www.google.com/")
    html = driver.page_source
    print(html)

I'm able to see my unzipped files in the /tmp path.

And my error:
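
Seeing the files listed in /tmp does not by itself show that they are complete or executable, and Python's zipfile.extractall does not restore Unix permission bits, which is why the chmod calls are needed. A minimal sketch of a more detailed check follows; describe_files is a hypothetical helper, and the paths in the usage part are the ones assumed from the handler above.

```python
import os
import stat


def describe_files(paths):
    """Return one dict per path with existence, size, mode string and executable flag."""
    report = []
    for path in paths:
        if not os.path.exists(path):
            report.append({"path": path, "exists": False})
            continue
        info = os.stat(path)
        report.append({
            "path": path,
            "exists": True,
            "size": info.st_size,
            "mode": stat.filemode(info.st_mode),
            "executable": os.access(path, os.X_OK),
        })
    return report


if __name__ == "__main__":
    # Paths as used in the Lambda handler above.
    for entry in describe_files(["/tmp/chromedriver", "/tmp/chrome/chrome",
                                 "/tmp/chrome/google-chrome"]):
        print(entry)
```

Printing this report right after the chmod step would show whether any binary is missing, truncated, or still not executable.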

{
  "errorMessage": "Message: unknown error: unable to discover open pages\n",
  "errorType": "WebDriverException",
  "stackTrace": [
    [
      "/var/task/lib/observer.py",
      69,
      "handler",
      "chrome_options=chrome_options"
    ],
    [
      "/var/task/selenium/webdriver/chrome/webdriver.py",
      81,
      "__init__",
      "desired_capabilities=desired_capabilities)"
    ],
    [
      "/var/task/selenium/webdriver/remote/webdriver.py",
      157,
      "__init__",
      "self.start_session(capabilities, browser_profile)"
    ],
    [
      "/var/task/selenium/webdriver/remote/webdriver.py",
      252,
      "start_session",
      "response = self.execute(Command.NEW_SESSION, parameters)"
    ],
    [
      "/var/task/selenium/webdriver/remote/webdriver.py",
      321,
      "execute",
      "self.error_handler.check_response(response)"
    ],
    [
      "/var/task/selenium/webdriver/remote/errorhandler.py",
      242,
      "check_response",
      "raise exception_class(message, screen, stacktrace)"
    ]
  ]
}

EDIT: I am willing to try out anything at this point. Different versions of Chrome or Chromium, Chromedriver, Python or Selenium.

EDIT2: The answer below did not solve the problem.

This error message...

"errorMessage": "Message: unknown error: unable to discover open pages\n",
"errorType": "WebDriverException"

...implies that ChromeDriver was unable to initiate/spawn a new Browsing Context, i.e. a Chrome Browser session.
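
One way to narrow this down is to check whether the browser binary starts at all in the target environment, independently of ChromeDriver. The sketch below runs a binary with --version and returns its exit code and output; probe_binary is a hypothetical helper, and whether the browser fails to start on its own (for example because of missing shared libraries) is an assumption to be tested, not something the error message confirms.

```python
import subprocess
import sys


def probe_binary(path, args=("--version",), timeout=30):
    """Run a binary directly and return (exit_code, combined stdout/stderr text)."""
    try:
        completed = subprocess.run(
            [path, *args],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Covers a missing file, a non-executable file, and a hung process.
        return None, f"{type(exc).__name__}: {exc}"
    return completed.returncode, completed.stdout.decode(errors="replace").strip()


if __name__ == "__main__":
    # Demonstrated on the Python interpreter itself; inside the Lambda handler
    # one would pass "/tmp/chrome/google-chrome" or "/tmp/chromedriver" instead.
    print(probe_binary(sys.executable))
```

A non-zero exit code or an error in the output at this stage would point at the binary or its environment instead of at the Selenium options.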

It seems the issue is with Chrome's sandboxing security feature.


Rule of thumb

A common cause for Chrome to crash during startup is running Chrome as root user ( administrator ) on Linux. While it is possible to work around this issue by passing the --no-sandbox flag when creating your WebDriver session, such a configuration is unsupported and highly discouraged. You need to configure your environment to run Chrome as a regular user instead.
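
Whether this rule applies can be checked from inside the process that launches Chrome. A minimal sketch, assuming a POSIX system; running_as_root is a hypothetical helper name.

```python
import os


def running_as_root():
    """True when the current process has effective user ID 0 (POSIX only)."""
    geteuid = getattr(os, "geteuid", None)
    return geteuid is not None and geteuid() == 0


if __name__ == "__main__":
    if running_as_root():
        print("Running as root: Chrome's sandbox is likely to refuse to start.")
    else:
        print("Running as a regular user.")
```

Logging this once at the start of the script or handler shows which of the two situations you are actually in.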


Details

A few more details about your use case would have helped us analyze the arguments you used and the root cause of the error. However, a few thoughts:

  • What is the sandbox?: The sandbox is a C++ library that allows the creation of sandboxed processes, that is, processes that execute within a very restrictive environment. The only resources sandboxed processes can freely use are CPU cycles and memory. For example, sandboxed processes cannot write to disk or display their own windows. What exactly they can do is controlled by an explicit policy. Chromium renderers are sandboxed processes.
  • What does and doesn't it protect against?: The sandbox limits the severity of bugs in code running inside the sandbox. Such bugs cannot install persistent malware in the user's account (because writing to the filesystem is banned). Such bugs also cannot read and steal arbitrary files from the user's machine. (In Chromium, the renderer processes are sandboxed and have this protection. After the NPAPI removal, all remaining plugins are also sandboxed. Also note that Chromium renderer processes are isolated from the system, but not yet from the web. Therefore, domain-based data isolation is not yet provided.) The sandbox cannot provide any protection against bugs in system components such as the kernel it is running on.
  • So how can a sandboxed process such as a renderer accomplish anything?: Certain communication channels are explicitly open for the sandboxed processes; the processes can write to and read from these channels. A more privileged process can use these channels to do certain actions on behalf of the sandboxed process. In Chromium, the privileged process is usually the browser process.

So you may need to drop the --no-sandbox option. Here is the link to the Sandbox story.


Additional Considerations

Some more considerations:

  • While using the --headless option you won't be able to use --window-size=1280x1696 due to certain constraints.

You can find a relevant detailed discussion in "ERROR:gpu_process_transport_factory.cc(1007)-Lost UI shared context : while initializing Chrome browser through ChromeDriver in Headless mode".

  • Further, you haven't mentioned any specific requirement for using the --disable-dev-shm-usage , --hide-scrollbars , --enable-logging , --log-level=0 , --v=99 , --single-process and --remote-debugging-port=9222 arguments, so you could drop them for the time being and add them back as your Test Specification requires.
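
The "drop them and add them back" approach can be sketched as a small helper that starts from a baseline and enables the optional arguments one at a time. The names BASELINE_ARGS, OPTIONAL_ARGS and build_args are hypothetical, and treating --headless alone as the baseline is an assumption made for illustration.

```python
# Assumed baseline: only --headless. Adjust to whatever your setup requires.
BASELINE_ARGS = ["--headless"]

# The arguments listed above as candidates to drop for now.
OPTIONAL_ARGS = [
    "--disable-dev-shm-usage",
    "--hide-scrollbars",
    "--enable-logging",
    "--log-level=0",
    "--v=99",
    "--single-process",
    "--remote-debugging-port=9222",
]


def build_args(enabled_optional=()):
    """Return the baseline plus the selected optional arguments, in a stable order."""
    unknown = set(enabled_optional) - set(OPTIONAL_ARGS)
    if unknown:
        raise ValueError(f"unknown optional arguments: {sorted(unknown)}")
    return BASELINE_ARGS + [arg for arg in OPTIONAL_ARGS if arg in enabled_optional]


if __name__ == "__main__":
    # With Selenium, each entry would be passed to chrome_options.add_argument(arg).
    print(build_args())
    print(build_args(["--single-process", "--disable-dev-shm-usage"]))
```

Re-enabling one argument per run makes it easier to see which one, if any, changes the behavior.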

I was finally able to get it to work with:

Python 3.7
selenium==3.14.0
headless-chromium v1.0.0-55
chromedriver 2.43

Headless-Chromium

https://github.com/adieuadieu/serverless-chrome/releases/download/v1.0.0-55/stable-headless-chromium-amazonlinux-2017-03.zip

Chromedriver

https://chromedriver.storage.googleapis.com/2.43/chromedriver_linux64.zip

I added headless-chromium and chromedriver to a Lambda Layer.

Permissions of 755 on both files work.
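
If the layer archive is built with Python instead of the zip command, the 755 mode can be recorded in the archive itself. This is a minimal sketch under that assumption: build_layer_zip and recorded_modes are hypothetical helpers, the file names in the usage part are the two binaries named above, and it reads each file fully into memory, which is acceptable for a one-off packaging step.

```python
import os
import zipfile


def build_layer_zip(zip_path, files, mode=0o755):
    """Write files at the archive root, recording `mode` as each entry's Unix mode."""
    with zipfile.ZipFile(zip_path, "w") as archive:
        for source in files:
            info = zipfile.ZipInfo.from_file(source, arcname=os.path.basename(source))
            # The upper 16 bits of external_attr hold the Unix file type and mode.
            info.external_attr = (0o100000 | mode) << 16
            with open(source, "rb") as handle:
                archive.writestr(info, handle.read(),
                                 compress_type=zipfile.ZIP_DEFLATED)
    return zip_path


def recorded_modes(zip_path):
    """Return {entry name: permission bits} as stored in the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return {item.filename: (item.external_attr >> 16) & 0o777
                for item in archive.infolist()}


if __name__ == "__main__":
    # Assumes both binaries are in the current directory.
    build_layer_zip("layer.zip", ["headless-chromium", "chromedriver"])
    print(recorded_modes("layer.zip"))
```

Placing both files at the archive root matches the /opt/headless-chromium and /opt/chromedriver paths used in the function below.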

Lambda

The Lambda function looks like this:

import os
import selenium
from selenium import webdriver


def handler(event, context):
    # Confirm that the layer contents are visible under /opt
    print(os.listdir('/opt'))
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--single-process')
    chrome_options.add_argument('--disable-dev-shm-usage')
    chrome_options.binary_location = "/opt/headless-chromium"
    driver = webdriver.Chrome(
        executable_path="/opt/chromedriver",
        chrome_options=chrome_options
    )
    driver.get("https://www.google.com/")
    html = driver.page_source
    driver.close()
    driver.quit()
    print(html)

Hope this helps someone in Q4 2020 and after.

The answer of @CPak worked for me; I only had to copy headless-chromium and chromedriver to /tmp and grant permissions. The rest of the code is the same:

import os
from shutil import copyfile

def permissions(origin_path, destiny_path):
    # Copy the binary to the destination and make the copy executable
    copyfile(origin_path, destiny_path)
    os.chmod(destiny_path, 0o775)


def lambda_handler(event, context):
    permissions('/opt/chromedriver','/tmp/chromedriver')
    permissions('/opt/headless-chromium','/tmp/headless-chromium')

I'm a big fan of this answer because a few months ago it allowed me to run a serverless scraper on AWS Lambda properly. But a few days ago this implementation began to fail, and after hours and hours of searching I came to the conclusion that the binaries given here by @CPak (for Chrome version 69) are too old to run on "modern" websites.

I found in this GitHub repo a file called chromium.zip , which is the headless-chromium binary for version 86.0.4240.0. And here I downloaded the matching chromedriver. With these two files replacing the ones from @CPak's answer or from mine given previously, the implementation should work.

I'm still trying to find where to obtain the most recent versions of the headless-chromium binaries for when these versions stop working. When I find a source, I'll post it here.
