
Why do Selenium and geckodriver work when run using airflow test but raise an error when running inside a DAG run?

I'm running a web-scraping script using Python, Selenium and geckodriver. The problem is: when I run a task test with airflow test scrap_dag scrap_data 2020-01-01, everything works just fine and the file I want is downloaded correctly. However, when I trigger a DAG run from the Airflow web UI, the task fails.

At first Airflow couldn't access geckodriver.log, so I changed the path to one that is accessible. The error then changed to Firefox not being found. After that, I got an error saying my executable is not an executable. I'm still looking for possible solutions to any of those steps.

The PythonOperator code that runs fine when testing is the following:

from selenium import webdriver

options = webdriver.FirefoxOptions()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
options.add_argument('--window-size=1920x1080')

driver = webdriver.Firefox(executable_path='geckodriver path', log_path='log path', options=options)
driver.get(url)

EDIT: Adding the error messages for each situation.

Setting the executable path and the log path raises: Expected browser binary location, but unable to find binary in default location, no 'moz:firefoxOptions.binary' capability provided, and no binary flag set on the command line.
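That error means geckodriver started but could not locate the Firefox browser itself. A minimal, stdlib-only sketch of a helper that probes common install locations before handing an explicit binary path to Selenium (the candidate paths are assumptions, not part of the original question):

```python
import os

def find_firefox_binary(candidates=("/usr/bin/firefox",
                                    "/usr/local/bin/firefox",
                                    "/snap/bin/firefox")):
    """Return the first candidate that exists and is executable, else None.

    Passing the resolved path explicitly (e.g. via FirefoxBinary or
    options.binary_location) sidesteps geckodriver's default lookup,
    which fails under the scheduler's stripped-down environment.
    """
    for path in candidates:
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    return None
```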

I also tried this:

from selenium.webdriver.firefox.firefox_binary import FirefoxBinary

options = webdriver.FirefoxOptions()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
options.add_argument('--window-size=1920x1080')
firefox_binary = FirefoxBinary('/usr/bin/firefox')

driver = webdriver.Firefox(firefox_binary=firefox_binary, log_path='log path', options=options)
driver.get(url)

With that I get 'geckodriver' executable needs to be in PATH. I already added it to PATH, but maybe I'm doing something wrong. This variant also breaks the code when running airflow test.
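One likely cause of the airflow test vs. DAG-run difference: airflow test runs in your interactive shell and inherits its PATH, while tasks launched by the scheduler service get the (usually much shorter) PATH of the service unit. A small stdlib-only diagnostic you could call from inside the task to compare the two environments (the function name is mine, not from the original code):

```python
import os
import shutil

def diagnose_environment(binaries=("geckodriver", "firefox")):
    """Map each binary name to its resolved path on PATH (or None),
    and include the PATH the process actually sees.

    Logging this from inside the PythonOperator callable shows
    immediately whether the scheduler-launched process can see the
    same binaries as your interactive shell.
    """
    report = {name: shutil.which(name) for name in binaries}
    report["PATH"] = os.environ.get("PATH", "")
    return report
```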

Stack traces:

When running using geckodriver (the original implementation):

*** Reading local file: /home/observatorio/airflow/logs/energia_cg_dag/scrap_dados/2021-02-04T16:05:06.663528+00:00/1.log
[2021-02-04 13:05:17,474] {taskinstance.py:670} INFO - Dependencies all met for <TaskInstance: energia_cg_dag.scrap_dados 2021-02-04T16:05:06.663528+00:00 [queued]>
[2021-02-04 13:05:17,486] {taskinstance.py:670} INFO - Dependencies all met for <TaskInstance: energia_cg_dag.scrap_dados 2021-02-04T16:05:06.663528+00:00 [queued]>
[2021-02-04 13:05:17,486] {taskinstance.py:880} INFO - 
--------------------------------------------------------------------------------
[2021-02-04 13:05:17,486] {taskinstance.py:881} INFO - Starting attempt 1 of 2
[2021-02-04 13:05:17,486] {taskinstance.py:882} INFO - 
--------------------------------------------------------------------------------
[2021-02-04 13:05:17,497] {taskinstance.py:901} INFO - Executing <Task(PythonOperator): scrap_dados> on 2021-02-04T16:05:06.663528+00:00
[2021-02-04 13:05:17,503] {standard_task_runner.py:54} INFO - Started process 5151 to run task
[2021-02-04 13:05:17,542] {standard_task_runner.py:77} INFO - Running: ['airflow', 'run', 'energia_cg_dag', 'scrap_dados', '2021-02-04T16:05:06.663528+00:00', '--job_id', '2281', '--pool', 'default_pool', '--raw', '-sd', 'DAGS_FOLDER/energia_cg_dag.py', '--cfg_path', '/tmp/tmpgzryaz90']
[2021-02-04 13:05:17,543] {standard_task_runner.py:78} INFO - Job 2281: Subtask scrap_dados
[2021-02-04 13:05:17,565] {logging_mixin.py:112} INFO - Running %s on host %s <TaskInstance: energia_cg_dag.scrap_dados 2021-02-04T16:05:06.663528+00:00 [running]> ci-dobser-51091
[2021-02-04 13:05:19,534] {taskinstance.py:1150} ERROR - Message: Expected browser binary location, but unable to find binary in default location, no 'moz:firefoxOptions.binary' capability provided, and no binary flag set on the command line
Traceback (most recent call last):
  File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 984, in _run_raw_task
    result = task_copy.execute(context=context)
  File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/airflow/operators/python_operator.py", line 113, in execute
    return_value = self.execute_callable()
  File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/airflow/operators/python_operator.py", line 118, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/home/observatorio/projetos/Chico-2.0/energia/capacidade_geracao/web_scraping.py", line 47, in scrap_dados
    driver = webdriver.Firefox(executable_path='/home/observatorio/projetos/Chico-2.0/utils/drivers/geckodriver', log_path='/tmp/geckodriver.log', options=options)
  File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/selenium/webdriver/firefox/webdriver.py", line 174, in __init__
    keep_alive=True)
  File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 157, in __init__
    self.start_session(capabilities, browser_profile)
  File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 252, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.SessionNotCreatedException: Message: Expected browser binary location, but unable to find binary in default location, no 'moz:firefoxOptions.binary' capability provided, and no binary flag set on the command line

[2021-02-04 13:05:19,570] {taskinstance.py:1194} INFO - Marking task as UP_FOR_RETRY. dag_id=energia_cg_dag, task_id=scrap_dados, execution_date=20210204T160506, start_date=20210204T160517, end_date=20210204T160519
[2021-02-04 13:05:22,470] {local_task_job.py:102} INFO - Task exited with return code 1

When passing the Firefox binary to the webdriver:

*** Reading local file: /home/observatorio/airflow/logs/energia_cg_dag/scrap_dados/2021-02-04T16:48:29.730315+00:00/1.log
[2021-02-04 13:48:38,335] {taskinstance.py:670} INFO - Dependencies all met for <TaskInstance: energia_cg_dag.scrap_dados 2021-02-04T16:48:29.730315+00:00 [queued]>
[2021-02-04 13:48:38,346] {taskinstance.py:670} INFO - Dependencies all met for <TaskInstance: energia_cg_dag.scrap_dados 2021-02-04T16:48:29.730315+00:00 [queued]>
[2021-02-04 13:48:38,346] {taskinstance.py:880} INFO - 
--------------------------------------------------------------------------------
[2021-02-04 13:48:38,346] {taskinstance.py:881} INFO - Starting attempt 1 of 2
[2021-02-04 13:48:38,346] {taskinstance.py:882} INFO - 
--------------------------------------------------------------------------------
[2021-02-04 13:48:38,355] {taskinstance.py:901} INFO - Executing <Task(PythonOperator): scrap_dados> on 2021-02-04T16:48:29.730315+00:00
[2021-02-04 13:48:38,357] {standard_task_runner.py:54} INFO - Started process 31798 to run task
[2021-02-04 13:48:38,371] {standard_task_runner.py:77} INFO - Running: ['airflow', 'run', 'energia_cg_dag', 'scrap_dados', '2021-02-04T16:48:29.730315+00:00', '--job_id', '2284', '--pool', 'default_pool', '--raw', '-sd', 'DAGS_FOLDER/energia_cg_dag.py', '--cfg_path', '/tmp/tmpz3klaxu5']
[2021-02-04 13:48:38,371] {standard_task_runner.py:78} INFO - Job 2284: Subtask scrap_dados
[2021-02-04 13:48:38,391] {logging_mixin.py:112} INFO - Running %s on host %s <TaskInstance: energia_cg_dag.scrap_dados 2021-02-04T16:48:29.730315+00:00 [running]> ci-dobser-51091
[2021-02-04 13:48:39,291] {taskinstance.py:1150} ERROR - Message: 'geckodriver' executable needs to be in PATH. 
Traceback (most recent call last):
  File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/selenium/webdriver/common/service.py", line 76, in start
    stdin=PIPE)
  File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/subprocess.py", line 800, in __init__
    restore_signals, start_new_session)
  File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/subprocess.py", line 1551, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'geckodriver': 'geckodriver'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 984, in _run_raw_task
    result = task_copy.execute(context=context)
  File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/airflow/operators/python_operator.py", line 113, in execute
    return_value = self.execute_callable()
  File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/airflow/operators/python_operator.py", line 118, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/home/observatorio/projetos/Chico-2.0/energia/capacidade_geracao/web_scraping.py", line 48, in scrap_dados
    driver = webdriver.Firefox(firefox_binary=firefox_binary, log_path='/tmp/geckodriver.log', options=options)
  File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/selenium/webdriver/firefox/webdriver.py", line 164, in __init__
    self.service.start()
  File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/selenium/webdriver/common/service.py", line 83, in start
    os.path.basename(self.path), self.start_error_message)
selenium.common.exceptions.WebDriverException: Message: 'geckodriver' executable needs to be in PATH. 

[2021-02-04 13:48:39,310] {taskinstance.py:1194} INFO - Marking task as UP_FOR_RETRY. dag_id=energia_cg_dag, task_id=scrap_dados, execution_date=20210204T164829, start_date=20210204T164838, end_date=20210204T164839
[2021-02-04 13:48:43,340] {local_task_job.py:102} INFO - Task exited with return code 1

When passing both the driver path and the Firefox binary path:

*** Reading local file: /home/observatorio/airflow/logs/energia_cg_dag/scrap_dados/2021-02-04T17:27:33.734991+00:00/1.log
[2021-02-04 14:27:46,858] {taskinstance.py:670} INFO - Dependencies all met for <TaskInstance: energia_cg_dag.scrap_dados 2021-02-04T17:27:33.734991+00:00 [queued]>
[2021-02-04 14:27:46,876] {taskinstance.py:670} INFO - Dependencies all met for <TaskInstance: energia_cg_dag.scrap_dados 2021-02-04T17:27:33.734991+00:00 [queued]>
[2021-02-04 14:27:46,876] {taskinstance.py:880} INFO - 
--------------------------------------------------------------------------------
[2021-02-04 14:27:46,876] {taskinstance.py:881} INFO - Starting attempt 1 of 2
[2021-02-04 14:27:46,876] {taskinstance.py:882} INFO - 
--------------------------------------------------------------------------------
[2021-02-04 14:27:46,888] {taskinstance.py:901} INFO - Executing <Task(PythonOperator): scrap_dados> on 2021-02-04T17:27:33.734991+00:00
[2021-02-04 14:27:46,890] {standard_task_runner.py:54} INFO - Started process 58187 to run task
[2021-02-04 14:27:46,907] {standard_task_runner.py:77} INFO - Running: ['airflow', 'run', 'energia_cg_dag', 'scrap_dados', '2021-02-04T17:27:33.734991+00:00', '--job_id', '2302', '--pool', 'default_pool', '--raw', '-sd', 'DAGS_FOLDER/energia_cg_dag.py', '--cfg_path', '/tmp/tmpyetrn10i']
[2021-02-04 14:27:46,907] {standard_task_runner.py:78} INFO - Job 2302: Subtask scrap_dados
[2021-02-04 14:27:46,930] {logging_mixin.py:112} INFO - Running %s on host %s <TaskInstance: energia_cg_dag.scrap_dados 2021-02-04T17:27:33.734991+00:00 [running]> ci-dobser-51091
[2021-02-04 14:27:47,719] {logging_mixin.py:112} INFO - 
[2021-02-04 14:27:47,720] {logging_mixin.py:112} INFO - 
[2021-02-04 14:27:47,720] {logging_mixin.py:112} WARNING - [WDM] - ====== WebDriver manager ======
[2021-02-04 14:27:47,720] {logger.py:22} INFO - ====== WebDriver manager ======
[2021-02-04 14:27:48,073] {logging_mixin.py:112} WARNING - [WDM] - Driver [/home/observatorio/.wdm/drivers/geckodriver/linux64/v0.29.0/geckodriver] found in cache
[2021-02-04 14:27:48,073] {logger.py:12} INFO - Driver [/home/observatorio/.wdm/drivers/geckodriver/linux64/v0.29.0/geckodriver] found in cache
[2021-02-04 14:27:48,181] {taskinstance.py:1150} ERROR - Message: Process unexpectedly closed with status 127
Traceback (most recent call last):
  File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 984, in _run_raw_task
    result = task_copy.execute(context=context)
  File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/airflow/operators/python_operator.py", line 113, in execute
    return_value = self.execute_callable()
  File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/airflow/operators/python_operator.py", line 118, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/home/observatorio/projetos/Chico-2.0/energia/capacidade_geracao/web_scraping.py", line 50, in scrap_dados
    driver = webdriver.Firefox(executable_path=GeckoDriverManager().install(), firefox_binary=firefox_binary, log_path='/tmp/geckodriver.log', options=options)
  File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/selenium/webdriver/firefox/webdriver.py", line 174, in __init__
    keep_alive=True)
  File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 157, in __init__
    self.start_session(capabilities, browser_profile)
  File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 252, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: Process unexpectedly closed with status 127

[2021-02-04 14:27:48,183] {taskinstance.py:1194} INFO - Marking task as UP_FOR_RETRY. dag_id=energia_cg_dag, task_id=scrap_dados, execution_date=20210204T172733, start_date=20210204T172746, end_date=20210204T172748
[2021-02-04 14:27:51,862] {local_task_job.py:102} INFO - Task exited with return code 1

So, I was finally able to make it work. The solution I found is more of a workaround than anything, but it solves my problem for now.

Since my code was working outside Airflow, I decided to run it from a BashOperator instead of a PythonOperator, as I had seen people suggest for similar problems.

At first I had trouble running bash at all, getting the error message [Errno 2] No such file or directory: 'bash': 'bash'. This happened because the correct configuration was missing from my airflow-scheduler.service file.

What solved that for me was this answer: Airflow BashOperator can't find Bash. After I appended :/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin to the PATH in my service file and restarted the scheduler, bash started working.
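For reference, a minimal sketch of what the relevant part of the service file can look like. The unit file path, the conda environment location, and prepending the env's bin directory are assumptions based on the paths in the logs above, not something stated in the original; adjust to your installation, then run systemctl daemon-reload and restart the scheduler:

```ini
# /etc/systemd/system/airflow-scheduler.service (illustrative fragment)
[Service]
# Prepend the conda env's bin dir so geckodriver/firefox/bash resolve;
# the trailing directories are the ones from the linked answer.
Environment="PATH=/home/observatorio/anaconda3/envs/airflow/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin"
ExecStart=/home/observatorio/anaconda3/envs/airflow/bin/airflow scheduler
```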

The Selenium code now looks like this:

from webdriver_manager.firefox import GeckoDriverManager
from selenium import webdriver

options = webdriver.FirefoxOptions()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
options.add_argument('--window-size=1920x1080')

driver = webdriver.Firefox(executable_path=GeckoDriverManager().install(), log_path='/tmp/geckodriver.log', options=options)
driver.get(url)

I'm using the webdriver-manager library now only because I want to avoid absolute paths in my code as much as possible; the original code works as it is.

As for the task instantiation, I'm doing it like this:

scrap_data_task = BashOperator(task_id='scrap_data', bash_command='python /absolute/path/to/python/script/web_scraping.py')
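Since the bash_command string is handed to a shell, an absolute path containing spaces would silently split into separate arguments. A tiny stdlib helper (the function name is mine, purely illustrative) that quotes the pieces before building the command:

```python
import shlex

def build_bash_command(script_path, python_bin="python"):
    """Build a safely quoted command string for a BashOperator's bash_command.

    shlex.quote leaves plain paths untouched and single-quotes anything
    containing spaces or shell metacharacters.
    """
    return f"{shlex.quote(python_bin)} {shlex.quote(script_path)}"
```

For example, build_bash_command('/absolute/path/to/python/script/web_scraping.py') produces the same command string used in the task above.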

