Why do Selenium and geckodriver work when run using airflow test but raise an error inside a DAG run?
I'm running a web scraping script using Python, Selenium and geckodriver. The problem is: when I run a task test using
airflow test scrap_dag scrap_data 2020-01-01
everything works just fine and the file I want is downloaded correctly. However, when I trigger a DAG run from the Airflow web UI, the task fails.
At first Airflow couldn't access geckodriver.log, so I changed the log path to one that is accessible. The error then changed to not being able to find Firefox. After that, I got an error saying that my executable is not an executable. I'm still looking for possible solutions for any of those steps.
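Since the same code behaves differently under airflow test and under a scheduled run, one way to narrow it down is to log what the task process actually sees. Below is a minimal sketch, assuming the difference comes from the process environment (which is an assumption at this point, not something the errors prove). The helper name describe_runtime_env is made up for illustration; it uses only the standard library and could be called at the top of the Python callable.

```python
import os
import shutil


def describe_runtime_env(commands=("firefox", "geckodriver", "bash")):
    """Collect the parts of the process environment that affect binary lookup.

    Hypothetical helper: call it inside the task and compare the output of
    `airflow test` with the output of a scheduler-triggered run.
    """
    report = {
        "USER": os.environ.get("USER"),
        "HOME": os.environ.get("HOME"),
        "cwd": os.getcwd(),
        "PATH": os.environ.get("PATH", ""),
    }
    for command in commands:
        # shutil.which returns None when the command is not on PATH.
        report["which:" + command] = shutil.which(command)
    return report


if __name__ == "__main__":
    for key, value in describe_runtime_env().items():
        print(f"{key} = {value}")
```

If the two runs print different PATH values, or firefox resolves in one and not in the other, that points at the environment of the scheduler process rather than at Selenium itself.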
The PythonOperator code that runs fine when testing is the following:
from selenium import webdriver
options = webdriver.FirefoxOptions()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
options.add_argument('--window-size=1920x1080')
driver = webdriver.Firefox(executable_path='geckodriver path', log_path='log path', options=options)
driver.get(url)
EDIT: Adding the error messages for each situation.
Setting executable path and log path:
Expected browser binary location, but unable to find binary in default location, no 'moz:firefoxOptions.binary' capability provided, and no binary flag set on the command line
Also tried this:
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
options = webdriver.FirefoxOptions()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
options.add_argument('--window-size=1920x1080')
firefox_binary = FirefoxBinary('/usr/bin/firefox')
driver = webdriver.Firefox(firefox_binary=firefox_binary, log_path='log path', options=options)
driver.get(url)
And I get:
'geckodriver' executable needs to be in PATH.
I already added it to PATH, but maybe I'm doing something wrong. Using this method also breaks the code when running airflow test.
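Both the "needs to be in PATH" error and the earlier "not an executable" error can be caught before Selenium is involved. Here is a minimal sketch, assuming the driver is either given as an explicit path or expected on PATH. The function name resolve_driver is hypothetical; the result would be passed as executable_path to webdriver.Firefox.

```python
import os
import shutil
from typing import Optional


def resolve_driver(explicit_path: Optional[str] = None,
                   name: str = "geckodriver") -> str:
    """Return a usable driver path or raise with a message that shows why not.

    Hypothetical helper, not part of Selenium.
    """
    if explicit_path is not None:
        # The path must be a regular file with the executable bit set.
        if os.path.isfile(explicit_path) and os.access(explicit_path, os.X_OK):
            return os.path.abspath(explicit_path)
        raise FileNotFoundError(f"{explicit_path!r} is not an executable file")

    # Fall back to a PATH lookup, using the PATH of the current process.
    found = shutil.which(name)
    if found is None:
        current_path = os.environ.get("PATH", "")
        raise FileNotFoundError(f"{name!r} not found on PATH={current_path!r}")
    return found
```

The error message includes the PATH of the running process, which is the value that matters here: a PATH exported in a login shell is not necessarily the one the scheduler service runs with.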
Stack traces:
When running using geckodriver (the original implementation):
*** Reading local file: /home/observatorio/airflow/logs/energia_cg_dag/scrap_dados/2021-02-04T16:05:06.663528+00:00/1.log
[2021-02-04 13:05:17,474] {taskinstance.py:670} INFO - Dependencies all met for <TaskInstance: energia_cg_dag.scrap_dados 2021-02-04T16:05:06.663528+00:00 [queued]>
[2021-02-04 13:05:17,486] {taskinstance.py:670} INFO - Dependencies all met for <TaskInstance: energia_cg_dag.scrap_dados 2021-02-04T16:05:06.663528+00:00 [queued]>
[2021-02-04 13:05:17,486] {taskinstance.py:880} INFO -
--------------------------------------------------------------------------------
[2021-02-04 13:05:17,486] {taskinstance.py:881} INFO - Starting attempt 1 of 2
[2021-02-04 13:05:17,486] {taskinstance.py:882} INFO -
--------------------------------------------------------------------------------
[2021-02-04 13:05:17,497] {taskinstance.py:901} INFO - Executing <Task(PythonOperator): scrap_dados> on 2021-02-04T16:05:06.663528+00:00
[2021-02-04 13:05:17,503] {standard_task_runner.py:54} INFO - Started process 5151 to run task
[2021-02-04 13:05:17,542] {standard_task_runner.py:77} INFO - Running: ['airflow', 'run', 'energia_cg_dag', 'scrap_dados', '2021-02-04T16:05:06.663528+00:00', '--job_id', '2281', '--pool', 'default_pool', '--raw', '-sd', 'DAGS_FOLDER/energia_cg_dag.py', '--cfg_path', '/tmp/tmpgzryaz90']
[2021-02-04 13:05:17,543] {standard_task_runner.py:78} INFO - Job 2281: Subtask scrap_dados
[2021-02-04 13:05:17,565] {logging_mixin.py:112} INFO - Running %s on host %s <TaskInstance: energia_cg_dag.scrap_dados 2021-02-04T16:05:06.663528+00:00 [running]> ci-dobser-51091
[2021-02-04 13:05:19,534] {taskinstance.py:1150} ERROR - Message: Expected browser binary location, but unable to find binary in default location, no 'moz:firefoxOptions.binary' capability provided, and no binary flag set on the command line
Traceback (most recent call last):
File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 984, in _run_raw_task
result = task_copy.execute(context=context)
File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/airflow/operators/python_operator.py", line 113, in execute
return_value = self.execute_callable()
File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/airflow/operators/python_operator.py", line 118, in execute_callable
return self.python_callable(*self.op_args, **self.op_kwargs)
File "/home/observatorio/projetos/Chico-2.0/energia/capacidade_geracao/web_scraping.py", line 47, in scrap_dados
driver = webdriver.Firefox(executable_path='/home/observatorio/projetos/Chico-2.0/utils/drivers/geckodriver', log_path='/tmp/geckodriver.log', options=options)
File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/selenium/webdriver/firefox/webdriver.py", line 174, in __init__
keep_alive=True)
File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 157, in __init__
self.start_session(capabilities, browser_profile)
File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 252, in start_session
response = self.execute(Command.NEW_SESSION, parameters)
File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.SessionNotCreatedException: Message: Expected browser binary location, but unable to find binary in default location, no 'moz:firefoxOptions.binary' capability provided, and no binary flag set on the command line
[2021-02-04 13:05:19,570] {taskinstance.py:1194} INFO - Marking task as UP_FOR_RETRY. dag_id=energia_cg_dag, task_id=scrap_dados, execution_date=20210204T160506, start_date=20210204T160517, end_date=20210204T160519
[2021-02-04 13:05:22,470] {local_task_job.py:102} INFO - Task exited with return code 1
Running using the Firefox binary as argument for the webdriver:
*** Reading local file: /home/observatorio/airflow/logs/energia_cg_dag/scrap_dados/2021-02-04T16:48:29.730315+00:00/1.log
[2021-02-04 13:48:38,335] {taskinstance.py:670} INFO - Dependencies all met for <TaskInstance: energia_cg_dag.scrap_dados 2021-02-04T16:48:29.730315+00:00 [queued]>
[2021-02-04 13:48:38,346] {taskinstance.py:670} INFO - Dependencies all met for <TaskInstance: energia_cg_dag.scrap_dados 2021-02-04T16:48:29.730315+00:00 [queued]>
[2021-02-04 13:48:38,346] {taskinstance.py:880} INFO -
--------------------------------------------------------------------------------
[2021-02-04 13:48:38,346] {taskinstance.py:881} INFO - Starting attempt 1 of 2
[2021-02-04 13:48:38,346] {taskinstance.py:882} INFO -
--------------------------------------------------------------------------------
[2021-02-04 13:48:38,355] {taskinstance.py:901} INFO - Executing <Task(PythonOperator): scrap_dados> on 2021-02-04T16:48:29.730315+00:00
[2021-02-04 13:48:38,357] {standard_task_runner.py:54} INFO - Started process 31798 to run task
[2021-02-04 13:48:38,371] {standard_task_runner.py:77} INFO - Running: ['airflow', 'run', 'energia_cg_dag', 'scrap_dados', '2021-02-04T16:48:29.730315+00:00', '--job_id', '2284', '--pool', 'default_pool', '--raw', '-sd', 'DAGS_FOLDER/energia_cg_dag.py', '--cfg_path', '/tmp/tmpz3klaxu5']
[2021-02-04 13:48:38,371] {standard_task_runner.py:78} INFO - Job 2284: Subtask scrap_dados
[2021-02-04 13:48:38,391] {logging_mixin.py:112} INFO - Running %s on host %s <TaskInstance: energia_cg_dag.scrap_dados 2021-02-04T16:48:29.730315+00:00 [running]> ci-dobser-51091
[2021-02-04 13:48:39,291] {taskinstance.py:1150} ERROR - Message: 'geckodriver' executable needs to be in PATH.
Traceback (most recent call last):
File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/selenium/webdriver/common/service.py", line 76, in start
stdin=PIPE)
File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/subprocess.py", line 800, in __init__
restore_signals, start_new_session)
File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/subprocess.py", line 1551, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'geckodriver': 'geckodriver'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 984, in _run_raw_task
result = task_copy.execute(context=context)
File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/airflow/operators/python_operator.py", line 113, in execute
return_value = self.execute_callable()
File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/airflow/operators/python_operator.py", line 118, in execute_callable
return self.python_callable(*self.op_args, **self.op_kwargs)
File "/home/observatorio/projetos/Chico-2.0/energia/capacidade_geracao/web_scraping.py", line 48, in scrap_dados
driver = webdriver.Firefox(firefox_binary=firefox_binary, log_path='/tmp/geckodriver.log', options=options)
File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/selenium/webdriver/firefox/webdriver.py", line 164, in __init__
self.service.start()
File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/selenium/webdriver/common/service.py", line 83, in start
os.path.basename(self.path), self.start_error_message)
selenium.common.exceptions.WebDriverException: Message: 'geckodriver' executable needs to be in PATH.
[2021-02-04 13:48:39,310] {taskinstance.py:1194} INFO - Marking task as UP_FOR_RETRY. dag_id=energia_cg_dag, task_id=scrap_dados, execution_date=20210204T164829, start_date=20210204T164838, end_date=20210204T164839
[2021-02-04 13:48:43,340] {local_task_job.py:102} INFO - Task exited with return code 1
Passing the driver path and the Firefox binary path:
*** Reading local file: /home/observatorio/airflow/logs/energia_cg_dag/scrap_dados/2021-02-04T17:27:33.734991+00:00/1.log
[2021-02-04 14:27:46,858] {taskinstance.py:670} INFO - Dependencies all met for <TaskInstance: energia_cg_dag.scrap_dados 2021-02-04T17:27:33.734991+00:00 [queued]>
[2021-02-04 14:27:46,876] {taskinstance.py:670} INFO - Dependencies all met for <TaskInstance: energia_cg_dag.scrap_dados 2021-02-04T17:27:33.734991+00:00 [queued]>
[2021-02-04 14:27:46,876] {taskinstance.py:880} INFO -
--------------------------------------------------------------------------------
[2021-02-04 14:27:46,876] {taskinstance.py:881} INFO - Starting attempt 1 of 2
[2021-02-04 14:27:46,876] {taskinstance.py:882} INFO -
--------------------------------------------------------------------------------
[2021-02-04 14:27:46,888] {taskinstance.py:901} INFO - Executing <Task(PythonOperator): scrap_dados> on 2021-02-04T17:27:33.734991+00:00
[2021-02-04 14:27:46,890] {standard_task_runner.py:54} INFO - Started process 58187 to run task
[2021-02-04 14:27:46,907] {standard_task_runner.py:77} INFO - Running: ['airflow', 'run', 'energia_cg_dag', 'scrap_dados', '2021-02-04T17:27:33.734991+00:00', '--job_id', '2302', '--pool', 'default_pool', '--raw', '-sd', 'DAGS_FOLDER/energia_cg_dag.py', '--cfg_path', '/tmp/tmpyetrn10i']
[2021-02-04 14:27:46,907] {standard_task_runner.py:78} INFO - Job 2302: Subtask scrap_dados
[2021-02-04 14:27:46,930] {logging_mixin.py:112} INFO - Running %s on host %s <TaskInstance: energia_cg_dag.scrap_dados 2021-02-04T17:27:33.734991+00:00 [running]> ci-dobser-51091
[2021-02-04 14:27:47,719] {logging_mixin.py:112} INFO -
[2021-02-04 14:27:47,720] {logging_mixin.py:112} INFO -
[2021-02-04 14:27:47,720] {logging_mixin.py:112} WARNING - [WDM] - ====== WebDriver manager ======
[2021-02-04 14:27:47,720] {logger.py:22} INFO - ====== WebDriver manager ======
[2021-02-04 14:27:48,073] {logging_mixin.py:112} WARNING - [WDM] - Driver [/home/observatorio/.wdm/drivers/geckodriver/linux64/v0.29.0/geckodriver] found in cache
[2021-02-04 14:27:48,073] {logger.py:12} INFO - Driver [/home/observatorio/.wdm/drivers/geckodriver/linux64/v0.29.0/geckodriver] found in cache
[2021-02-04 14:27:48,181] {taskinstance.py:1150} ERROR - Message: Process unexpectedly closed with status 127
Traceback (most recent call last):
File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 984, in _run_raw_task
result = task_copy.execute(context=context)
File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/airflow/operators/python_operator.py", line 113, in execute
return_value = self.execute_callable()
File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/airflow/operators/python_operator.py", line 118, in execute_callable
return self.python_callable(*self.op_args, **self.op_kwargs)
File "/home/observatorio/projetos/Chico-2.0/energia/capacidade_geracao/web_scraping.py", line 50, in scrap_dados
driver = webdriver.Firefox(executable_path=GeckoDriverManager().install(), firefox_binary=firefox_binary, log_path='/tmp/geckodriver.log', options=options)
File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/selenium/webdriver/firefox/webdriver.py", line 174, in __init__
keep_alive=True)
File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 157, in __init__
self.start_session(capabilities, browser_profile)
File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 252, in start_session
response = self.execute(Command.NEW_SESSION, parameters)
File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "/home/observatorio/anaconda3/envs/airflow/lib/python3.7/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: Process unexpectedly closed with status 127
[2021-02-04 14:27:48,183] {taskinstance.py:1194} INFO - Marking task as UP_FOR_RETRY. dag_id=energia_cg_dag, task_id=scrap_dados, execution_date=20210204T172733, start_date=20210204T172746, end_date=20210204T172748
[2021-02-04 14:27:51,862] {local_task_job.py:102} INFO - Task exited with return code 1
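A note on the status 127 in the last trace: by shell convention, 127 is the exit status for "command not found" (the dynamic loader also uses it when a shared library is missing). That would be consistent with Firefox, or something its launcher calls, not being resolvable in the scheduler's environment, though this reading is an assumption and the log alone does not prove it. A small sketch that reproduces the status with a deliberately made-up command name, assuming a POSIX sh is available:

```python
import subprocess


def exit_status_for_missing_command() -> int:
    """Run a command that does not exist through sh and return its exit status."""
    completed = subprocess.run(
        ["sh", "-c", "this-command-does-not-exist-12345"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return completed.returncode


if __name__ == "__main__":
    print(exit_status_for_missing_command())
```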
So, I was finally able to make it work. The solution I found is more of a workaround than anything, but it solves my problem for now.
Since my code was working outside Airflow, I decided to try running it within a BashOperator instead of a PythonOperator, as I saw some people suggest for similar problems.
At first I had a problem running bash, getting the error message:
[Errno 2] No such file or directory: 'bash': 'bash'
This happened because I was missing the correct configuration inside my airflow-scheduler.service file.
What solved that for me was the answer in "Airflow BashOperator can't find Bash". After I added the path
:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin
to my service file and restarted my scheduler, bash started working.
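To illustrate why the service file's PATH matters: a child process can only find bash (or firefox, or geckodriver) in the directories listed in the PATH it inherits. The sketch below compares a lookup under a PATH that lacks the system directories with a lookup under the directories I added. The directory /nonexistent-dir-for-demo is a made-up placeholder and the helper name lookup is hypothetical.

```python
import shutil

# The directories added to the service file, without the leading ':'.
SERVICE_PATH = "/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin"


def lookup(command, path):
    """Resolve a command against an explicit PATH string instead of os.environ."""
    return shutil.which(command, path=path)


if __name__ == "__main__":
    # A PATH without the system directories cannot resolve bash.
    print("restricted PATH:", lookup("bash", "/nonexistent-dir-for-demo"))
    # With the system directories present, the lookup can succeed.
    print("service PATH:   ", lookup("bash", SERVICE_PATH))
```

This is probably also why the original PythonOperator version failed only under the scheduler: the task inherited the service's restricted PATH, while airflow test inherited the PATH of my shell. I did not go back and verify this with the PythonOperator, so treat it as a likely explanation rather than a confirmed one.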
The Selenium code now looks like this:
from webdriver_manager.firefox import GeckoDriverManager
from selenium import webdriver
options = webdriver.FirefoxOptions()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
options.add_argument('--window-size=1920x1080')
driver = webdriver.Firefox(executable_path=GeckoDriverManager().install(), log_path='/tmp/geckodriver.log', options=options)
driver.get(url)
I'm using the webdriver-manager library now only because I want to avoid absolute paths in my code as much as possible; the original code works as it is.
As for the task instantiation, I'm doing it like this:
scrap_data_task = BashOperator(task_id='scrap_data', bash_command='python /absolute/path/to/python/script/web_scraping.py')
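In the same spirit of avoiding absolute paths, the bash_command string could be built relative to the DAG file instead of hardcoded. This is a minimal sketch, assuming the script sits next to the DAG file (a hypothetical layout, not necessarily mine) and that the interpreter running the DAG file is the one with Selenium installed. The helper names script_next_to and build_bash_command are made up for illustration.

```python
import shlex
import sys
from pathlib import Path


def script_next_to(dag_file: str, relative: str) -> Path:
    """Resolve a script path relative to the directory of the DAG file."""
    return Path(dag_file).resolve().parent / relative


def build_bash_command(script, python: str = sys.executable) -> str:
    """Build a shell-safe command that runs the script with a given interpreter."""
    # shlex.quote protects against spaces or shell metacharacters in paths.
    return f"{shlex.quote(python)} {shlex.quote(str(script))}"


# Inside the DAG file this could be used as (hypothetical file name):
#   command = build_bash_command(script_next_to(__file__, "web_scraping.py"))
#   scrap_data_task = BashOperator(task_id='scrap_data', bash_command=command)
```

Using sys.executable instead of a bare python avoids depending on which python comes first on the scheduler's PATH.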