ModuleNotFoundError: No module named 'oracledb' when running GCP Dataflow jobs
We are trying to connect to an Oracle database from GCP Dataflow using a Python job template. Because we run our Dataflow jobs on a special subnet with no Internet access, we install the dependency packages from a GCS bucket via setup.py.
Below is the command line that creates the Dataflow template with setup.py:
```
python3 -m <python_file_name> --runner DataflowRunner --project <project_id> --staging_location <gcs_staging> --temp_location <gcs_temp> --template_location <gcs_template> --region <region> --setup_file=./setup.py
```
The dependency packages are stored in a GCS bucket; when the job runs they are copied to the Dataflow workers and installed there. For the Oracle database connection we use oracledb-1.0.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl, downloaded from PyPI.
When we try it in Cloud Shell with the DirectRunner, the oracledb module is installed and recognized successfully. However, when the job executes on Dataflow, it fails with the following error:
Error message from worker:

```
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/dataflow_worker/batchworker.py", line 772, in run
    self._load_main_session(self.local_staging_directory)
  File "/usr/local/lib/python3.9/site-packages/dataflow_worker/batchworker.py", line 509, in _load_main_session
    pickler.load_session(session_file)
  File "/usr/local/lib/python3.9/site-packages/apache_beam/internal/pickler.py", line 65, in load_session
    return desired_pickle_lib.load_session(file_path)
  File "/usr/local/lib/python3.9/site-packages/apache_beam/internal/dill_pickler.py", line 313, in load_session
    return dill.load_session(file_path)
  File "/usr/local/lib/python3.9/site-packages/dill/_dill.py", line 368, in load_session
    module = unpickler.load()
  File "/usr/local/lib/python3.9/site-packages/dill/_dill.py", line 472, in load
    obj = StockUnpickler.load(self)
  File "/usr/local/lib/python3.9/site-packages/dill/_dill.py", line 826, in _import_module
    return __import__(import_name)
ModuleNotFoundError: No module named 'oracledb'
```
Any suggestions would be greatly appreciated.
setup.py:
```python
import logging
import subprocess

import setuptools
from setuptools.command.install import install as _install


class install(_install):  # pylint: disable=invalid-name
    """Standard install, preceded by the custom download/install commands."""

    def run(self):
        self.run_command('CustomCommands')
        _install.run(self)


# Wheels staged in the GCS bucket, installed on each Dataflow worker.
WHEEL_PACKAGES = [
    'wheel-0.37.1-py2.py3-none-any.whl',
    'oracledb-1.0.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl',
]

CUSTOM_COMMANDS = [
    ['sudo', 'apt-get', 'update'],
]


class CustomCommands(setuptools.Command):
    """A setuptools Command class able to run arbitrary commands."""

    def initialize_options(self):
        pass

    def finalize_options(self):
        pass

    def run_command(self, command):
        logging.getLogger().setLevel(logging.INFO)
        status = -9999
        try:
            logging.info('CUSTOM_DATAFLOW_JOB_LOG: started running [{}]'.format(command))
            status = subprocess.call(command)
            if status == 0:
                logging.info('CUSTOM_DATAFLOW_JOB_LOG: [{}] completed successfully'.format(command))
            else:
                logging.error('CUSTOM_DATAFLOW_JOB_LOG: [{}] failed with status code {}'.format(command, status))
        except Exception as e:
            logging.error('CUSTOM_DATAFLOW_JOB_LOG: [{}] caught exception: {}'.format(command, e))
        return status

    def install_cmd(self):
        # For each wheel: copy it from the GCS bucket to the worker, then pip-install it.
        result = []
        for p in WHEEL_PACKAGES:
            result.append(['gsutil', 'cp', 'gs://dataflow-execution/python_dependencies/{}'.format(p), '.'])
            result.append(['pip', 'install', p])
        return result

    def run(self):
        logging.getLogger().setLevel(logging.INFO)
        command = None
        try:
            for command in CUSTOM_COMMANDS + self.install_cmd():
                status = self.run_command(command)
                if status == 0:
                    logging.info('CUSTOM_DATAFLOW_JOB_LOG: [{}] finished successfully'.format(command))
                else:
                    logging.error('CUSTOM_DATAFLOW_JOB_LOG: [{}] failed with status code {}'.format(command, status))
        except Exception as e:
            logging.error('CUSTOM_DATAFLOW_JOB_LOG: [{}] caught exception: {}'.format(command, e))


REQUIRED_PACKAGES = []

print("======\nRunning setup.py\n==========")

setuptools.setup(
    name='main_setup',
    version='1.0.0',
    description='DataFlow worker',
    install_requires=REQUIRED_PACKAGES,
    packages=setuptools.find_packages(),
    cmdclass={
        'install': install,
        'CustomCommands': CustomCommands,
    },
)
```
Have you verified that the Dataflow workers can definitely access that GCS bucket? That could be what is causing the problem here.
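One quick check (a sketch only; the bucket path comes from the setup.py above, and the worker service-account name is a placeholder) is to list the staged wheels while impersonating the workers' service account:

```
# Impersonate the Dataflow worker service account (placeholder name) and
# list the staged wheels; a 403 here means the workers cannot read them.
gsutil -i dataflow-worker@<project_id>.iam.gserviceaccount.com \
    ls gs://dataflow-execution/python_dependencies/
```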
In general, I believe the recommended path for this kind of thing is to use the --extra_package flag - https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#local-or-nonpypi - you may have more luck doing it that way.
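For illustration, reusing the template-creation command from the question, that could look like the following (the wheel must exist locally when the template is built; Dataflow then stages it and installs it on each worker, so no Internet access is needed at run time):

```
python3 -m <python_file_name> --runner DataflowRunner --project <project_id> \
    --staging_location <gcs_staging> --temp_location <gcs_temp> \
    --template_location <gcs_template> --region <region> \
    --extra_package ./oracledb-1.0.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
```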
Why does the subnet have no Internet access? You can create a router and a gateway (a Cloud NAT gateway) on Google Cloud so that the (Dataflow) VM IPs are not exposed externally to the Internet.
The router is created for the VPC network (your subnet lives in that VPC), and the NAT gateway is then created on that router.
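A minimal sketch of the two steps with gcloud (the router and gateway names are placeholders; `<vpc_network>` and `<region>` are the network and region your Dataflow subnet uses):

```
# Cloud Router on the VPC network that contains the Dataflow subnet
gcloud compute routers create nat-router \
    --network=<vpc_network> --region=<region>

# Cloud NAT gateway attached to that router, covering all subnet IP ranges
gcloud compute routers nats create nat-gateway \
    --router=nat-router --region=<region> \
    --auto-allocate-nat-external-ips --nat-all-subnet-ip-ranges
```

With this in place the workers keep private IPs only, but outbound traffic (for example to PyPI) is routed through the NAT gateway.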
After that, downloading packages from PyPI through the setup.py file becomes very easy. Here is an example setup.py that installs the oracledb package from PyPI:
```python
from setuptools import find_packages, setup

setup(
    name="lib",
    version="0.0.1",
    install_requires=['oracledb==1.0.3'],
    packages=find_packages(),
)
```
Dataflow will then install the package on the workers without any problem.