How to deploy Google Cloud Dataflow with connection to PostgreSQL (beam-nuggets) from Google Cloud Functions

I'm trying to create an ETL in GCP which will read part of the data from PostgreSQL and put it in a suitable form into BigQuery. I was able to perform this task by deploying Dataflow from my computer, but I failed to make it dynamic, so that it reads the last transferred record and transfers the next 100. So I figured I'll create the Dataflow jobs from a Cloud Function. Everything was working OK, and reading/writing to BigQuery works like a charm, but I'm stuck on the package required for PostgreSQL: beam-nuggets.
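
(Purely to illustrate the intended "last transferred record plus next 100" logic described above, a hedged sketch follows; the table and column names, and the idea of reading the high-water mark back from BigQuery, are assumptions, not code from my actual setup.)

# Hypothetical sketch of the incremental window: look up the highest id already
# loaded into BigQuery, then pull the next 100 rows from PostgreSQL.
from google.cloud import bigquery

def next_batch_query(bq_client, bq_table='mydataset.target_table',
                     source_table='source_table'):
    # Highest id already transferred (0 if the target table is still empty).
    last_id = list(bq_client.query(
        'SELECT COALESCE(MAX(id), 0) AS last_id FROM `{0}`'.format(bq_table)
    ).result())[0].last_id
    # SQL for the next 100 rows on the PostgreSQL side.
    return ('SELECT * FROM {0} WHERE id > {1} ORDER BY id LIMIT 100'
            .format(source_table, last_id))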

In the function I'm creating the pipeline arguments:

pipe_arguments = [
    '--project={0}'.format(PROJECT),
    '--staging_location=gs://xxx.appspot.com/staging/',
    '--temp_location=gs://xxx.appspot.com/temp/',
    '--runner=DataflowRunner',
    '--region=europe-west4',
    '--setup_file=./setup.py'
]

pipeline_options = PipelineOptions(pipe_arguments)
pipeline_options.view_as(SetupOptions).save_main_session = save_main_session

Then I create the pipeline:

pipeline = beam.Pipeline(argv=pipe_arguments)

and run it:

pipeline.run()
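
(For reference, here is a minimal sketch of how these pieces could fit together inside the Cloud Function. The entry-point name, the PostgreSQL connection values, the table names, and the BigQuery target are illustrative placeholders, not taken from my actual code.)

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions
from beam_nuggets.io import relational_db

PROJECT = 'xxx'  # placeholder project id, as in the snippets above

def run_etl(request):  # hypothetical HTTP entry point for the Cloud Function
    pipe_arguments = [
        '--project={0}'.format(PROJECT),
        '--staging_location=gs://xxx.appspot.com/staging/',
        '--temp_location=gs://xxx.appspot.com/temp/',
        '--runner=DataflowRunner',
        '--region=europe-west4',
        '--setup_file=./setup.py',  # the flag discussed below
    ]
    pipeline_options = PipelineOptions(pipe_arguments)
    pipeline_options.view_as(SetupOptions).save_main_session = True

    # Connection values are placeholders; beam-nuggets uses SQLAlchemy/pg8000.
    source_config = relational_db.SourceConfiguration(
        drivername='postgresql+pg8000',
        host='10.0.0.1',
        port=5432,
        username='user',
        password='password',
        database='mydb',
    )

    pipeline = beam.Pipeline(options=pipeline_options)
    (pipeline
     | 'ReadFromPostgres' >> relational_db.ReadFromDB(
           source_config=source_config,
           table_name='source_table',
           query='SELECT * FROM source_table ORDER BY id LIMIT 100')
     | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
           table='{0}:mydataset.target_table'.format(PROJECT),
           create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
    pipeline.run()  # submits the job; Dataflow continues asynchronously
    return 'Pipeline submitted'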

If I omit:

    '--setup_file=./setup.py'

everything is fine, except that Dataflow cannot use PostgreSQL, as the import:

from beam_nuggets.io import relational_db

fails.

When I add

    '--setup_file=./setup.py'

line, testing the function from the GCP Cloud Functions web portal returns:

Error: function terminated. Recommended action: inspect logs for termination reason. Details:
Full trace: Traceback (most recent call last):
  File "/env/local/lib/python3.7/site-packages/apache_beam/utils/processes.py", line 85, in check_output
    out = subprocess.check_output(*args, **kwargs)
  File "/opt/python3.7/lib/python3.7/subprocess.py", line 411, in check_output
    **kwargs).stdout
  File "/opt/python3.7/lib/python3.7/subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/env/bin/python3.7', 'setup.py', 'sdist', '--dist-dir', '/tmp/tmpxdvj0ulx']' returned non-zero exit status 1.
,          output of the failed child process b'running sdist\nrunning egg_info\ncreating example.egg-info\n'

Running

python setup.py sdist --dist-dir ./tmp/

from the local computer works OK.

setup.py is deployed to the Cloud Function along with the function code (main.py) and requirements.txt.

requirements.txt is used during function deployment and looks like this:

beam-nuggets==0.15.1
google-cloud-bigquery==1.17.1
apache-beam==2.19.0
google-cloud-dataflow==2.4.0
google-apitools==0.5.31

setup.py looks like this:

from setuptools import find_packages
from setuptools import setup

REQUIRED_PACKAGES = ['beam-nuggets>=0.15.1']

setup(
    name='example',
    version='0.1',
    install_requires=REQUIRED_PACKAGES,
    packages=find_packages(),
    include_package_data=True,
    description='example desc'
)

I've been stuck for a couple of days; I tried different setup.py approaches and tried to use requirements.txt instead of setup.py - no luck.

The log just says:

{
  insertId: "000000-88232bc6-6122-4ec8-a4f3-90e9775e89f6"
  labels: {
    execution_id: "78ml14shfolv"
  }
  logName: "projects/xxx/logs/cloudfunctions.googleapis.com%2Fcloud-functions"
  receiveTimestamp: "2020-07-13T12:08:35.898729649Z"
  resource: {
    labels: {
      function_name: "xxx"
      project_id: "xxx"
      region: "europe-west6"
    }
    type: "cloud_function"
  }
  severity: "INFO"
  textPayload: "Executing command: ['/env/bin/python3.7', 'setup.py', 'sdist', '--dist-dir', '/tmp/tmpxdvj0ulx']"
  timestamp: "2020-07-13T12:08:31.639Z"
  trace: "projects/xxx/traces/c9f1b1f68ed869f187e04ea672c487a4"
}
{
  insertId: "000000-3dfb239a-4067-4f9d-bd5f-bae5174e9dc7"
  labels: {
    execution_id: "78ml14shfolv"
  }
  logName: "projects/xxx/logs/cloudfunctions.googleapis.com%2Fcloud-functions"
  receiveTimestamp: "2020-07-13T12:08:35.898729649Z"
  resource: {
    labels: {
      function_name: "xxx"
      project_id: "xxx"
      region: "europe-west6"
    }
    type: "cloud_function"
  }
  severity: "DEBUG"
  textPayload: "Function execution took 7798 ms, finished with status: 'crash'"
  timestamp: "2020-07-13T12:08:35.663674738Z"
  trace: "projects/xxx/traces/c9f1b1f68ed869f187e04ea672c487a4"
}

Supplementary info:

If I'm using

'--requirements_file=./requirements.txt'

instead of

'--setup_file=./setup.py'

I'm getting:

Error: memory limit exceeded.

in the GCP Cloud Functions web portal while running the test function.

After I increased the memory to 2 GB, it says:

Error: function terminated. Recommended action: inspect logs for termination reason. Details:
Full traceback: Traceback (most recent call last):
  File "/env/local/lib/python3.7/site-packages/apache_beam/utils/processes.py", line 85, in check_output
    out = subprocess.check_output(*args, **kwargs)
  File "/opt/python3.7/lib/python3.7/subprocess.py", line 411, in check_output
    **kwargs).stdout
  File "/opt/python3.7/lib/python3.7/subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/env/bin/python3.7', '-m', 'pip', 'download', '--dest', '/tmp/dataflow-requirements-cache', '-r', './requirements.txt', '--exists-action', 'i', '--no-binary', ':all:']' returned non-zero exit status 1. 
 Pip install failed for package: -r         
 Output from execution of subprocess: b'Collecting beam-nuggets==0.15.1  
 Downloading beam-nuggets-0.15.1.tar.gz (17 kB)
  Saved /tmp/dataflow-requirements-cache/beam-nuggets-0.15.1.tar.gz
Collecting google-cloud-bigquery==1.17.1
  Downloading google-cloud-bigquery-1.17.1.tar.gz (228 kB)
  Saved /tmp/dataflow-requirements-cache/google-cloud-bigquery-1.17.1.tar.gz
Collecting apache-beam==2.19.0
  Downloading apache-beam-2.19.0.zip (1.9 MB)
  Saved /tmp/dataflow-requirements-cache/apache-beam-2.19.0.zip
Collecting google-cloud-dataflow==2.4.0
  Downloading google-cloud-dataflow-2.4.0.tar.gz (5.8 kB)
  Saved /tmp/dataflow-requirements-cache/google-cloud-dataflow-2.4.0.tar.gz
Collecting google-apitools==0.5.31
  Downloading google-apitools-0.5.31.tar.gz (173 kB)
  Saved /tmp/dataflow-requirements-cache/google-apitools-0.5.31.tar.gz
Collecting SQLAlchemy<2.0.0,>=1.2.14
  Downloading SQLAlchemy-1.3.18.tar.gz (6.0 MB)
  Saved /tmp/dataflow-requirements-cache/SQLAlchemy-1.3.18.tar.gz
Collecting sqlalchemy-utils<0.34,>=0.33.11
  Downloading SQLAlchemy-Utils-0.33.11.tar.gz (128 kB)
  Saved /tmp/dataflow-requirements-cache/SQLAlchemy-Utils-0.33.11.tar.gz
Collecting pg8000<2.0.0,>=1.12.4
  Downloading pg8000-1.16.0.tar.gz (75 kB)
  Saved /tmp/dataflow-requirements-cache/pg8000-1.16.0.tar.gz
Collecting PyMySQL<2.0.0,>=0.9.3
  Downloading PyMySQL-0.9.3.tar.gz (75 kB)
  Saved /tmp/dataflow-requirements-cache/PyMySQL-0.9.3.tar.gz
Collecting kafka>===1.3.5
  Downloading kafka-1.3.5.tar.gz (227 kB)
  Saved /tmp/dataflow-requirements-cache/kafka-1.3.5.tar.gz
Collecting google-cloud-core<2.0dev,>=1.0.0
 Downloading google-cloud-core-1.3.0.tar.gz (32 kB)
  Saved /tmp/dataflow-requirements-cache/google-cloud-core-1.3.0.tar.gz
Collecting google-resumable-media<0.5.0dev,>=0.3.1
  Downloading google-resumable-media-0.4.1.tar.gz (2.1 MB)
  Saved /tmp/dataflow-requirements-cache/google-resumable-media-0.4.1.tar.gz
Collecting protobuf>=3.6.0
  Downloading protobuf-3.12.2.tar.gz (265 kB)
  Saved /tmp/dataflow-requirements-cache/protobuf-3.12.2.tar.gz
Collecting crcmod<2.0,>=1.7
  Downloading crcmod-1.7.tar.gz (89 kB)
  Saved /tmp/dataflow-requirements-cache/crcmod-1.7.tar.gz
Collecting dill<0.3.2,>=0.3.1.1
  Downloading dill-0.3.1.1.tar.gz (151 kB)
  Saved /tmp/dataflow-requirements-cache/dill-0.3.1.1.tar.gz
Collecting fastavro<0.22,>=0.21.4
  Downloading fastavro-0.21.24.tar.gz (496 kB)
  Saved /tmp/dataflow-requirements-cache/fastavro-0.21.24.tar.gz
Collecting future<1.0.0,>=0.16.0
  Downloading future-0.18.2.tar.gz (829 kB)
  Saved /tmp/dataflow-requirements-cache/future-0.18.2.tar.gz
Collecting grpcio<2,>=1.12.1
  Downloading grpcio-1.30.0.tar.gz (19.7 MB)
    ERROR: Command errored out with exit status 1:
     command: /env/bin/python3.7 -c \'import sys, setuptools, tokenize; sys.argv[0] = \'"\'"\'/tmp/pip-download-yjpzrbur/grpcio/setup.py\'"\'"\'; __file__=\'"\'"\'/tmp/pip-download-yjpzrbur/grpcio/setup.py\'"\'"\';f=getattr(tokenize, \'"\'"\'open\'"\'"\', open)(__file__);code=f.read().replace(\'"\'"\'\\r\
\'"\'"\', \'"\'"\'\
\'"\'"\');f.close();exec(compile(code, __file__, \'"\'"\'exec\'"\'"\'))\' egg_info --egg-base /tmp/pip-download-yjpzrbur/grpcio/pip-egg-info
         cwd: /tmp/pip-download-yjpzrbur/grpcio/
    Complete output (11 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-download-yjpzrbur/grpcio/setup.py", line 196, in <module>
        if check_linker_need_libatomic():
      File "/tmp/pip-download-yjpzrbur/grpcio/setup.py", line 156, in check_linker_need_libatomic
        stderr=PIPE)
      File "/opt/python3.7/lib/python3.7/subprocess.py", line 800, in __init__
        restore_signals, start_new_session)
      File "/opt/python3.7/lib/python3.7/subprocess.py", line 1551, in _execute_child
        raise child_exception_type(errno_num, err_msg, err_filename)
    FileNotFoundError: [Errno 2] No such file or directory: \'cc\': \'cc\'
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
WARNING: You are using pip version 20.0.2; however, version 20.1.1 is available.
You should consider upgrading via the \'/env/bin/python3.7 -m pip install --upgrade pip\' command.
'

Logs in this case:

{
  insertId: "000000-5e4c10f4-d542-4631-8aaa-b9306d1390fd"
  labels: {
    execution_id: "15jww0sd8uyz"
  }
  logName: "projects/xxx/logs/cloudfunctions.googleapis.com%2Fcloud-functions"
  receiveTimestamp: "2020-07-13T14:01:33.505683371Z"
  resource: {
    labels: {
      function_name: "xxx"
      project_id: "xxx"
      region: "europe-west6"
    }
    type: "cloud_function"
  }
  severity: "DEBUG"
  textPayload: "Function execution took 18984 ms, finished with status: 'crash'"
  timestamp: "2020-07-13T14:01:32.953194652Z"
  trace: "projects/xxx/traces/262224a3d230cd9a66b1eebba3d7c3e0"
}

From the local machine, the Dataflow deployment works OK.

The command from the logs:

python -m pip download --dest ./tmp -r ./requirements.txt --exists-action i --no-binary :all:

also works OK, although it seems to download half of the internet over a couple of minutes, even if I reduce requirements.txt to beam-nuggets==0.15.1 only.

It gets stuck on

grpcio-1.30.0.tar.gz (19.7 MB)

specifically during the setup of this package, in this function (the 'cc' compiler it looks for is not available in the Cloud Functions runtime, hence the FileNotFoundError above):

def check_linker_need_libatomic():
    """Test if linker on system needs libatomic."""
    code_test = (b'#include <atomic>\n' +
                 b'int main() { return std::atomic<int64_t>{}; }')
    cc_test = subprocess.Popen(['cc', '-x', 'c++', '-std=c++11', '-'],
                               stdin=PIPE,
                               stdout=PIPE,
                               stderr=PIPE)
    cc_test.communicate(input=code_test)
    return cc_test.returncode != 0

I also tried GCP App Engine instead of Cloud Functions, with the same result; however, it directed me to the proper solution. Thanks to this and this, I was able to create an external package from beam-nuggets and include it using --extra_package instead of --setup_file or --requirements_file.
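
(For anyone hitting the same wall, a hedged sketch of that workaround, assuming the setup.py shown above (name='example', version='0.1'); the tarball name and path are whatever your local sdist build actually produces.)

# Build the helper package locally first (outside the Cloud Function), e.g.:
#   python setup.py sdist --dist-dir ./dist
# deploy the resulting tarball alongside main.py, and reference it via
# --extra_package instead of --setup_file / --requirements_file:
pipe_arguments = [
    '--project={0}'.format(PROJECT),
    '--staging_location=gs://xxx.appspot.com/staging/',
    '--temp_location=gs://xxx.appspot.com/temp/',
    '--runner=DataflowRunner',
    '--region=europe-west4',
    '--extra_package=./example-0.1.tar.gz',  # name assumed from the setup.py above
]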

The problem with the grpcio compilation (forced by the non-configurable --no-binary :all: option) remains. The problem with the weird setup.py error also remains.

But deployment from Cloud Functions to Dataflow (with dependencies) is working, so the problem is closed for me.

Update:

Just after that I was hit with this problem:

in _import_module return __import__(import_name) ModuleNotFoundError: No module named 'main'

Since I was not using any 'main' module, this was hard to track down: I also have to pack into the external package every function defined in my main.py file (hence the module name in the error). So the extra_package file contains all the external dependencies plus my own module in which my functions are stored.
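
(A hedged sketch of what that looks like in practice; the package and module names here, etl_transforms / transforms.py / MyParseFn, are illustrative, not the ones I actually used.)

# etl_transforms/transforms.py -- a module packed into the --extra_package tarball,
# so that nothing the Dataflow workers need to unpickle lives in the function's 'main'.
import apache_beam as beam

class MyParseFn(beam.DoFn):
    """Illustrative DoFn that previously lived in main.py."""
    def process(self, row):
        yield dict(row)  # placeholder transformation

# main.py then imports it from the packaged module instead of defining it inline:
#   from etl_transforms.transforms import MyParseFn
#   ... | 'Parse' >> beam.ParDo(MyParseFn()) | ...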
