Error with dill when triggering a Data Flow job from Cloud Function

Question

I'm writing a GCP Cloud Function that takes an input ID from a Pub/Sub message, processes it, and outputs a table to BigQuery.

The code is as follows:

from __future__ import absolute_import
import base64
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from scrapinghub import ScrapinghubClient
import os


def processing_data_function(element):
    # do stuff and return desired data
    ...

def create_data_from_id(job_id):
    # take scrapinghub's job id and extract the data through api
    ...

def run(event, context):
    """Triggered from a message on a Cloud Pub/Sub topic.
    Args:
         event (dict): Event payload.
         context (google.cloud.functions.Context): Metadata for the event.
    """
    # Take pubsub message and also Scrapinghub job's input id 
    pubsub_message = base64.b64decode(event['data']).decode('utf-8')  

    argv = ['--project=project-name',
            '--region=us-central1',
            '--runner=DataflowRunner',
            '--temp_location=gs://temp/location/',
            '--staging_location=gs://staging/location/']
    p = beam.Pipeline(options=PipelineOptions(argv))
    (p
        | 'Read from Scrapinghub' >> beam.Create(create_data_from_id(pubsub_message))
        | 'Trim b string' >> beam.FlatMap(processing_data_function)
        | 'Write Projects to BigQuery' >> beam.io.WriteToBigQuery(
                'table_name',
                schema=schema,
                # Creates the table in BigQuery if it does not yet exist.
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )
    p.run()


if __name__ == '__main__':
    run()

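Note that the two functions create_data_from_id and processing_data_function process data from Scrapinghub (a scraping site for Scrapy). They are quite lengthy, so I don't want to include them here. They are also unrelated to the error, because this code works if I run it from Cloud Shell and pass the arguments using argparse.ArgumentParser().

As for the error: deploying the code works without problems and the Pub/Sub message triggers the function successfully, but the Dataflow job fails and reports this error: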

"Error message from worker: Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/apache_beam/internal/pickler.py", line 279, in loads
    return dill.loads(s)
  File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 275, in loads
    return load(file, ignore, **kwds)
  File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 270, in load
    return Unpickler(file, ignore=ignore, **kwds).load()
  File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 472, in load
    obj = StockUnpickler.load(self)
  File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 826, in _import_module
    return __import__(import_name)
ModuleNotFoundError: No module named 'main'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 649, in do_work
    work_executor.execute()
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 179, in execute
    op.start()
  File "apache_beam/runners/worker/operations.py", line 662, in apache_beam.runners.worker.operations.DoOperation.start
  File "apache_beam/runners/worker/operations.py", line 664, in apache_beam.runners.worker.operations.DoOperation.start
  File "apache_beam/runners/worker/operations.py", line 665, in apache_beam.runners.worker.operations.DoOperation.start
  File "apache_beam/runners/worker/operations.py", line 284, in apache_beam.runners.worker.operations.Operation.start
  File "apache_beam/runners/worker/operations.py", line 290, in apache_beam.runners.worker.operations.Operation.start
  File "apache_beam/runners/worker/operations.py", line 611, in apache_beam.runners.worker.operations.DoOperation.setup
  File "apache_beam/runners/worker/operations.py", line 616, in apache_beam.runners.worker.operations.DoOperation.setup
  File "/usr/local/lib/python3.7/site-packages/apache_beam/internal/pickler.py", line 283, in loads
    return dill.loads(s)
  File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 275, in loads
    return load(file, ignore, **kwds)
  File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 270, in load
    return Unpickler(file, ignore=ignore, **kwds).load()
  File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 472, in load
    obj = StockUnpickler.load(self)
  File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 826, in _import_module
    return __import__(import_name)
ModuleNotFoundError: No module named 'main'

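What I have tried

Given that I can run the same pipeline from Cloud Shell, using an argument parser instead of specifying the options directly, I thought the way the options are specified was the problem. So I tried different combinations of options, with or without --save_main_session, --staging_location, --requirement_file=requirements.txt, --setup_file=setup.py ... They all reported more or less the same problem: dill doesn't know which module to pick up. With save_main_session specified, the main session couldn't be pickled. With requirement_file and setup_file specified, the job wasn't even created successfully, so I'll spare you from looking at those errors. My main problem is that I don't know where this issue comes from, because I have never used dill before. Why is running the pipeline from the shell so different from running it from a Cloud Function? Does anyone have a clue?

Thanks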

You could also try changing the last part to the following and test whether it works:

if __name__ == "__main__":
    ...

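Also, make sure you are executing the script from the correct folder, since the problem may be related to the naming or location of the files in the directory.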
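When checking this, it may help to look at which module name the functions actually end up with, since that depends on how the file is loaded. The sketch below is only an illustration with made-up names (entry_demo.py, process): it runs the same source once as a script and once as an import and compares the function's __module__. It assumes, based on the "No module named 'main'" message, that the Cloud Function host imports main.py rather than running it as a script.

```python
import importlib.util
import os
import runpy
import tempfile

# Hypothetical stand-in for main.py; the file and function names are made up.
SOURCE = "def process(x):\n    return x\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "entry_demo.py")
    with open(path, "w") as handle:
        handle.write(SOURCE)

    # Executed as a script, similar to `python entry_demo.py` in Cloud Shell.
    script_globals = runpy.run_path(path, run_name="__main__")

    # Imported by a host process, similar to a function framework or gunicorn.
    spec = importlib.util.spec_from_file_location("entry_demo", path)
    imported = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(imported)

# The same function reports a different home module in each case.
script_module_name = script_globals["process"].__module__
imported_module_name = imported.process.__module__
print(script_module_name)
print(imported_module_name)
```

If the file is imported, the code under the `if __name__ == "__main__":` guard does not run, and the functions belong to a module named after the file instead of `__main__`.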

Consider the following sources, which may be helpful: Source 1, Source 2

I hope this information helps.

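You are probably launching the application on Cloud Run with gunicorn (as is standard practice), for example:

CMD exec gunicorn --bind :$PORT --workers 1 --threads 8 --timeout 0 main:app

I ran into the same problem and found a workaround, which is to launch the application without gunicorn: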

CMD exec python3 main.py

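This is probably because gunicorn skips the main context and launches the main:app object directly. I don't know how to fix it with gunicorn.

=== Additional note ===

I found a way to make it work with gunicorn.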
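To make the "No module named 'main'" part more concrete, here is a minimal sketch of the mechanism as I understand it. It uses the standard pickle module as a stand-in for dill, and a throwaway module called fake_main in place of main.py (both are assumptions for illustration, not Beam code). A function that lives in an importable module is serialized by reference, that is, as a module name plus a function name, so loading it fails on any machine where that module cannot be imported.

```python
import pickle
import sys
import types

# Build a throwaway module that plays the role of main.py as imported by
# gunicorn or the Cloud Function host (hypothetical name: fake_main).
module = types.ModuleType("fake_main")
exec("def process(x):\n    return x * 2\n", module.__dict__)
sys.modules["fake_main"] = module

# The function is pickled by reference: the payload only names
# the module and the function, it does not contain the code.
payload = pickle.dumps(module.process)

# Simulate a Dataflow worker on which the module is not installed.
del sys.modules["fake_main"]


def load_on_worker(data):
    """Try to unpickle; return the exception if the module is missing."""
    try:
        return pickle.loads(data)
    except ModuleNotFoundError as exc:
        return exc


result = load_on_worker(payload)
print(type(result).__name__, result)
```

When the file is run directly with python3 main.py, the functions belong to `__main__` instead, which, as far as I know, dill handles by serializing the function itself rather than a reference. That would explain why the same pipeline works from a shell.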

  1. Move the function (the one that launches the pipeline) to another module, e.g. df_pipeline/pipe.py

.
├── df_pipeline
│   ├── __init__.py
│   └── pipe.py
├── Dockerfile
├── main.py
├── requirements.txt
└── setup.py

# in main.py
from df_pipeline import pipe
result = pipe.preprocess(....)

  2. Create setup.py in the same directory as main.py

# setup.py
import setuptools
setuptools.setup(
    name='df_pipeline',
    install_requires=[],
    packages=setuptools.find_packages(include=['df_pipeline']),
)

  3. In df_pipeline/pipe.py, set the pipeline option setup_file to ./setup.py
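The steps above can be sketched roughly as follows for df_pipeline/pipe.py. This is only a sketch under assumptions: the option values are the placeholders from the question, and double and build_pipeline_args are hypothetical helpers, not part of the original code. The key line is the --setup_file flag, which asks Dataflow to install the df_pipeline package on the workers so that functions referenced as df_pipeline.pipe.* can be imported there.

```python
# df_pipeline/pipe.py (sketch)


def build_pipeline_args(setup_file="./setup.py"):
    """Return the Dataflow option flags, including --setup_file."""
    return [
        "--project=project-name",
        "--region=us-central1",
        "--runner=DataflowRunner",
        "--temp_location=gs://temp/location/",
        "--staging_location=gs://staging/location/",
        # Packages df_pipeline and ships it to the workers.
        "--setup_file=" + setup_file,
    ]


def double(x):
    """Hypothetical transform; it lives in df_pipeline.pipe, not in main."""
    return [x * 2]


def preprocess(values, argv=None):
    """Build and submit the pipeline; called from main.py."""
    # Imported here so the helpers above can be used without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(argv if argv is not None else build_pipeline_args())
    p = beam.Pipeline(options=options)
    (p
        | "Create values" >> beam.Create(values)
        | "Double" >> beam.FlatMap(double))
    return p.run()
```

With this layout, main.py only calls pipe.preprocess(...), and every function used in the pipeline is defined inside the df_pipeline package that setup.py describes.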
