GCP Dataflow with Python: "AttributeError: Can't get attribute '_JsonSink' on module 'dataflow_worker.start'"
I am new to GCP Dataflow.
I am trying to read text files (each a one-line JSON string) from GCP Cloud Storage into JSON format, split them based on the value of a certain field, and write the output back to GCP Cloud Storage (as JSON-string text files).
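For reference, the per-record logic I want (parse a one-line JSON string and choose an output based on one field) can be sketched in plain Python, independent of Beam; the field name "category" here is just a made-up placeholder for whatever field the split is based on:

```python
import json

def route_record(line, field="category"):
    """Parse a one-line JSON string and return (field_value, record).

    field_value is used to decide which output the record goes to;
    "category" is a placeholder name, not the real field.
    """
    record = json.loads(line)
    return record.get(field), record

# Two records that should end up in different outputs:
key_a, rec_a = route_record('{"category": "a", "value": 1}')
key_b, rec_b = route_record('{"category": "b", "value": 2}')
```

In the actual pipeline this logic would sit inside a DoFn (or a partition function), but the routing itself is just ordinary JSON parsing.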
However, I encounter the following error on GCP Dataflow:
Traceback (most recent call last):
File "main.py", line 169, in <module>
run()
File "main.py", line 163, in run
shard_name_template='')
File "C:\ProgramData\Miniconda3\lib\site-packages\apache_beam\pipeline.py", line 426, in __exit__
self.run().wait_until_finish()
File "C:\ProgramData\Miniconda3\lib\site-packages\apache_beam\runners\dataflow\dataflow_runner.py", line 1346, in wait_until_finish
(self.state, getattr(self._runner, 'last_error_msg', None)), self)
apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 773, in run
self._load_main_session(self.local_staging_directory)
File "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 489, in _load_main_session
pickler.load_session(session_file)
File "/usr/local/lib/python3.7/site-packages/apache_beam/internal/pickler.py", line 287, in load_session
return dill.load_session(file_path)
File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 410, in load_session
module = unpickler.load()
File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 474, in find_class
return StockUnpickler.find_class(self, module, name)
AttributeError: Can't get attribute '_JsonSink' on <module 'dataflow_worker.start' from '/usr/local/lib/python3.7/site-packages/dataflow_worker/start.py'>
I am able to run this script locally, but it fails when I try to use DataflowRunner.
Please give me some suggestions.
PS. apache-beam version: 2.15.0
[Update 1]
I tried @Yueyang Qiu's suggestion and added
pipeline_options.view_as(SetupOptions).save_main_session = True
The provided link says:
"DoFns in this workflow rely on global context (e.g., a module imported at module level)."
This link supports the suggestion above.
However, the same error occurred.
So I am wondering whether my implementation of _JsonSink (which inherits from filebasedsink.FileBasedSink) is wrong, or whether something else needs to be added.
Any opinions would be appreciated. Thank you all!
You have encountered a known issue: currently (as of the 2.17.0 release), Beam does not support super() calls in the main module on Python 3. Please take a look at the possible solutions in BEAM-6158. Udi's answer is a good way to address this until BEAM-6158 is resolved; that way you don't have to run your pipeline on Python 2.
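For context, the issue concerns pickling classes defined in the main module that use the zero-argument super() form. As a hedged illustration (this is not the actual _JsonSink code), the zero-argument call can be rewritten to the explicit two-argument form, which also works on Python 2:

```python
class Base(object):
    def __init__(self, name):
        self.name = name

# Python 3-only zero-argument form (the pattern the issue trips on
# when the class is defined in the main module):
class SinkPy3(Base):
    def __init__(self):
        super().__init__("sink")

# Explicit two-argument form, valid on both Python 2 and Python 3:
class SinkCompat(Base):
    def __init__(self):
        super(SinkCompat, self).__init__("sink")
```

Both classes behave identically; only the super() spelling differs, and the explicit form avoids the pickling problem described in BEAM-6158.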
Using the guidelines from here, I managed to get your example to run.
Directory structure:
./setup.py
./dataflow_json
./dataflow_json/dataflow_json.py (no change from your example)
./dataflow_json/__init__.py (empty file)
./main.py
setup.py:
import setuptools

setuptools.setup(
    name='dataflow_json',
    version='1.0',
    install_requires=[],
    packages=setuptools.find_packages(),
)
main.py:
from __future__ import absolute_import

from dataflow_json import dataflow_json

if __name__ == '__main__':
    dataflow_json.run()
and you run the pipeline with python main.py.
Basically, what's happening is that the '--setup_file=./setup.py' flag tells Beam to create a package and install it on the Dataflow remote workers. The __init__.py file is required for setuptools to identify the dataflow_json/ directory as a package.
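To see why the empty __init__.py matters, you can check what setuptools.find_packages() discovers with and without it. This sketch builds a throwaway copy of the layout above in a temporary directory (the paths are a sandbox, not the real project):

```python
import os
import tempfile

import setuptools

# Recreate the layout in a temporary sandbox.
root = tempfile.mkdtemp()
pkg = os.path.join(root, "dataflow_json")
os.makedirs(pkg)

# Without __init__.py, find_packages() does not see the directory.
before = setuptools.find_packages(where=root)

# After adding the empty __init__.py, it is recognized as a package.
open(os.path.join(pkg, "__init__.py"), "w").close()
after = setuptools.find_packages(where=root)
```

This is exactly why the (empty) dataflow_json/__init__.py file in the directory structure above cannot be omitted.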
I finally found the problem:
The _JsonSink class I implemented uses some features from Python 3.
However, I was not aware of which Python version DataflowRunner was using. (Actually, I still have not figured out how to specify the Python version for the Dataflow runner on GCP. Any suggestions?)
Hence, I rewrote my code to a Python 2-compatible version, and now everything works fine!
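As a made-up illustration of the kind of rewrite involved (not the actual _JsonSink code): f-strings and keyword-only arguments are Python 3-only syntax, while str.format() and ordinary keyword defaults work on both versions:

```python
class Base(object):
    def __init__(self, path):
        self.path = path

# Python 3-only style (hypothetical original):
#   class JsonSink(Base):
#       def __init__(self, path, *, coder=None):
#           super().__init__(path)
#           self.label = f"sink:{path}"

# Python 2-compatible rewrite of the same class:
class JsonSink(Base):
    def __init__(self, path, coder=None):
        super(JsonSink, self).__init__(path)
        self.label = "sink:{}".format(path)
```

The behavior is unchanged; only Python 3-only syntax is replaced with forms both interpreters accept.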
Thanks to all of you!
Can you try setting the option save_main_session = True, as done here: https://github.com/apache/beam/blob/a2b0ad14f1525d1a645cb26f5b8ec45692d9d54e/sdks/python/apache_beam/examples/cookbook/coders.py#L88 ?