
Running Apache Beam pipeline on Dataflow fires an error (DirectRunner runs with no issue)

A pipeline that was running perfectly fires an error when using Dataflow, so I tried a simple pipeline and got the same error.

The same pipeline runs with no issues on DirectRunner. The execution environment is Google Datalab.

Please let me know if there is anything I need to change or update in my environment, or if you have any other advice.

Many thanks, e

import apache_beam as beam
from apache_beam.pipeline import PipelineOptions
from apache_beam.options.pipeline_options import GoogleCloudOptions
from apache_beam.options.pipeline_options import StandardOptions

BUCKET_URL = 'gs://archs4'  # per the inline comments below

options = PipelineOptions()
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = 'PROJECT-ID'
google_cloud_options.job_name = 'try-debug'
google_cloud_options.staging_location = '%s/staging' % BUCKET_URL #'gs://archs4/staging'
google_cloud_options.temp_location = '%s/tmp' % BUCKET_URL #'gs://archs4/temp'
options.view_as(StandardOptions).runner = 'DataflowRunner'  

p1 = beam.Pipeline(options=options)

(p1 | 'read' >> beam.io.ReadFromText('gs://dataflow-samples/shakespeare/kinglear.txt')
    | 'write' >> beam.io.WriteToText('gs://bucket/test.txt', num_shards=1)
 )

p1.run().wait_until_finish()

will fire the following error:

CalledProcessErrorTraceback (most recent call last)
<ipython-input-17-b4be63f7802f> in <module>()
      5  )
      6 
----> 7 p1.run().wait_until_finish()

/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/pipeline.pyc in run(self, test_runner_api)
    174       finally:
    175         shutil.rmtree(tmpdir)
--> 176     return self.runner.run(self)
    177 
    178   def __enter__(self):

/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/runners/dataflow/dataflow_runner.pyc in run(self, pipeline)
    250     # Create the job
    251     result = DataflowPipelineResult(
--> 252         self.dataflow_client.create_job(self.job), self)
    253 
    254     self._metrics = DataflowMetrics(self.dataflow_client, result, self.job)

/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/utils/retry.pyc in wrapper(*args, **kwargs)
    166       while True:
    167         try:
--> 168           return fun(*args, **kwargs)
    169         except Exception as exn:  # pylint: disable=broad-except
    170           if not retry_filter(exn):

/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/runners/dataflow/internal/apiclient.pyc in create_job(self, job)
    423   def create_job(self, job):
    424     """Creates job description. May stage and/or submit for remote execution."""
--> 425     self.create_job_description(job)
    426 
    427     # Stage and submit the job when necessary

/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/runners/dataflow/internal/apiclient.pyc in create_job_description(self, job)
    446     """Creates a job described by the workflow proto."""
    447     resources = dependency.stage_job_resources(
--> 448         job.options, file_copy=self._gcs_file_copy)
    449     job.proto.environment = Environment(
    450         packages=resources, options=job.options,

/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/runners/dataflow/internal/dependency.pyc in stage_job_resources(options, file_copy, build_setup_args, temp_dir, populate_requirements_cache)
    377       else:
    378         sdk_remote_location = setup_options.sdk_location
--> 379       _stage_beam_sdk_tarball(sdk_remote_location, staged_path, temp_dir)
    380       resources.append(names.DATAFLOW_SDK_TARBALL_FILE)
    381     else:

/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/runners/dataflow/internal/dependency.pyc in _stage_beam_sdk_tarball(sdk_remote_location, staged_path, temp_dir)
    462   elif sdk_remote_location == 'pypi':
    463     logging.info('Staging the SDK tarball from PyPI to %s', staged_path)
--> 464     _dependency_file_copy(_download_pypi_sdk_package(temp_dir), staged_path)
    465   else:
    466     raise RuntimeError(

/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/runners/dataflow/internal/dependency.pyc in _download_pypi_sdk_package(temp_dir)
    525       '--no-binary', ':all:', '--no-deps']
    526   logging.info('Executing command: %s', cmd_args)
--> 527   processes.check_call(cmd_args)
    528   zip_expected = os.path.join(
    529       temp_dir, '%s-%s.zip' % (package_name, version))

/usr/local/envs/py2env/lib/python2.7/site-packages/apache_beam/utils/processes.pyc in check_call(*args, **kwargs)
     42   if force_shell:
     43     kwargs['shell'] = True
---> 44   return subprocess.check_call(*args, **kwargs)
     45 
     46 

/usr/local/envs/py2env/lib/python2.7/subprocess.pyc in check_call(*popenargs, **kwargs)
    188         if cmd is None:
    189             cmd = popenargs[0]
--> 190         raise CalledProcessError(retcode, cmd)
    191     return 0
    192 

CalledProcessError: Command '['/usr/local/envs/py2env/bin/python', '-m', 'pip', 'install', '--download', '/tmp/tmpyyiizo', 'google-cloud-dataflow==2.0.0', '--no-binary', ':all:', '--no-deps']' returned non-zero exit status 2

I was able to run your job with DataflowRunner without any problem from a Jupyter notebook (not Datalab per se).

As of this writing, I am using the latest version (v2.6.0) of the apache_beam[gcp] Python SDK. Could you retry with v2.6.0 instead of v2.0.0?
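
If you are working inside a notebook, a quick way to confirm which SDK version the kernel actually imports, and to upgrade it in place, is a cell like the one below. This is a minimal sketch: the !pip line assumes a notebook/Datalab-style shell escape, and the kernel typically needs a restart after the upgrade.

# Check the SDK version the kernel is importing
import apache_beam
print(apache_beam.__version__)

# Upgrade to the newer SDK (restart the kernel afterwards)
!pip install --upgrade 'apache-beam[gcp]==2.6.0'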

Here's what I ran:

import apache_beam as beam
from apache_beam.pipeline import PipelineOptions
from apache_beam.options.pipeline_options import GoogleCloudOptions
from apache_beam.options.pipeline_options import StandardOptions

BUCKET_URL = "gs://YOUR_BUCKET_HERE/test"

import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'PATH_TO_YOUR_SERVICE_ACCOUNT_JSON_CREDS'

options = PipelineOptions()
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = 'YOUR_PROJECT_ID_HERE'
google_cloud_options.job_name = 'try-debug'
google_cloud_options.staging_location = '%s/staging' % BUCKET_URL #'gs://archs4/staging'
google_cloud_options.temp_location = '%s/tmp' % BUCKET_URL #'gs://archs4/temp'
options.view_as(StandardOptions).runner = 'DataflowRunner'  

p1 = beam.Pipeline(options=options)

(p1 | 'read' >> beam.io.ReadFromText('gs://dataflow-samples/shakespeare/kinglear.txt')
    | 'write' >> beam.io.WriteToText('gs://bucket/test.txt', num_shards=1)
 )

p1.run().wait_until_finish()

And here's proof that it ran (screenshot of the submitted job in the Dataflow monitoring console):

The job failed, as expected, because I don't have write access to 'gs://bucket/test.txt' - you can also see this in the stacktrace at the bottom left of the screenshot. But the job was successfully submitted to Google Cloud Dataflow, and it ran.
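
If you want the job to succeed end to end rather than just submit, point the write step at a bucket you can write to. A small sketch, reusing the BUCKET_URL placeholder from the snippet above (a hypothetical output prefix - substitute your own bucket):

(p1 | 'read' >> beam.io.ReadFromText('gs://dataflow-samples/shakespeare/kinglear.txt')
    | 'write' >> beam.io.WriteToText('%s/output/kinglear' % BUCKET_URL, num_shards=1)
 )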
