Dataflow reading from PubSub works at GCP, can't run locally

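I have a small test Dataflow job that just reads from a PubSub subscription and discards the messages. We are using it to start some proof-of-concept work.

It runs fine on GCP but fails locally. My expectation was that the same code should work either way, just by switching the runner, but maybe that is not the case? Here is the code: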

import os
from datetime import datetime
import logging

from apache_beam import Map, io, Pipeline
from apache_beam.options.pipeline_options import PipelineOptions

def noop(element):
    # Discard every message; the job only checks that reading from Pub/Sub works.
    pass

def run(input_subscription, pipeline_args=None):
    pipeline_options = PipelineOptions(
        pipeline_args, streaming=True, save_main_session=True
    )

    with Pipeline(options=pipeline_options) as pipeline:
        (
            pipeline
            | "Read from Pub/Sub" >> io.ReadFromPubSub(subscription=input_subscription, with_attributes=True)
            | "noop" >> Map(noop)
        )


if __name__ == "__main__":
    logging.getLogger().setLevel(logging.INFO)

    run(
        os.environ['INPUT_SUBSCRIPTION'],
        [
            '--runner', os.getenv('RUNNER', 'DirectRunner'),
            '--project', os.getenv('PROJECT'),
            '--region', os.getenv('REGION'),
            '--temp_location', os.getenv('TEMP_LOCATION'),
            '--service_account_email', os.getenv('SERVICE_ACCOUNT_EMAIL'),
            '--network', os.getenv('NETWORK'),
            '--subnetwork', os.getenv('SUBNETWORK'),
            '--num_workers', os.getenv('NUM_WORKERS'),
        ]
    )

If I run it with this command line, it creates and runs the job in Google Cloud:

INPUT_SUBSCRIPTION=subscriptionname \
RUNNER=DataflowRunner \
PROJECT=project \
REGION=region \
TEMP_LOCATION=gs://somewhere/temp \
SERVICE_ACCOUNT_EMAIL=serviceaccount@project.iam.gserviceaccount.com \
NETWORK=network \
SUBNETWORK=https://www.googleapis.com/compute/v1/projects/project/regions/region/subnetworks/subnetwork \
NUM_WORKERS=3 \
python read-pubsub-with-dataflow.py

If I omit the RUNNER option, it uses DirectRunner:

INPUT_SUBSCRIPTION=subscriptionname \
PROJECT=project \
REGION=region \
TEMP_LOCATION=gs://somewhere/temp \
SERVICE_ACCOUNT_EMAIL=serviceaccount@project.iam.gserviceaccount.com \
NETWORK=network \
SUBNETWORK=https://www.googleapis.com/compute/v1/projects/project/regions/region/subnetworks/subnetwork \
NUM_WORKERS=3 \
python read-pubsub-with-dataflow.py

It fails with a large number of error messages, but I am only including the first one (I think the rest are just cascading failures):

INFO:apache_beam.runners.direct.direct_runner:Running pipeline with DirectRunner.
/Users/denis/redacted/env/lib/python3.6/site-packages/google/auth/_default.py:70: UserWarning: Your application has authenticated using end user credentials from Google Cloud SDK without a quota project. You might receive a "quota exceeded" or "API not enabled" error. We recommend you rerun `gcloud auth application-default login` and make sure a quota project is added. Or you can use service accounts instead. For more information about service accounts, see https://cloud.google.com/docs/authentication/
  warnings.warn(_CLOUD_SDK_CREDENTIALS_WARNING)
ERROR:apache_beam.runners.direct.executor:Exception at bundle <apache_beam.runners.direct.bundle_factory._Bundle object at 0x7fed3e368448>, due to an exception.
 Traceback (most recent call last):
  File "/Users/denis/redacted/env/lib/python3.6/site-packages/apache_beam/runners/direct/transform_evaluator.py", line 694, in _read_from_pubsub
    self._sub_name, max_messages=10, return_immediately=True)
  File "/Users/denis/redacted/env/lib/python3.6/site-packages/google/cloud/pubsub_v1/_gapic.py", line 40, in <lambda>
    fx = lambda self, *a, **kw: wrapped_fx(self.api, *a, **kw)  # noqa
  File "/Users/denis/redacted/env/lib/python3.6/site-packages/google/pubsub_v1/services/subscriber/client.py", line 1106, in pull
    "If the `request` argument is set, then none of "
ValueError: If the `request` argument is set, then none of the individual field arguments should be set.

During handling of the above exception, another exception occurred:
...etc...

I suspect this may be related to credentials? Or to our project configuration? Maybe I should try a fresh, empty project.
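The ValueError in the traceback is raised by argument validation inside the client library's pull() method, which is not the usual shape of a credentials failure. The traceback shows the DirectRunner calling pull with the subscription name as a positional argument plus keyword field arguments. Below is a minimal sketch of how that combination trips such a check. The pull function here is a hypothetical stand-in whose signature is an assumption modeled on the error message, not the real google-cloud-pubsub code.

```python
def pull(request=None, *, subscription=None, max_messages=None,
         return_immediately=None):
    """Hypothetical stand-in for a client method that accepts EITHER a
    request object OR individual field arguments, never both."""
    has_field_args = any(
        value is not None
        for value in (subscription, max_messages, return_immediately)
    )
    if request is not None and has_field_args:
        raise ValueError(
            "If the `request` argument is set, then none of "
            "the individual field arguments should be set."
        )
    return "ok"


def call_like_traceback(sub_name):
    """Call pull() the way the traceback shows: subscription name passed
    positionally (so it binds to `request`) plus keyword field arguments."""
    try:
        pull(sub_name, max_messages=10, return_immediately=True)
    except ValueError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(call_like_traceback("projects/project/subscriptions/subscriptionname"))
```

A caller written for a signature whose first positional parameter is the subscription would hit this check on every call, which points at mismatched library versions rather than at credentials or project setup.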

It turned out to be a package version incompatibility. My requirements.txt was:

apache_beam[gcp]
google_apitools
google-cloud-pubsub

But that was installing a version of the google-cloud-pubsub package that broke apache_beam. I changed requirements.txt to:

apache_beam[gcp]
google_apitools

And now everything works!
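To see which versions actually ended up in the virtualenv, a small check like the one below may help. It is only a sketch: it assumes Python 3.8+ for importlib.metadata (on the Python 3.6 shown in the traceback, the importlib_metadata backport provides the same calls), and the helper names installed_version and report_versions are made up for illustration.

```python
from importlib import metadata  # Python 3.8+; assumption, see the note above


def installed_version(package_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


def report_versions(package_names):
    """Map each distribution name to its installed version (or None)."""
    return {name: installed_version(name) for name in package_names}


if __name__ == "__main__":
    versions = report_versions(["apache-beam", "google-cloud-pubsub"])
    for name, version in versions.items():
        print(f"{name}: {version or 'not installed'}")
```

Running it before and after the requirements.txt change should show whether the google-cloud-pubsub version differs between the two installs.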

For what it's worth, when running locally with DirectRunner I obviously don't need many of the options that DataflowRunner requires. This is enough:

GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json \
RUNNER=DirectRunner \
INPUT_SUBSCRIPTION=projects/mytopic/subscriptions/mysubscription \
python read-pubsub-with-dataflow.py
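Since most of those options are optional for a local run, one possible refinement is to build the argument list only from the environment variables that are actually set, so unset ones are never passed on as None. This is a sketch, not part of the original script: the helper name build_pipeline_args and the ENV_TO_FLAG mapping are assumptions, though the flags themselves are the ones used above.

```python
import os

# Mapping from the environment variable names used above to pipeline flags.
# The mapping itself is an illustrative assumption.
ENV_TO_FLAG = {
    "PROJECT": "--project",
    "REGION": "--region",
    "TEMP_LOCATION": "--temp_location",
    "SERVICE_ACCOUNT_EMAIL": "--service_account_email",
    "NETWORK": "--network",
    "SUBNETWORK": "--subnetwork",
    "NUM_WORKERS": "--num_workers",
}


def build_pipeline_args(env=None):
    """Build pipeline arguments, skipping options whose variable is unset or empty."""
    env = os.environ if env is None else env
    # Default to DirectRunner when RUNNER is not set, as in the original script.
    args = ["--runner", env.get("RUNNER", "DirectRunner")]
    for var, flag in ENV_TO_FLAG.items():
        value = env.get(var)
        if value:
            args += [flag, value]
    return args


if __name__ == "__main__":
    print(build_pipeline_args())
```

The result could then be passed as the second argument to run() in place of the hard-coded list.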
