
Dataflow reading from PubSub works at GCP, can't run locally

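I have a small test Dataflow job that just reads from a PubSub subscription and discards the messages. We are using it to start some proof-of-concept work.

It runs fine on GCP but fails locally. My expectation was that the same code should work either way, just by switching the Dataflow runner, but perhaps that's not the case? Here is the code: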

import logging
import os

from apache_beam import Map, io, Pipeline
from apache_beam.options.pipeline_options import PipelineOptions


def noop(element):
    # Discard every message; this job only checks that reading works.
    pass

def run(input_subscription, pipeline_args=None):
    pipeline_options = PipelineOptions(
        pipeline_args, streaming=True, save_main_session=True
    )

    with Pipeline(options=pipeline_options) as pipeline:
        (
            pipeline
            | "Read from Pub/Sub" >> io.ReadFromPubSub(subscription=input_subscription, with_attributes=True)
            | "noop" >> Map(noop)
        )


if __name__ == "__main__":
    logging.getLogger().setLevel(logging.INFO)

    run(
        os.environ['INPUT_SUBSCRIPTION'],
        [
            '--runner', os.getenv('RUNNER', 'DirectRunner'),
            '--project', os.getenv('PROJECT'),
            '--region', os.getenv('REGION'),
            '--temp_location', os.getenv('TEMP_LOCATION'),
            '--service_account_email', os.getenv('SERVICE_ACCOUNT_EMAIL'),
            '--network', os.getenv('NETWORK'),
            '--subnetwork', os.getenv('SUBNETWORK'),
            '--num_workers', os.getenv('NUM_WORKERS'),
        ]
    )

If I run it with this command line, it creates and runs the job in Google Cloud:

INPUT_SUBSCRIPTION=subscriptionname \
RUNNER=DataflowRunner \
PROJECT=project \
REGION=region \
TEMP_LOCATION=gs://somewhere/temp \
SERVICE_ACCOUNT_EMAIL=serviceaccount@project.iam.gserviceaccount.com \
NETWORK=network \
SUBNETWORK=https://www.googleapis.com/compute/v1/projects/project/regions/region/subnetworks/subnetwork \
NUM_WORKERS=3 \
python read-pubsub-with-dataflow.py

If I omit the RUNNER option, it uses DirectRunner:

INPUT_SUBSCRIPTION=subscriptionname \
PROJECT=project \
REGION=region \
TEMP_LOCATION=gs://somewhere/temp \
SERVICE_ACCOUNT_EMAIL=serviceaccount@project.iam.gserviceaccount.com \
NETWORK=network \
SUBNETWORK=https://www.googleapis.com/compute/v1/projects/project/regions/region/subnetworks/subnetwork \
NUM_WORKERS=3 \
python read-pubsub-with-dataflow.py

It fails with a lot of error messages, but I've only included the first (I think the rest are just cascading):

INFO:apache_beam.runners.direct.direct_runner:Running pipeline with DirectRunner.
/Users/denis/redacted/env/lib/python3.6/site-packages/google/auth/_default.py:70: UserWarning: Your application has authenticated using end user credentials from Google Cloud SDK without a quota project. You might receive a "quota exceeded" or "API not enabled" error. We recommend you rerun `gcloud auth application-default login` and make sure a quota project is added. Or you can use service accounts instead. For more information about service accounts, see https://cloud.google.com/docs/authentication/
  warnings.warn(_CLOUD_SDK_CREDENTIALS_WARNING)
ERROR:apache_beam.runners.direct.executor:Exception at bundle <apache_beam.runners.direct.bundle_factory._Bundle object at 0x7fed3e368448>, due to an exception.
 Traceback (most recent call last):
  File "/Users/denis/redacted/env/lib/python3.6/site-packages/apache_beam/runners/direct/transform_evaluator.py", line 694, in _read_from_pubsub
    self._sub_name, max_messages=10, return_immediately=True)
  File "/Users/denis/redacted/env/lib/python3.6/site-packages/google/cloud/pubsub_v1/_gapic.py", line 40, in <lambda>
    fx = lambda self, *a, **kw: wrapped_fx(self.api, *a, **kw)  # noqa
  File "/Users/denis/redacted/env/lib/python3.6/site-packages/google/pubsub_v1/services/subscriber/client.py", line 1106, in pull
    "If the `request` argument is set, then none of "
ValueError: If the `request` argument is set, then none of the individual field arguments should be set.

During handling of the above exception, another exception occurred:
...etc...

I suspect this may be related to credentials? Or our project configuration? Maybe I should try a fresh, blank project.

It turned out to be a package version incompatibility. My requirements.txt was:

apache_beam[gcp]
google_apitools
google-cloud-pubsub

But that installed a version of the google-cloud-pubsub package that breaks apache_beam. I changed requirements.txt to:

apache_beam[gcp]
google_apitools

Now everything works!
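The fix presumably works because the `gcp` extra of apache_beam already pulls in a google-cloud-pubsub release that Beam is compatible with, while listing google-cloud-pubsub on its own line lets pip pick a newer one. To confirm which versions actually ended up in the environment, something like the following sketch may help. It assumes Python 3.8 or later (for `importlib.metadata`); the helper name `installed_versions` is made up for illustration. On older interpreters, `pip show google-cloud-pubsub` gives the same information.

```python
from importlib import metadata


def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # The three packages named in the requirements.txt above.
    report = installed_versions(
        ["apache-beam", "google-cloud-pubsub", "google-apitools"]
    )
    for name, version in report.items():
        print(f"{name}: {version or 'not installed'}")
```

Running this before and after changing requirements.txt (in a fresh virtualenv each time) shows whether the google-cloud-pubsub version differs between the two setups.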

For what it's worth, running locally with DirectRunner evidently doesn't need many of the options that DataflowRunner requires. This is enough:

GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json \
RUNNER=DirectRunner \
INPUT_SUBSCRIPTION=projects/myproject/subscriptions/mysubscription \
python read-pubsub-with-dataflow.py
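Since the script in the question passes every `os.getenv(...)` result straight into the argument list, variables that are not set end up there as `None`. One way to make the short local invocation above cleaner is to add a flag only when its variable is set. This is a minimal sketch, not part of the original script; the helper name `pipeline_args_from_env` and the `ENV_TO_FLAG` table are assumptions made for illustration, and the flags are the ones the original script already uses.

```python
import os

# Environment variable name -> pipeline flag, mirroring the original script.
ENV_TO_FLAG = {
    "PROJECT": "--project",
    "REGION": "--region",
    "TEMP_LOCATION": "--temp_location",
    "SERVICE_ACCOUNT_EMAIL": "--service_account_email",
    "NETWORK": "--network",
    "SUBNETWORK": "--subnetwork",
    "NUM_WORKERS": "--num_workers",
}


def pipeline_args_from_env(environ=None):
    """Build ['--flag', 'value', ...] from variables that are set and non-empty."""
    environ = os.environ if environ is None else environ
    # Fall back to DirectRunner when RUNNER is missing, as the original does.
    args = ["--runner", environ.get("RUNNER") or "DirectRunner"]
    for env_name, flag in ENV_TO_FLAG.items():
        value = environ.get(env_name)
        if value:
            args.extend([flag, value])
    return args
```

The result could then be passed as the second argument to `run(...)` in place of the hard-coded list, so the same script serves both the full DataflowRunner invocation and the minimal DirectRunner one.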
