
Dataflow error providing Pub/Sub topic as argument

I have a problem: I am creating a Dataflow template using Python, and the template needs to accept three user-defined arguments when a new Dataflow job is launched.

The problem arises in beam.io.gcp.pubsub.WriteToPubSub(), where I try to supply the topic name from a ValueProvider, which according to the Google documentation is required when creating a template:

https://cloud.google.com/dataflow/docs/guides/templates/creating-templates

The source beam.io.ReadFromPubSub() successfully accepts a ValueProvider for the subscription value, as does the transform beam.io.gcp.bigquery.WriteToBigQuery() for the table name.

Obviously sharing my code will help :)

First, the usual imports:

from __future__ import absolute_import

import argparse
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions
from apache_beam.options.pipeline_options import StandardOptions
from apache_beam.options.value_provider import StaticValueProvider
import json
import time
from datetime import datetime
import dateutil.parser
import sys

Next, the class I defined for the input arguments supplied to the template:

class userOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument(
            '--subscription',  
            default='projects/MYPROJECT/subscriptions/subscription',
            help='PubSub subscription to listen on')
        parser.add_value_provider_argument(
            '--bqtable', 
            default='dataset.table', 
            help='Big Query Table Name in the format project:dataset.table') 
        parser.add_value_provider_argument(
            '--topic',  
            default='projects/MYPROJECT/topics/subscription',
            help='PubSub topic to write failed messages to')

The pipeline itself is defined as follows (note that I have omitted the map functions):

def run():

    user_options = PipelineOptions().view_as(userOptions)

    pipeline_options = PipelineOptions()
    pipeline_options.view_as(SetupOptions).save_main_session = True
    pipeline_options.view_as(StandardOptions).streaming = True

    with beam.Pipeline(options=pipeline_options) as p:

        records = ( 
        p  | 'Read from PubSub' 
            >> beam.io.ReadFromPubSub(
                subscription=str(user_options.subscription),
                id_label='Message_ID',
                with_attributes=True)
        | 'Format Message' >> 
            beam.Map(format_message_element)
        | 'Transform null records to empty list' >>
            beam.Map(transform_null_records)
        | 'Transform Dates' >>
            beam.Map(format_dates)
        | 'Write to Big Query' >>
            beam.io.gcp.bigquery.WriteToBigQuery(
                table=user_options.bqtable,
                create_disposition='CREATE_IF_NEEDED',
                write_disposition='WRITE_APPEND',
                insert_retry_strategy='RETRY_NEVER'
            )
        | 'Write Failures to Pub Sub' >>
            beam.io.gcp.pubsub.WriteToPubSub(user_options.topic)
        ) 

Now, when I try to generate the template with this PowerShell command:

python profiles-pipeline.py --project xxxx-xxxxxx-xxxx `
--subscription projects/xxxx-xxxxxx-xxxx/subscriptions/sub-xxxx-xxxxxx-xxxx-dataflow `
--bqtable xxxx-xxxxxx-xxxx:dataset.table `
--topic projects/xxxx-xxxxxx-xxxx/topics/top-xxxx-xxxxxx-xxxx-failures `
--runner DataflowRunner `
--temp_location gs://xxxx-xxxxxx-xxxx/temp/ `
--staging_location gs://xxxx-xxxxxx-xxxx/staging/ `
--template_location gs://xxxx-xxxxxx-xxxx/template
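
For reference, the intention is that these ValueProvider arguments are supplied later, when a job is launched from the template, with something like the following (illustrative only; the job name and parameter values are placeholders):

gcloud dataflow jobs run profiles-job `
--gcs-location gs://xxxx-xxxxxx-xxxx/template `
--parameters subscription=projects/xxxx-xxxxxx-xxxx/subscriptions/sub-xxxx,bqtable=xxxx-xxxxxx-xxxx:dataset.table,topic=projects/xxxx-xxxxxx-xxxx/topics/top-xxxx-failures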

I get this error:

File "pipeline.py", line 193, in <module>
    run()
  File "pipeline.py", line 183, in run
    beam.io.gcp.pubsub.WriteToPubSub(user_options.topic)
  File "C:\github\pipeline-dataflow-jobs\dataflow\lib\site-packages\apache_beam\io\gcp\pubsub.py", line 292, in __init__
    topic, id_label, with_attributes, timestamp_attribute)
  File "C:\github\pipeline-dataflow-jobs\dataflow\lib\site-packages\apache_beam\io\gcp\pubsub.py", line 430, in __init__
    self.project, self.topic_name = parse_topic(topic)
  File "C:\github\pipeline-dataflow-jobs\dataflow\lib\site-packages\apache_beam\io\gcp\pubsub.py", line 325, in parse_topic
    match = re.match(TOPIC_REGEXP, full_topic)
  File "c:\program files\python37\lib\re.py", line 173, in match
    return _compile(pattern, flags).match(string)
TypeError: expected string or bytes-like object

I had this error before when trying to use beam.io.WriteToBigQuery(), but once I changed to beam.io.gcp.bigquery.WriteToBigQuery() the error was resolved, because that transform accepts a ValueProvider for the table name. As the traceback shows, WriteToPubSub parses the topic string in its constructor, at template-build time, when the ValueProvider does not yet hold a string. For Pub/Sub I cannot find a write method that works.

Any help is much appreciated.

I have partially solved this problem, as my pipeline was handling the failed BigQuery inserts incorrectly; however, I am still stuck on being unable to pass the Pub/Sub topic name as an input argument. It does work if the topic name is hard-coded, though.

#################################################################
# Import the libraries required by the pipeline                 #
#################################################################
from __future__ import absolute_import

import argparse
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions
from apache_beam.options.pipeline_options import StandardOptions
from apache_beam.options.value_provider import RuntimeValueProvider
import json
import time
from datetime import datetime
import dateutil.parser
import sys
import logging

#################################################################
# Create a class for the user defined settings provided at job 
# creation
#################################################################

class userOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument(
            '--subscription',  
            default='projects/MYPROJECT/subscriptions/subscription',
            help='PubSub subscription to listen on')
        parser.add_value_provider_argument(
            '--bqtable', 
            default='dataset.table', 
            help='Big Query Table Name in the format project:dataset.table') 
        parser.add_value_provider_argument(
            '--topic', 
            default='projects/MYPROJECT/topics/subscription', 
            help='Pubsub topic to write failures to') 

##############################################################################
# Format failure message
##############################################################################
def format_failed_message(data):
    try:
        message = json.dumps(data)
    except:
        print("customError in function format_failed_message occurred.", sys.exc_info(), "Message contents: ", data)
        # Fall back to the raw representation so the return below cannot fail.
        message = str(data)
    return message

#################################################################
# create a function called run                                  #
#################################################################
def run():

    ##############################################################
    # Setup the pipeline options with both passed in arguments 
    # and streaming options
    ##############################################################
    user_options = PipelineOptions().view_as(userOptions)

    pipeline_options = PipelineOptions()
    pipeline_options.view_as(SetupOptions).save_main_session = True
    pipeline_options.view_as(StandardOptions).streaming = True

    ##############################################################
    # Define the pipeline
    ##############################################################
    with beam.Pipeline(options=pipeline_options) as p:

        # First we create a PCollection which will contain the messages read from Pubsub
        records = ( 
        p  | 'Read from PubSub' 
            >> beam.io.ReadFromPubSub(
                subscription=str(user_options.subscription),
                id_label='Message_ID',
                with_attributes=True)
        # Transform the message and its attributes to a dict.
        | 'Format Message' >> 
            beam.Map(format_message_element)
        # Transform the empty arrays defined as element:null to element:[].
        | 'Transform null records to empty list' >>
            beam.Map(transform_null_records)
        # Transform the dateCreated and DateModified to a big query compatible timestamp format.
        | 'Transform Dates' >>
            beam.Map(format_dates)
        # Attempt to write the rows to BQ
        | 'Write to Big Query' >>
            beam.io.gcp.bigquery.WriteToBigQuery(
                table=user_options.bqtable,
                create_disposition='CREATE_IF_NEEDED',
                write_disposition='WRITE_APPEND',
                insert_retry_strategy='RETRY_NEVER'
            )
        )

        #For any rows that failed to write to BQ
        failed_data = (records[beam.io.gcp.bigquery.BigQueryWriteFn.FAILED_ROWS]
                        #Format the dictionary to a string
                        | 'Format the dictionary as a string for publishing' >>
                            beam.Map(format_failed_message)
                        #Encode the string to utf8 bytes
                        | 'Encode the message' >>
                            beam.Map(lambda x: x.encode('utf-8')).with_output_types(bytes)
                        )
        # Publish the failed rows to Pub/Sub. This works with the topic hard-coded:
        failed_data | beam.io.gcp.pubsub.WriteToPubSub(topic='projects/xxxx-xxxxx-xxxxxx/topics/top-xxxxx-failures')
        # ...but passing the ValueProvider fails at template creation:
        #failed_data | beam.io.gcp.pubsub.WriteToPubSub(topic=user_options.topic)

    # As this is a streaming pipeline, the with-block above runs it
    # continuously until we stop it or it fails; there is no need to
    # call p.run() again here.

#At the main entry point call the run function
if __name__ == '__main__':
    #logging.getLogger().setLevel(logging.INFO)
    run()

The following works for me:

| 'Encode bytestring' >> beam.Map(encode_byte_string) # I guess you have already implemented this part
| 'Write to pubsub' >> beam.io.WriteToPubSub(output_topic)
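
If the topic has to come from a ValueProvider, one possible workaround is to defer resolving it until the job is actually running, since the traceback shows that WriteToPubSub parses the topic string in its constructor. Below is a minimal, untested sketch assuming the google-cloud-pubsub client library is available on the workers; PublishToPubSub is a hypothetical name, not an Apache Beam API:

import apache_beam as beam
from google.cloud import pubsub_v1

class PublishToPubSub(beam.DoFn):
    # Hypothetical DoFn that resolves a topic ValueProvider at runtime.
    def __init__(self, topic):
        # topic is a ValueProvider; .get() must not be called here, because
        # at template-creation time it does not hold a value yet.
        self.topic = topic
        self.publisher = None

    def setup(self):
        # The client is created per worker rather than in __init__,
        # because DoFn instances are pickled for distribution.
        self.publisher = pubsub_v1.PublisherClient()

    def process(self, element):
        # .get() is safe here: process() only runs on the workers at job time.
        # element is expected to be utf-8 encoded bytes.
        self.publisher.publish(self.topic.get(), element)

# Usage, replacing the WriteToPubSub step:
# failed_data | 'Publish failures' >> beam.ParDo(PublishToPubSub(user_options.topic))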

