Apache Beam TypeError with Cloud Pub/Sub

Question

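I want to watch for new files in a Cloud Storage bucket using Cloud Pub/Sub. After analyzing each file, I want to save the result to another Cloud Storage bucket. From that bucket, I will send the files to BigQuery using another Pub/Sub topic and a Dataflow-provided template.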

When I run the code, I get the following error:

Traceback (most recent call last):
  File "SentAnal.py", line 71, in <module>
    "Splitting_Elements_of_Text" >> beam.ParDo(Split()) |
  File "C:\Python27\lib\site-packages\apache_beam\io\gcp\pubsub.py", line 141, in __init__
    timestamp_attribute=timestamp_attribute)
  File "C:\Python27\lib\site-packages\apache_beam\io\gcp\pubsub.py", line 262, in __init__
    self.project, self.topic_name = parse_topic(topic)
  File "C:\Python27\lib\site-packages\apache_beam\io\gcp\pubsub.py", line 209, in parse_topic
    match = re.match(TOPIC_REGEXP, full_topic)
  File "C:\Python27\lib\re.py", line 141, in match
    return _compile(pattern, flags).match(string)
TypeError: expected string or buffer

Here is my code:

from __future__ import absolute_import
import os
import logging
from google.cloud import language
from google.cloud.language import enums
from google.cloud.language import types
from datetime import datetime
import apache_beam as beam 
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions
from apache_beam.options.pipeline_options import GoogleCloudOptions
from apache_beam.options.pipeline_options import StandardOptions
from apache_beam.io.textio import ReadFromText, WriteToText

dataflow_options = ['--project=*********','--job_name=*******','--temp_location=gs://**********/temp','--setup_file=./setup.py']
dataflow_options.append('--staging_location=gs://********/stage')
dataflow_options.append('--requirements_file ./requirements.txt')
options=PipelineOptions(dataflow_options)
gcloud_options=options.view_as(GoogleCloudOptions)


# Dataflow runner
options.view_as(StandardOptions).runner = 'DataflowRunner'
options.view_as(SetupOptions).save_main_session = True

class UserOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        source_date=datetime.now().strftime("%Y%m%d-%H%M%S")
        parser.add_value_provider_argument('--input_topic',help=('Input PubSub topic of the form '
        '"projects/*****/topics/*****".'))
        parser.add_value_provider_argument('--output_topic',help=('Input PubSub topic of the form '
        '"projects/**********/topics/*******".'))

class Split(beam.DoFn):
    def process(self,element):
        element = element.rstrip("\n").encode('utf-8')
        text = element.split(',') 
        result = []
        for i in range(len(text)):
            dat = text[i]
            #print(dat)
            client = language.LanguageServiceClient()
            document = types.Document(content=dat,type=enums.Document.Type.PLAIN_TEXT)
            sent_analysis = client.analyze_sentiment(document=document)
            sentiment = sent_analysis.document_sentiment
            data = [
            (dat,sentiment.score)
            ] 
            result.append(data)
        return result

class WriteToCSV(beam.DoFn):
    def process(self, element):
        return [
            "{},{}".format(
                element[0][0],
                element[0][1]
            )
        ]
user_options = options.view_as(UserOptions)

with beam.Pipeline(options=options) as p:
    rows = (p
         | beam.io.ReadFromPubSub(topic=user_options.input_topic)
                .with_output_types(bytes) |
        "Splitting_Elements_of_Text" >> beam.ParDo(Split()) |
           beam.io.WriteToPubSub(topic=user_options.output_topic)
    )

Answer 1

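The problem is that you are reading bytes from PubSub and then trying to use a regular expression on the bytes elements. You first need to convert them to some kind of string element.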
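The last frames of the traceback show where the TypeError is raised: parse_topic hands its topic argument to re.match. As a minimal sketch of that failure mode, the snippet below uses a simplified pattern and a hypothetical DeferredValue class (neither is taken from Beam; both are assumptions for illustration) to show that re.match rejects a value that is not a string, and that the same call succeeds once a plain string is supplied.

```python
import re

# Illustrative pattern only (an assumption); the real TOPIC_REGEXP is
# defined in apache_beam/io/gcp/pubsub.py.
TOPIC_PATTERN = r"projects/([^/]+)/topics/(.+)"


class DeferredValue(object):
    """Hypothetical stand-in for an option object that wraps a string."""

    def __init__(self, value):
        self._value = value

    def get(self):
        # Return the wrapped plain string.
        return self._value


def parse_topic(full_topic):
    """Split a full topic path into (project, topic_name)."""
    match = re.match(TOPIC_PATTERN, full_topic)
    if match is None:
        raise ValueError("Unexpected topic format: %r" % (full_topic,))
    return match.group(1), match.group(2)


plain = "projects/my-project/topics/my-topic"
print(parse_topic(plain))

deferred = DeferredValue(plain)
try:
    # Passing the wrapper object itself makes re.match raise TypeError.
    parse_topic(deferred)
except TypeError as exc:
    print("TypeError: %s" % exc)

# Unwrapping to a plain string first makes the same call succeed.
print(parse_topic(deferred.get()))
```

If the value passed as topic in your pipeline is an object rather than a plain string, the same TypeError would be expected, so it may be worth checking the type of that argument as well.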

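If you look at the streaming word count example, specifically the file streaming_wordcount.py, you will see that it decodes the bytes read from PubSub into a unicode string, as follows: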

messages = (p
            | beam.io.ReadFromPubSub(
                subscription=known_args.input_subscription)
            .with_output_types(bytes))

lines = messages | 'decode' >> beam.Map(lambda x: x.decode('utf-8'))

The example then performs further text processing on the decoded lines.
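The decode step can be tried outside of a pipeline. The sketch below is plain Python with hypothetical helper names (decode_message, split_fields and encode_result are not part of Beam): it decodes a bytes payload, splits it on commas the way the Split DoFn in the question does, and encodes a result back to bytes. The final encode assumes that the output transform expects bytes payloads.

```python
def decode_message(payload):
    """Decode a raw bytes payload into a text string."""
    return payload.decode("utf-8")


def split_fields(line):
    """Split a decoded line on commas, dropping empty fields."""
    fields = line.rstrip("\n").split(",")
    return [field.strip() for field in fields if field.strip()]


def encode_result(text, score):
    """Format a (text, score) pair as CSV and encode it back to bytes."""
    return u"{},{}".format(text, score).encode("utf-8")


payload = b"I love this,I hate that\n"
line = decode_message(payload)
for field in split_fields(line):
    # 0.0 is a placeholder score; the sentiment call is omitted here.
    print(encode_result(field, 0.0))
```

Inside a pipeline, each helper could be applied with beam.Map, in the same way the example above applies its decode lambda.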
