
Google Cloud Dataflow - Pyarrow schema from PubSub message

I'm trying to write Google PubSub messages to Google Cloud Storage using Google Cloud Dataflow (Python SDK). The messages arrive in PubSub in JSON format, and I have to define a schema in order to write them to Google Cloud Storage in Parquet format.

As suggested by other users, I started working on this task by looking in particular at this and this source.
The first one is not exactly what I want to do, because it applies changes to the JSON files (it merges them through a window, puts the original JSON into a "message" field and adds a timestamp representing the publication time).
The second source (source code here) fits the use case better. Specifically, a schema is automatically defined from data extracted from a BigQuery table, and the results are then written back to Google Cloud Storage in Parquet format.
Does anyone know if it is possible to do the same, more precisely to automatically define a schema with pyarrow by reading JSON messages from PubSub? If it is possible, how can I do it?
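
To show what I mean, this is roughly what I imagine for the `get_parquet_schema` placeholder that is commented out in the code below: pull one sample message from a subscription (pulling works on a subscription, not on the topic itself) and let pyarrow infer a type for each field. This is only an untested sketch, and it assumes every message shares the flat structure of the sample:

    import json
    import pyarrow
    from google.cloud import pubsub_v1

    def get_parquet_schema(subscription_path):
        # Pull a single sample message from the subscription
        # (it is not acked here, so it will be redelivered later).
        subscriber = pubsub_v1.SubscriberClient()
        response = subscriber.pull(subscription=subscription_path, max_messages=1)
        sample = json.loads(response.received_messages[0].message.data)
        # Wrap each value in a list so pyarrow can infer one type per column.
        table = pyarrow.Table.from_pydict({k: [v] for k, v in sample.items()})
        return table.schema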

This is what I've done so far. If I try to run it, some parquet files are generated (they contain the column names I specified through the pyarrow schema, but they have no values), and several errors are raised in the Dataflow console (see one example below). In addition, if only one JSON file arrives in PubSub (which should be converted to a parquet file with only one row), I don't understand why so many parquet files are generated (more than 10 if I leave the job running for a couple of minutes).


    import argparse
    import logging
    import pyarrow
    
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    
    def run(input_topic, output_path, pipeline_args=None):
        # TODO - Dynamic parquet_schema definition
        # input_topic = known_args.input
        # parquet_schema = get_parquet_schema(input_topic)
    
        parquet_schema = pyarrow.schema(
            [('Attr1', pyarrow.string()), ('Attr2', pyarrow.string()),
             ('Attr3', pyarrow.string()), ('Attr4', pyarrow.string()),
             ('Attr5', pyarrow.string()), ('Attr6', pyarrow.string())
             ]
        )
    
        # instantiate a pipeline with all the pipeline option
        pipeline_options = PipelineOptions(pipeline_args, streaming=True)
    
        # processing and structure of pipeline
        with beam.Pipeline(options=pipeline_options) as pipeline:
            (
                pipeline
                | 'Input: Read PubSub Messages' >> beam.io.ReadFromPubSub(topic=input_topic)
                | 'Output: Export to Parquet' >> beam.io.parquetio.WriteToParquet(
                    file_path_prefix=output_path,
                    schema=parquet_schema,
                    file_name_suffix='.parquet')
            )
    
    
    if __name__ == '__main__':
        logging.getLogger().setLevel(logging.INFO)
    
        parser = argparse.ArgumentParser()
        parser.add_argument('--input_topic',
                            help='input pubsub topic to read data.',)
        parser.add_argument('--output_path',
                            help='gcs output location for parquet files.',)
        known_args, pipeline_args = parser.parse_known_args()
    
        run(
            known_args.input_topic,
            known_args.output_path,
            pipeline_args,
        )

This is the error generated by Dataflow:


    Error message from worker: 
    java.util.concurrent.ExecutionException: java.lang.RuntimeException: Error received from SDK harness for instruction -1018: Traceback (most recent call last): 
    File "apache_beam/runners/common.py", line 961, in apache_beam.runners.common.DoFnRunner.process 
    File "apache_beam/runners/common.py", line 726, in apache_beam.runners.common.PerWindowInvoker.invoke_process 
    File "apache_beam/runners/common.py", line 814, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window 
    File "/usr/local/lib/python3.7/site-packages/apache_beam/io/iobase.py", line 1061, in process self.writer.write(element) 
    File "/usr/local/lib/python3.7/site-packages/apache_beam/io/filebasedsink.py", line 420, in write self.sink.write_record(self.temp_handle, value) 
    File "/usr/local/lib/python3.7/site-packages/apache_beam/io/parquetio.py", line 534, in write_record self._buffer[i].append(value[n]) 
    TypeError: byte indices must be integers or slices, not str 
    
    During handling of the above exception, another exception occurred: 
    
    Traceback (most recent call last): 
    File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 245, in _execute response = task() 
    File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 302, in <lambda> lambda: self.create_worker().do_instruction(request), request) 
    File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 471, in do_instruction getattr(request, request_type), request.instruction_id) 
    File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 506, in process_bundle bundle_processor.process_bundle(instruction_id)) 
    File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/bundle_processor.py", line 972, in process_bundle element.data) 
    File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/bundle_processor.py", line 218, in process_encoded self.output(decoded_value) 
    File "apache_beam/runners/worker/operations.py", line 330, in apache_beam.runners.worker.operations.Operation.output 
    File "apache_beam/runners/worker/operations.py", line 332, in apache_beam.runners.worker.operations.Operation.output 
    File "apache_beam/runners/worker/operations.py", line 195, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive 
    File "apache_beam/runners/worker/operations.py", line 670, in apache_beam.runners.worker.operations.DoOperation.process 
    File "apache_beam/runners/worker/operations.py", line 671, in apache_beam.runners.worker.operations.DoOperation.process 
    File "apache_beam/runners/common.py", line 963, in apache_beam.runners.common.DoFnRunner.process 
    File "apache_beam/runners/common.py", line 1045, in apache_beam.runners.common.DoFnRunner._reraise_augmented 
    File "/usr/local/lib/python3.7/site-packages/future/utils/__init__.py", line 446, in raise_with_traceback raise exc.with_traceback(traceback) 
    File "apache_beam/runners/common.py", line 961, in apache_beam.runners.common.DoFnRunner.process 
    File "apache_beam/runners/common.py", line 726, in apache_beam.runners.common.PerWindowInvoker.invoke_process 
    File "apache_beam/runners/common.py", line 814, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window 
    File "/usr/local/lib/python3.7/site-packages/apache_beam/io/iobase.py", line 1061, in process self.writer.write(element) 
    File "/usr/local/lib/python3.7/site-packages/apache_beam/io/filebasedsink.py", line 420, in write self.sink.write_record(self.temp_handle, value) 
    File "/usr/local/lib/python3.7/site-packages/apache_beam/io/parquetio.py", line 534, in write_record self._buffer[i].append(value[n]) 
    TypeError: byte indices must be integers or slices, not str [while running 'generatedPtransform-1004'] 
    
    java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) 
    java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) 
    org.apache.beam.sdk.util.MoreFutures.get(MoreFutures.java:57) 
    org.apache.beam.runners.dataflow.worker.fn.control.RegisterAndProcessBundleOperation.finish(RegisterAndProcessBundleOperation.java:333) 
    org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:85) 
    org.apache.beam.runners.dataflow.worker.fn.control.BeamFnMapTaskExecutor.execute(BeamFnMapTaskExecutor.java:123) 
    org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1369) 
    org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.access$1100(StreamingDataflowWorker.java:154) 
    org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$7.run(StreamingDataflowWorker.java:1088) 
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) java.lang.Thread.run(Thread.java:748) 
    Caused by: java.lang.RuntimeException: Error received from SDK harness for instruction -1018: Traceback (most recent call last): 
    File "apache_beam/runners/common.py", line 961, in apache_beam.runners.common.DoFnRunner.process 
    File "apache_beam/runners/common.py", line 726, in apache_beam.runners.common.PerWindowInvoker.invoke_process 
    File "apache_beam/runners/common.py", line 814, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window 
    File "/usr/local/lib/python3.7/site-packages/apache_beam/io/iobase.py", line 1061, in process self.writer.write(element) 
    File "/usr/local/lib/python3.7/site-packages/apache_beam/io/
    filebasedsink.py", line 420, in write self.sink.write_record(self.temp_handle, value) 
    File "/usr/local/lib/python3.7/site-packages/apache_beam/io/parquetio.py", line 534, in write_record self._buffer[i].append(value[n]) 
    TypeError: byte indices must be integers or slices, not str 
    
    During handling of the above exception, another exception occurred: 
    
    Traceback (most recent call last): 
    File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 245, in _execute response = task() 
    File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 302, in <lambda> lambda: self.create_worker().do_instruction(request), request) 
    File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 471, in do_instruction getattr(request, request_type), request.instruction_id) 
    File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 506, in process_bundle bundle_processor.process_bundle(instruction_id)) 
    File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/bundle_processor.py", line 972, in process_bundle element.data) 
    File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/bundle_processor.py", line 218, in process_encoded self.output(decoded_value) 
    File "apache_beam/runners/worker/operations.py", line 330, in apache_beam.runners.worker.operations.Operation.output 
    File "apache_beam/runners/worker/operations.py", line 332, in apache_beam.runners.worker.operations.Operation.output 
    File "apache_beam/runners/worker/operations.py", line 195, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive 
    File "apache_beam/runners/worker/operations.py", line 670, in apache_beam.runners.worker.operations.DoOperation.process 
    File "apache_beam/runners/worker/operations.py", line 671, in apache_beam.runners.worker.operations.DoOperation.process 
    File "apache_beam/runners/common.py", line 963, in apache_beam.runners.common.DoFnRunner.process 
    File "apache_beam/runners/common.py", line 1045, in apache_beam.runners.common.DoFnRunner._reraise_augmented 
    File "/usr/local/lib/python3.7/site-packages/future/utils/__init__.py", line 446, in raise_with_traceback raise exc.with_traceback(traceback) 
    File "apache_beam/runners/common.py", line 961, in apache_beam.runners.common.DoFnRunner.process 
    File "apache_beam/runners/common.py", line 726, in apache_beam.runners.common.PerWindowInvoker.invoke_process 
    File "apache_beam/runners/common.py", line 814, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window 
    File "/usr/local/lib/python3.7/site-packages/apache_beam/io/iobase.py", line 1061, in process self.writer.write(element) 
    File "/usr/local/lib/python3.7/site-packages/apache_beam/io/
    filebasedsink.py", line 420, in write self.sink.write_record(self.temp_handle, value) 
    File "/usr/local/lib/python3.7/site-packages/apache_beam/io/parquetio.py", line 534, in write_record self._buffer[i].append(value[n]) 
    TypeError: byte indices must be integers or slices, not str [while running 'generatedPtransform-1004'] 
    
    org.apache.beam.runners.fnexecution.control.FnApiControlClient$ResponseStreamObserver.onNext(FnApiControlClient.java:177) 
    org.apache.beam.runners.fnexecution.control.FnApiControlClient$ResponseStreamObserver.onNext(FnApiControlClient.java:157) 
    org.apache.beam.vendor.grpc.v1p26p0.io.grpc.stub.ServerCalls$StreamingServerCallHandler$StreamingServerCallListener.onMessage(ServerCalls.java:251) 
    org.apache.beam.vendor.grpc.v1p26p0.io.grpc.ForwardingServerCallListener.onMessage(ForwardingServerCallListener.java:33) 
    org.apache.beam.vendor.grpc.v1p26p0.io.grpc.Contexts$ContextualizedServerCallListener.onMessage(Contexts.java:76) 
    org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailableInternal(ServerCallImpl.java:309) 
    org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailable(ServerCallImpl.java:292) 
    org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1MessagesAvailable.runInContext(ServerImpl.java:782) 
    org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) 
    org.apache.beam.vendor.grpc.v1p26p0.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123) 
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) java.lang.Thread.run(Thread.java:748)

Sorry this gave you such an ugly error message! This looks like exactly the kind of error we'll be able to catch early once more transforms have typing support (see https://beam.apache.org/blog/python-typing/ for more info).

The ParquetIO sink expects an input PCollection with dictionary elements, but the PubSub source outputs a PCollection with bytes elements. You'll need to add a transform that parses the payload bytes and converts them to a dictionary. Something like this:

(pipeline
  | 'Input: Read PubSub Messages' >> beam.io.ReadFromPubSub(topic=input_topic)
  | '*** Parse JSON -> dict ***' >> beam.Map(json.loads)
  | 'Output: Export to Parquet' >> beam.io.parquetio.WriteToParquet(
        file_path_prefix=output_path,
        schema=parquet_schema,
        file_name_suffix='.parquet')
)