
Is there a way to create a list of SpecificRecord in a ParDo transformation in Beam for writing Parquet files?

I am trying to write a Dataflow job in Beam/Java that processes a series of events coming from Pub/Sub and writes them to Parquet. The events in Pub/Sub are in JSON format, and every event can generate one or more rows. I was able to write a very simple example with a ParDo transformation that returns just one record. The ParDo looks like this:

    static class GenerateRecords extends DoFn<String, GenericRecord> {
        @ProcessElement
        public void processElement(ProcessContext context) {
            String msg = context.element();

            // Generated SpecificRecord classes extend SpecificRecordBase,
            // which implements GenericRecord, so this matches the output type.
            com.tsp.de.schema.mschema pRecord = GenerateParquetRecord(msg);

            context.output(pRecord);
        }
    }

and the write part of the pipeline:

                .apply("Write to file",
                FileIO.<GenericRecord>
                        write()
                        .via(
                                ParquetIO.sink(schema)
                                        .withCompressionCodec(CompressionCodecName.SNAPPY)
                        )
                        .to(options.getOutputDirectory())
                        .withNumShards(options.getNumShards())
                        .withSuffix("pfile")
                );

My question is, how do I generalize this ParDo transformation to return a list of records? I tried returning a List, but that does not work: ParquetIO.sink(schema) fails with "cannot resolve method via".

You can invoke context.output() in your DoFn as many times as you need. So, if your business logic determines the circumstances under which you need to emit several records, you just have to call context.output(record) for every output record. It should be simpler than having a PCollection of containers.
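A minimal sketch of the idea above (the schema handling, the "payload" field, and the splitEvent helper are assumptions for illustration, not from the original post): a DoFn that fans one incoming message out into several rows, calling context.output() once per row instead of returning a List.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.transforms.DoFn;

class GenerateRecords extends DoFn<String, GenericRecord> {
    // Store the schema as a JSON string: DoFn instances are serialized,
    // and Avro's Schema class is only Serializable in newer Avro versions.
    private final String schemaJson;
    private transient Schema schema;

    GenerateRecords(String schemaJson) {
        this.schemaJson = schemaJson;
    }

    @Setup
    public void setup() {
        schema = new Schema.Parser().parse(schemaJson);
    }

    // Placeholder for the real JSON parsing: here an "event" is just
    // newline-separated rows.
    static List<String> splitEvent(String msg) {
        return Arrays.asList(msg.split("\n"));
    }

    @ProcessElement
    public void processElement(ProcessContext context) {
        for (String row : splitEvent(context.element())) {
            GenericRecord record = new GenericData.Record(schema);
            record.put("payload", row); // assumes a "payload" field in the schema
            context.output(record);     // one output() call per row; no List needed
        }
    }
}
```

The downstream FileIO/ParquetIO write from the question stays unchanged: Beam collects the repeated output() calls into a single PCollection<GenericRecord>.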

PS: Btw, I have a simple example of how to write GenericRecords with ParquetIO and AvroCoder that perhaps could be helpful.
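The example linked in the PS is not reproduced on this page; as a hedged sketch of the idea (the schema, class name, and inline DoFn are all assumptions), the key step is attaching an AvroCoder to the PCollection, since GenericRecord has no default coder Beam can infer:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.AvroCoder; // in Beam >= 2.46 this lives in extensions-avro
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class AvroCoderSketch {
    // Hypothetical one-field schema kept as a JSON string, since DoFns
    // must serialize and older Avro Schema objects are not Serializable.
    static final String SCHEMA_JSON =
        "{\"type\":\"record\",\"name\":\"Row\","
        + "\"fields\":[{\"name\":\"payload\",\"type\":\"string\"}]}";

    public static void main(String[] args) {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
        Pipeline p = Pipeline.create();

        PCollection<GenericRecord> records = p
            .apply(Create.of("a", "b"))
            .apply(ParDo.of(new DoFn<String, GenericRecord>() {
                @ProcessElement
                public void processElement(ProcessContext c) {
                    // Re-parse inside the DoFn to stay serializable.
                    Schema s = new Schema.Parser().parse(SCHEMA_JSON);
                    GenericRecord r = new GenericData.Record(s);
                    r.put("payload", c.element());
                    c.output(r);
                }
            }))
            // GenericRecord has no default coder, so set one explicitly.
            .setCoder(AvroCoder.of(schema));

        // records can now feed FileIO.write().via(ParquetIO.sink(schema)) as above
        p.run().waitUntilFinish();
    }
}
```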
