
Dataflow writing a pCollection of GenericRecords to Parquet files

In an Apache Beam step I have a PCollection of KV<String, Iterable<KV<Long, GenericRecord>>>. I want to write all the records in the Iterable to the same Parquet file. My code snippet is given below:

p.apply(ParDo.of(new MapWithAvroSchemaAndConvertToGenericRecord())) // PCollection<GenericRecord>
.apply(ParDo.of(new MapKafkaGenericRecordValue(formatter, options.getFileNameDelimiter()))) //PCollection<KV<String, KV<Long, GenericRecord>>>
.apply(GroupByKey.create()) //PCollection<KV<String, Iterable<KV<Long, GenericRecord>>>>

Now I want to write all the records in the Iterable to the same Parquet file (deriving the file name from the key of the KV).
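
For context, a minimal sketch of what the keying step (MapKafkaGenericRecordValue) could look like. The class name comes from the snippet above, but the body, the record field names, and the omission of the formatter argument are assumptions, shown only to illustrate how each record ends up keyed by its destination file name:

import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Hypothetical sketch: keys each GenericRecord by the Parquet file it should land in
// and pairs it with a Long (e.g. a Kafka offset). Field names are illustrative only.
class MapKafkaGenericRecordValue
        extends DoFn<GenericRecord, KV<String, KV<Long, GenericRecord>>> {

    private final String fileNameDelimiter;

    MapKafkaGenericRecordValue(String fileNameDelimiter) {
        this.fileNameDelimiter = fileNameDelimiter;
    }

    @ProcessElement
    public void processElement(
            @Element GenericRecord record,
            OutputReceiver<KV<String, KV<Long, GenericRecord>>> out) {
        // Assumed: the destination file name is derived from fields of the record.
        String fileName = record.get("topic") + fileNameDelimiter + record.get("date");
        long offset = (Long) record.get("offset"); // assumed field
        out.output(KV.of(fileName, KV.of(offset, record)));
    }
}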

I found the solution to the problem, at this step:

apply(GroupByKey.create()) //PCollection<KV<String, Iterable<KV<Long, GenericRecord>>>>

I apply another transform that returns only the Iterable as the output PCollection: `.apply(ParDo.of(new GetIterable())) //PCollection<Iterable<KV<String, GenericRecord>>>`, where the key is the name of the file I have to write to (a sketch of GetIterable is given after the snippet below). The remaining snippet is then:

.apply(Flatten.iterables())
.apply(
        FileIO.<String, KV<String, GenericRecord>>writeDynamic()
                .by((SerializableFunction<KV<String, GenericRecord>, String>) KV::getKey)
                .via(
                        Contextful.fn(
                                (SerializableFunction<KV<String, GenericRecord>, GenericRecord>) KV::getValue),
                        ParquetIO.sink(schema)
                                .withCompressionCodec(CompressionCodecName.SNAPPY))
                .withTempDirectory("/tmp/temp-beam")
                .to(options.getGCSBucketUrl())
                .withNumShards(1)
                .withDestinationCoder(StringUtf8Coder.of()));
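
The GetIterable transform referenced above is not shown in the original snippet; here is a minimal sketch of what it could look like, assuming it simply drops the Long and re-keys every record in the Iterable by the destination file name (the class body is my own illustration, not the original code):

import java.util.ArrayList;
import java.util.List;

import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Hypothetical sketch: turns each grouped element into an Iterable of
// (fileName, record) pairs so that the downstream Flatten.iterables() yields a
// PCollection<KV<String, GenericRecord>> for FileIO.writeDynamic().
class GetIterable
        extends DoFn<KV<String, Iterable<KV<Long, GenericRecord>>>,
                     Iterable<KV<String, GenericRecord>>> {

    @ProcessElement
    public void processElement(
            @Element KV<String, Iterable<KV<Long, GenericRecord>>> element,
            OutputReceiver<Iterable<KV<String, GenericRecord>>> out) {
        String fileName = element.getKey();
        List<KV<String, GenericRecord>> records = new ArrayList<>();
        for (KV<Long, GenericRecord> value : element.getValue()) {
            records.add(KV.of(fileName, value.getValue()));
        }
        out.output(records);
    }
}

Two caveats, both assumptions on my part rather than part of the original answer: the intermediate PCollection may need an explicit coder (for example IterableCoder.of(KvCoder.of(StringUtf8Coder.of(), AvroCoder.of(schema)))), because a GenericRecord coder usually cannot be inferred; and depending on the Beam version, FileIO.writeDynamic() may also require a per-destination file-naming function, for example something like

.withNaming(key -> FileIO.Write.defaultNaming(key, ".parquet"))

added to the write (the ".parquet" suffix is a guess).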
