Dataflow: writing a PCollection of GenericRecords to Parquet files
In an Apache Beam step I have a PCollection of KV<String, Iterable<KV<Long, GenericRecord>>>. I want to write all the records in the iterable to the same Parquet file. My code snippet is given below:
p.apply(ParDo.of(new MapWithAvroSchemaAndConvertToGenericRecord())) // PCollection<GenericRecord>
.apply(ParDo.of(new MapKafkaGenericRecordValue(formatter, options.getFileNameDelimiter()))) //PCollection<KV<String, KV<Long, GenericRecord>>>
.apply(GroupByKey.create()) //PCollection<KV<String, Iterable<KV<Long, GenericRecord>>>>
Now I want to write all the records in the Iterable to the same Parquet file (deriving the file name from the key of the KV).
I found the solution to the problem. At the step
apply(GroupByKey.create()) //PCollection<KV<String, Iterable<KV<Long, GenericRecord>>>>
I will apply another transform that returns only the Iterable as the output PCollection, with every record re-keyed by the outer String key:
.apply(ParDo.of(new GetIterable())) //PCollection<Iterable<KV<String, GenericRecord>>>, where the key is the name of the file I have to write to.
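A minimal sketch of what GetIterable can look like (its body isn't shown above, so this is an assumption: it drops the Long key and tags every record with the outer String key so the write step can use it as the destination):

import java.util.ArrayList;
import java.util.List;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Assumed implementation: re-key each grouped record with the file name.
class GetIterable
    extends DoFn<KV<String, Iterable<KV<Long, GenericRecord>>>,
                 Iterable<KV<String, GenericRecord>>> {

  @ProcessElement
  public void processElement(ProcessContext c) {
    String fileName = c.element().getKey();
    List<KV<String, GenericRecord>> records = new ArrayList<>();
    for (KV<Long, GenericRecord> entry : c.element().getValue()) {
      records.add(KV.of(fileName, entry.getValue()));
    }
    c.output(records);
  }
}

Depending on the Beam version, the output coder may need to be set explicitly, e.g. IterableCoder.of(KvCoder.of(StringUtf8Coder.of(), AvroCoder.of(schema))).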
Then the remaining snippet is:
.apply(Flatten.iterables())
.apply(
FileIO.<String, KV<String, GenericRecord>>writeDynamic()
.by((SerializableFunction<KV<String, GenericRecord>, String>) KV::getKey)
.via(
Contextful.fn(
(SerializableFunction<KV<String, GenericRecord>, GenericRecord>) KV::getValue
),
ParquetIO.sink(schema)
.withCompressionCodec(CompressionCodecName.SNAPPY)
)
.withTempDirectory("/tmp/temp-beam")
.to(options.getGCSBucketUrl())
.withNaming(key -> FileIO.Write.defaultNaming(key, ".parquet")) // required by writeDynamic(); derives the file name from the key (the ".parquet" suffix is an assumption)
.withNumShards(1)
.withDestinationCoder(StringUtf8Coder.of())
)
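With .withNumShards(1), FileIO writes a single shard per destination key (per window/pane), so all records sharing a key, i.e. one grouped Iterable, end up in the same Parquet file. The schema passed to ParquetIO.sink(schema) is the Avro Schema shared by all the GenericRecords; one way to obtain it, assuming it is available as a JSON string (schemaJson is a placeholder):

Schema schema = new Schema.Parser().parse(schemaJson); // org.apache.avro.Schema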