
How do I write to multiple files in Apache Beam?

Let me simplify my case. I'm using Apache Beam 0.6.0. My final processed result is PCollection<KV<String, String>>. And I want to write values to different files corresponding to their keys.

For example, let's say the result consists of:

(key1, value1)
(key2, value2)
(key1, value3)
(key1, value4)

Then I want to write value1, value3 and value4 to key1.txt, and write value2 to key2.txt.

And in my case:

  • The key set is determined while the pipeline is running, not when constructing the pipeline.
  • The key set may be quite small, but the number of values corresponding to each key may be very, very large.

Any ideas?

Handily, I wrote a sample of this case just the other day.

This example is Dataflow 1.x style.

Basically you group by each key, and then you can do this with a custom transform that connects to Cloud Storage. The caveat is that your list of lines per file shouldn't be massive (it has to fit into memory on a single instance, but considering you can run high-mem instances, that limit is pretty high).

    ...
    PCollection<KV<String, List<String>>> readyToWrite = groupedByFirstLetter
        .apply(Combine.perKey(AccumulatorOfWords.getCombineFn()));
    readyToWrite.apply(
        new PTransformWriteToGCS("dataflow-experiment", TonyWordGrouper::derivePath));
    ...

And then the transform doing most of the work is:

public class PTransformWriteToGCS
    extends PTransform<PCollection<KV<String, List<String>>>, PCollection<Void>> {

    private static final Logger LOG = LoggerFactory.getLogger(PTransformWriteToGCS.class);

    private static final Storage STORAGE = StorageOptions.getDefaultInstance().getService();

    private final String bucketName;

    private final SerializableFunction<String, String> pathCreator;

    public PTransformWriteToGCS(final String bucketName,
        final SerializableFunction<String, String> pathCreator) {
        this.bucketName = bucketName;
        this.pathCreator = pathCreator;
    }

    @Override
    public PCollection<Void> apply(final PCollection<KV<String, List<String>>> input) {

        return input
            .apply(ParDo.of(new DoFn<KV<String, List<String>>, Void>() {

                @Override
                public void processElement(
                    final DoFn<KV<String, List<String>>, Void>.ProcessContext arg0)
                    throws Exception {
                    final String key = arg0.element().getKey();
                    final List<String> values = arg0.element().getValue();
                    final String toWrite = values.stream().collect(Collectors.joining("\n"));
                    final String path = pathCreator.apply(key);
                    BlobInfo blobInfo = BlobInfo.newBuilder(bucketName, path)
                        .setContentType(MimeTypes.TEXT)
                        .build();
                    LOG.info("blob writing to: {}", blobInfo);
                    Blob result = STORAGE.create(blobInfo,
                        toWrite.getBytes(StandardCharsets.UTF_8));
                }
            }));
    }
}
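
The snippet above also references AccumulatorOfWords.getCombineFn(), which isn't shown in the answer. A minimal sketch of such a combine function, assuming it simply collects every value of a key into a List (and that, per the caveat above, the per-key data fits in memory), might look like this; imports are omitted and the exact Combine class comes from whichever SDK (Dataflow 1.x or Beam) the surrounding code uses:

    // Hypothetical sketch of the referenced combine function: it just
    // accumulates all values for a key into a single List<String>.
    public class AccumulatorOfWords {

        public static Combine.CombineFn<String, List<String>, List<String>> getCombineFn() {
            return new Combine.CombineFn<String, List<String>, List<String>>() {

                @Override
                public List<String> createAccumulator() {
                    return new ArrayList<>();
                }

                @Override
                public List<String> addInput(final List<String> accumulator, final String input) {
                    accumulator.add(input);
                    return accumulator;
                }

                @Override
                public List<String> mergeAccumulators(final Iterable<List<String>> accumulators) {
                    final List<String> merged = new ArrayList<>();
                    for (final List<String> accumulator : accumulators) {
                        merged.addAll(accumulator);
                    }
                    return merged;
                }

                @Override
                public List<String> extractOutput(final List<String> accumulator) {
                    return accumulator;
                }
            };
        }
    }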

Just write a loop in a ParDo function! More details: I had the same scenario today; the only difference is that in my case key=image_label and value=image_tf_record. So, like what you have asked, I am trying to create separate TFRecord files, one per class, each record file containing a number of images. HOWEVER, I am not sure whether there might be memory issues when the number of values per key is very high, as in your scenario. (Also, my code is in Python.)

import apache_beam as beam
import tensorflow as tf


class WriteToSeparateTFRecordFiles(beam.DoFn):

    def __init__(self, outdir):
        self.outdir = outdir

    def process(self, element):
        l, image_list = element
        writer = tf.python_io.TFRecordWriter(self.outdir + "/tfr" + str(l) + '.tfrecord')
        for example in image_list:
            writer.write(example.SerializeToString())
        writer.close()

And then in your pipeline, just after the stage where you get the key-value pairs, add these two lines:

   (p
    | 'GroupByLabelId' >> beam.GroupByKey()
    | 'SaveToMultipleFiles' >> beam.ParDo(WriteToSeparateTFRecordFiles(output_dir))
    )

You can use FileIO.writeDynamic() for that:

PCollection<KV<String, String>> readfile = (something you read..);

readfile.apply(FileIO.<String, KV<String, String>>writeDynamic()
    .by(KV::getKey)
    .withDestinationCoder(StringUtf8Coder.of())
    .via(Contextful.fn(KV::getValue), TextIO.sink())
    .to("somefolder")
    .withNaming(key -> FileIO.Write.defaultNaming(key, ".txt")));

p.run();

In the Apache Beam 2.2 Java SDK, this is natively supported in TextIO and AvroIO, using TextIO.write().to(DynamicDestinations) and AvroIO.write().to(DynamicDestinations) respectively. See e.g. this method.

Update (2018): prefer to use FileIO.writeDynamic() together with TextIO.sink() and AvroIO.sink() instead.
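
For example, a minimal sketch of that FileIO.writeDynamic() + AvroIO.sink() combination could look like the following. This is only an illustration: UserEvent, its getUserId() accessor, and the output path are hypothetical placeholders that do not come from the answers above, and imports are omitted as in the other snippets.

    // Hedged sketch: write Avro records into one group of files per key.
    // UserEvent is an assumed class with an Avro-compatible schema.
    PCollection<UserEvent> events = ...; // whatever you read upstream

    events.apply(FileIO.<String, UserEvent>writeDynamic()
        .by(event -> event.getUserId())            // destination key derived from each element
        .via(AvroIO.sink(UserEvent.class))         // serialize each element as Avro
        .to("gs://some-bucket/events")             // hypothetical output prefix
        .withNaming(userId -> FileIO.Write.defaultNaming(userId, ".avro"))
        .withDestinationCoder(StringUtf8Coder.of()));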

Just write the lines below in your ParDo class:

    from apache_beam.io import filesystems

    eventCSVFileWriter = filesystems.FileSystems.create(gcsFileName)
    for record in list(Records):
        eventCSVFileWriter.write(record)

If you want the full code, I can help you with that too.
