
How do I write to multiple files in Apache Beam?

Let me simplify my case. I'm using Apache Beam 0.6.0. My final processed result is PCollection<KV<String, String>>. And I want to write values to different files corresponding to their keys.

For example, let's say the result consists of:

(key1, value1)
(key2, value2)
(key1, value3)
(key1, value4)

Then I want to write value1, value3 and value4 to key1.txt, and write value2 to key2.txt.

And in my case:

  • The key set is determined while the pipeline is running, not when constructing the pipeline.
  • The key set may be quite small, but the number of values corresponding to each key may be very, very large.

Any ideas?

Handily, I wrote a sample of this case just the other day.

This example is Dataflow 1.x style.

Basically you group by each key, and then you can do this with a custom transform that connects to Cloud Storage. The caveat is that your list of lines per file shouldn't be massive (it has to fit into memory on a single instance, but considering you can run high-mem instances, that limit is pretty high).

    ...
    PCollection<KV<String, List<String>>> readyToWrite = groupedByFirstLetter
        .apply(Combine.perKey(AccumulatorOfWords.getCombineFn()));
    readyToWrite.apply(
        new PTransformWriteToGCS("dataflow-experiment", TonyWordGrouper::derivePath));
    ...

And then the transform doing most of the work is:

public class PTransformWriteToGCS
    extends PTransform<PCollection<KV<String, List<String>>>, PCollection<Void>> {

    private static final Logger LOG = LoggerFactory.getLogger(PTransformWriteToGCS.class);

    private static final Storage STORAGE = StorageOptions.getDefaultInstance().getService();

    private final String bucketName;

    private final SerializableFunction<String, String> pathCreator;

    public PTransformWriteToGCS(final String bucketName,
        final SerializableFunction<String, String> pathCreator) {
        this.bucketName = bucketName;
        this.pathCreator = pathCreator;
    }

    @Override
    public PCollection<Void> apply(final PCollection<KV<String, List<String>>> input) {

        return input
            .apply(ParDo.of(new DoFn<KV<String, List<String>>, Void>() {

                @Override
                public void processElement(
                    final DoFn<KV<String, List<String>>, Void>.ProcessContext arg0)
                    throws Exception {
                    final String key = arg0.element().getKey();
                    final List<String> values = arg0.element().getValue();
                    final String toWrite = values.stream().collect(Collectors.joining("\n"));
                    final String path = pathCreator.apply(key);
                    BlobInfo blobInfo = BlobInfo.newBuilder(bucketName, path)
                        .setContentType(MimeTypes.TEXT)
                        .build();
                    LOG.info("blob writing to: {}", blobInfo);
                    Blob result = STORAGE.create(blobInfo,
                        toWrite.getBytes(StandardCharsets.UTF_8));
                }
            }));
    }
}
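
The snippet above also references AccumulatorOfWords.getCombineFn(), which isn't shown in the answer. A minimal sketch of such a combine function, assuming it simply collects every value of a key into a List (and that, per the caveat above, the per-key data fits in memory), might look like this; imports are omitted and the exact Combine class comes from whichever SDK (Dataflow 1.x or Beam) the surrounding code uses:

    // Hypothetical sketch of the referenced combine function: it just
    // accumulates all values for a key into a single List<String>.
    public class AccumulatorOfWords {

        public static Combine.CombineFn<String, List<String>, List<String>> getCombineFn() {
            return new Combine.CombineFn<String, List<String>, List<String>>() {

                @Override
                public List<String> createAccumulator() {
                    return new ArrayList<>();
                }

                @Override
                public List<String> addInput(final List<String> accumulator, final String input) {
                    accumulator.add(input);
                    return accumulator;
                }

                @Override
                public List<String> mergeAccumulators(final Iterable<List<String>> accumulators) {
                    final List<String> merged = new ArrayList<>();
                    for (final List<String> accumulator : accumulators) {
                        merged.addAll(accumulator);
                    }
                    return merged;
                }

                @Override
                public List<String> extractOutput(final List<String> accumulator) {
                    return accumulator;
                }
            };
        }
    }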

Just write a loop in a ParDo function! More details: I had the same scenario today; the only difference is that in my case key=image_label and value=image_tf_record. So, like what you have asked, I am trying to create separate TFRecord files, one per class, each record file containing a number of images. HOWEVER, I am not sure whether there might be memory issues when the number of values per key is very high, as in your scenario. (Also, my code is in Python.)

import apache_beam as beam
import tensorflow as tf


class WriteToSeparateTFRecordFiles(beam.DoFn):

    def __init__(self, outdir):
        self.outdir = outdir

    def process(self, element):
        l, image_list = element
        writer = tf.python_io.TFRecordWriter(self.outdir + "/tfr" + str(l) + '.tfrecord')
        for example in image_list:
            writer.write(example.SerializeToString())
        writer.close()

And then in your pipeline, just after the stage where you get the key-value pairs, add these two lines:

   (p
    | 'GroupByLabelId' >> beam.GroupByKey()
    | 'SaveToMultipleFiles' >> beam.ParDo(WriteToSeparateTFRecordFiles(output_dir))
    )

You can use FileIO.writeDynamic() for that:

PCollection<KV<String, String>> readfile = (something you read..);

readfile.apply(FileIO.<String, KV<String, String>>writeDynamic()
    .by(KV::getKey)
    .withDestinationCoder(StringUtf8Coder.of())
    .via(Contextful.fn(KV::getValue), TextIO.sink())
    .to("somefolder")
    .withNaming(key -> FileIO.Write.defaultNaming(key, ".txt")));

p.run();

In the Apache Beam 2.2 Java SDK, this is natively supported in TextIO and AvroIO, using TextIO.write().to(DynamicDestinations) and AvroIO.write().to(DynamicDestinations) respectively. See e.g. this method.

Update (2018): prefer to use FileIO.writeDynamic() together with TextIO.sink() and AvroIO.sink() instead.
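
For example, a minimal sketch of that FileIO.writeDynamic() + AvroIO.sink() combination could look like the following. This is only an illustration: UserEvent, its getUserId() accessor, and the output path are hypothetical placeholders that do not come from the answers above, and imports are omitted as in the other snippets.

    // Hedged sketch: write Avro records into one group of files per key.
    // UserEvent is an assumed class with an Avro-compatible schema.
    PCollection<UserEvent> events = ...; // whatever you read upstream

    events.apply(FileIO.<String, UserEvent>writeDynamic()
        .by(event -> event.getUserId())            // destination key derived from each element
        .via(AvroIO.sink(UserEvent.class))         // serialize each element as Avro
        .to("gs://some-bucket/events")             // hypothetical output prefix
        .withNaming(userId -> FileIO.Write.defaultNaming(userId, ".avro"))
        .withDestinationCoder(StringUtf8Coder.of()));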

Just write the lines below in your ParDo class:

    from apache_beam.io import filesystems

    eventCSVFileWriter = filesystems.FileSystems.create(gcsFileName)
    for record in list(Records):
        eventCSVFileWriter.write(record)

If you want the full code, I can help you with that too.
