How do I write to multiple files in Apache Beam?

Question

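Let me simplify my case. I'm using Apache Beam 0.6.0. My final processed result is a PCollection<KV<String, String>>. I want to write the values to different files according to their keys.

For example, suppose the result consists of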

(key1, value1)
(key2, value2)
(key1, value3)
(key1, value4)

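Then I want to write value1, value3 and value4 to key1.txt, and write value2 to key2.txt.

In my case:

The key set is determined while the pipeline is running, not when the pipeline is constructed.
The key set may be quite small, but the number of values per key may be very, very large.

Any ideas?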

Answer 1

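Handily, I wrote a sample of this case just the other day.

This example is in Dataflow 1.x style.

Basically, you group by each key, and then you can do the write with a custom transform that connects to Cloud Storage. The caveat is that the list of lines per file should not be massive (it has to fit into the memory of a single instance, but considering you can run high-memory instances, that limit is quite high).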

    ...
    PCollection<KV<String, List<String>>> readyToWrite = groupedByFirstLetter
                .apply(Combine.perKey(AccumulatorOfWords.getCombineFn()));
        readyToWrite.apply(
                new PTransformWriteToGCS("dataflow-experiment", TonyWordGrouper::derivePath));
    ...

The transform that does most of the work is:

public class PTransformWriteToGCS
    extends PTransform<PCollection<KV<String, List<String>>>, PCollection<Void>> {

    private static final Logger LOG = Logging.getLogger(PTransformWriteToGCS.class);

    private static final Storage STORAGE = StorageOptions.getDefaultInstance().getService();

    private final String bucketName;

    private final SerializableFunction<String, String> pathCreator;

    public PTransformWriteToGCS(final String bucketName,
        final SerializableFunction<String, String> pathCreator) {
        this.bucketName = bucketName;
        this.pathCreator = pathCreator;
    }

    @Override
    public PCollection<Void> apply(final PCollection<KV<String, List<String>>> input) {

        return input
            .apply(ParDo.of(new DoFn<KV<String, List<String>>, Void>() {

                @Override
                public void processElement(
                    final DoFn<KV<String, List<String>>, Void>.ProcessContext arg0)
                    throws Exception {
                    final String key = arg0.element().getKey();
                    final List<String> values = arg0.element().getValue();
                    final String toWrite = values.stream().collect(Collectors.joining("\n"));
                    final String path = pathCreator.apply(key);
                    BlobInfo blobInfo = BlobInfo.newBuilder(bucketName, path)
                        .setContentType(MimeTypes.TEXT)
                        .build();
                    LOG.info("blob writing to: {}", blobInfo);
                    Blob result = STORAGE.create(blobInfo,
                        toWrite.getBytes(StandardCharsets.UTF_8));
                }
            }));
    }
}
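
For readers who want to see the two steps (group by key, then write one file per key) in isolation, here is a minimal sketch in plain Python. It is not Beam or Cloud Storage code: `group_by_key` and `write_groups` are hypothetical helper names, the output goes to a local temporary directory, and the sample data is taken from the question. Like the transform above, it joins each key's values in memory before writing.

```python
import os
import tempfile
from collections import defaultdict


def group_by_key(pairs):
    """Collect the values of each key in arrival order (hypothetical helper)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def write_groups(groups, outdir):
    """Write one '<key>.txt' file per key, values separated by newlines."""
    paths = {}
    for key, values in groups.items():
        path = os.path.join(outdir, key + ".txt")
        with open(path, "w", encoding="utf-8") as f:
            # Same idea as Collectors.joining("\n") in the Java transform.
            f.write("\n".join(values))
        paths[key] = path
    return paths


# Sample data from the question.
pairs = [("key1", "value1"), ("key2", "value2"),
         ("key1", "value3"), ("key1", "value4")]
outdir = tempfile.mkdtemp()  # local stand-in for a bucket
paths = write_groups(group_by_key(pairs), outdir)
```

In a real pipeline the grouping would be done by the runner and the write would target remote storage; the sketch only shows the shape of the data at each step.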

Answer 2

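Just write a loop in a ParDo function! More details: I had the same scenario today; the only difference is that in my case key=image_label and value=image_tf_record. So, like what you asked, I am trying to create separate TFRecord files, one per class, with each record file containing a number of images. However, I'm not sure whether there might be memory issues when the number of values per key is very high, as in your scenario. (Also, my code is in Python.)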

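import apache_beam as beam
import tensorflow as tf


class WriteToSeparateTFRecordFiles(beam.DoFn):
    """Writes the examples of each label to that label's own TFRecord file."""

    def __init__(self, outdir):
        self.outdir = outdir

    def process(self, element):
        # element is a (label, examples) pair produced by GroupByKey
        l, image_list = element
        writer = tf.python_io.TFRecordWriter(self.outdir + "/tfr" + str(l) + '.tfrecord')
        for example in image_list:
            writer.write(example.SerializeToString())
        writer.close()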

Then, in your pipeline, add these two lines right after the stage where you get the key-value pairs:

   (p
    | 'GroupByLabelId' >> beam.GroupByKey()
    | 'SaveToMultipleFiles' >> beam.ParDo(WriteToSeparateTFRecordFiles(outdir))
    )
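
On the memory question raised in this answer: the loop itself only needs one value at a time, so it does not have to hold a whole group in memory as long as the grouped values arrive as a lazy iterable (whether they do depends on the runner, which is an assumption here). A minimal plain-Python sketch of that idea for text values follows; `write_group` and `many_values` are hypothetical names, and a generator stands in for a large group.

```python
import os
import tempfile


def write_group(element, outdir):
    """Write the values of one (key, values) pair to '<outdir>/<key>.txt'.

    `values` can be any iterable; it is consumed one item at a time,
    so the full group is never materialized by this function.
    """
    key, values = element
    path = os.path.join(outdir, str(key) + ".txt")
    with open(path, "w", encoding="utf-8") as f:
        for value in values:
            f.write(value + "\n")
    return path


def many_values(n):
    """Generator standing in for a very large group of values."""
    for i in range(n):
        yield "value%d" % i


outdir = tempfile.mkdtemp()
path = write_group(("key1", many_values(1000)), outdir)
```

A function of this shape could be applied after `beam.GroupByKey()` with something like `beam.Map(write_group, outdir=...)`, since `beam.Map` forwards extra keyword arguments to the callable.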

Answer 3

You can use FileIO.writeDynamic():

PCollection<KV<String,String>> readfile= (something you read..);

readfile.apply(FileIO.<String, KV<String, String>>writeDynamic()
    .by(KV::getKey)
    .withDestinationCoder(StringUtf8Coder.of())
    .via(Contextful.fn(KV::getValue), TextIO.sink())
    .to("somefolder")
    .withNaming(key -> FileIO.Write.defaultNaming(key, ".txt")));

p.run();
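
To make the roles of `.by(...)`, `.via(...)` and `.withNaming(...)` concrete, here is a rough plain-Python analogy. It is not how FileIO is implemented and it ignores sharding, windowing and temporary files; `DynamicTextWriter` and its parameters are hypothetical names chosen to mirror the Java call above.

```python
import os
import tempfile


class DynamicTextWriter:
    """Route each record to a file chosen per record (illustrative only)."""

    def __init__(self, outdir, by, via, naming):
        self.outdir = outdir
        self.by = by          # record -> destination, like .by(KV::getKey)
        self.via = via        # record -> text line, like .via(KV::getValue, ...)
        self.naming = naming  # destination -> file name, like .withNaming(...)
        self._handles = {}    # one open file handle per destination

    def write(self, record):
        dest = self.by(record)
        handle = self._handles.get(dest)
        if handle is None:
            path = os.path.join(self.outdir, self.naming(dest))
            handle = open(path, "w", encoding="utf-8")
            self._handles[dest] = handle
        handle.write(self.via(record) + "\n")

    def close(self):
        for handle in self._handles.values():
            handle.close()
        self._handles.clear()


outdir = tempfile.mkdtemp()
writer = DynamicTextWriter(
    outdir,
    by=lambda kv: kv[0],
    via=lambda kv: kv[1],
    naming=lambda key: key + ".txt",
)
try:
    for record in [("key1", "value1"), ("key2", "value2"),
                   ("key1", "value3"), ("key1", "value4")]:
        writer.write(record)
finally:
    writer.close()
```

The point of the analogy is that no explicit grouping step is needed: each record is routed as it arrives, which is why this style suits keys with very many values.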

Answer 4

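In the Apache Beam 2.2 Java SDK, this is supported natively in TextIO and AvroIO, using TextIO.write().to(DynamicDestinations) and AvroIO.write().to(DynamicDestinations) respectively. See, for example, this method.

Update (2018): Prefer using FileIO.writeDynamic() together with TextIO.sink() and AvroIO.sink() instead.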

Answer 5

Just write the following lines in your ParDo class:

    from apache_beam.io import filesystems

    eventCSVFileWriter = filesystems.FileSystems.create(gcsFileName)
    for record in list(Records):
        eventCSVFileWriter.write(record)

If you want the complete code, I can help with that too.
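
One detail worth spelling out for the snippet above: it writes records as they are, with no line terminator, and it assumes the writer accepts whatever type the records have. A minimal sketch of a helper that encodes each record as one UTF-8 line is shown below. It assumes the handle returned by `FileSystems.create` is a binary stream that accepts bytes; here `io.BytesIO` stands in for it, and `write_records` is a hypothetical name.

```python
import io


def write_records(stream, records, encoding="utf-8"):
    """Write each record as one newline-terminated line to a binary stream.

    Returns the number of records written.
    """
    count = 0
    for record in records:
        stream.write((str(record) + "\n").encode(encoding))
        count += 1
    return count


# io.BytesIO is a local stand-in for the remote file handle.
buffer = io.BytesIO()
written = write_records(buffer, ["a,1", "b,2"])
```

With a real handle, the stream should also be closed after the loop so that the file is finalized.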

5 answers

Answer 1: score 5, 2017-04-11 21:15:18
Answer 2: score 4, 2017-10-20 07:27:58
Answer 3: score 3, 2019-07-11 08:23:58
Answer 4: score 2, 2017-12-08 02:07:41
Answer 5: score -1, 2018-04-25 18:42:36
