
Write tfrecords from beam pipeline?

I have some data as a Map and I want to convert it to tfrecords using a Beam pipeline. Here is my attempt at writing the code. I have done this in Python, where it works, but I need to implement it in Java because some business logic lives there that I can't port to Python. The corresponding working Python implementation can be found in this question.

import com.google.protobuf.ByteString;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.protobuf.ProtoCoder;
import org.apache.beam.sdk.io.TFRecordIO;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.commons.lang3.RandomStringUtils;
import org.tensorflow.example.BytesList;
import org.tensorflow.example.Example;
import org.tensorflow.example.Feature;
import org.tensorflow.example.Features;

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class Sample {

    static class Foo extends DoFn<Map<String, String>, Example> {

        public static Feature stringToFeature(String value) {
            ByteString byteString = ByteString.copyFrom(value.getBytes(StandardCharsets.UTF_8));
            BytesList bytesList = BytesList.newBuilder().addValue(byteString).build();
            return Feature.newBuilder().setBytesList(bytesList).build();
        }

        @ProcessElement
        public void processElement(@Element Map<String, String> element, OutputReceiver<Example> receiver) {

            Features features = Features.newBuilder()
                    .putFeature("foo", stringToFeature(element.get("foo")))
                    .putFeature("bar", stringToFeature(element.get("bar")))
                    .build();

            Example example = Example
                    .newBuilder()
                    .setFeatures(features)
                    .build();

            receiver.output(example);
        }

    }

    private static Map<String, String> generateRecord() {
        String[] keys = {"foo", "bar"};
        return IntStream.range(0,keys.length)
                .boxed()
                .collect(Collectors
                        .toMap(i -> keys[i],
                                i -> RandomStringUtils.randomAlphabetic(8)));
    }

    public static void main(String[] args) {

        List<Map<String, String>> records = new ArrayList<>();
        for (int i=0; i<10; i++) {
            records.add(generateRecord());
        }

        System.out.println(records);
        Pipeline p = Pipeline.create();

        p.apply("Input creation", Create.of(records))
                .apply("Encode to Exampple", ParDo.of(new Foo())).setCoder(ProtoCoder.of(Example.class))
                .apply("Write to disk",
                        TFRecordIO.write()
                                .to("output")
                                .withNumShards(2)
                                .withSuffix(".tfrecord"));

        p.run();


    }
}

For the above code I am getting the following error at compile time:

Error:(70, 17) java: no suitable method found for apply(java.lang.String,org.apache.beam.sdk.io.TFRecordIO.Write)
    method org.apache.beam.sdk.values.PCollection.<OutputT>apply(org.apache.beam.sdk.transforms.PTransform<? super org.apache.beam.sdk.values.PCollection<org.tensorflow.example.Example>,OutputT>) is not applicable
      (cannot infer type-variable(s) OutputT
        (actual and formal argument lists differ in length))
    method org.apache.beam.sdk.values.PCollection.<OutputT>apply(java.lang.String,org.apache.beam.sdk.transforms.PTransform<? super org.apache.beam.sdk.values.PCollection<org.tensorflow.example.Example>,OutputT>) is not applicable
      (cannot infer type-variable(s) OutputT
        (argument mismatch; org.apache.beam.sdk.io.TFRecordIO.Write cannot be converted to org.apache.beam.sdk.transforms.PTransform<? super org.apache.beam.sdk.values.PCollection<org.tensorflow.example.Example>,OutputT>))

The input to TFRecordIO.write() should be byte[], so making the following changes worked for me. (TFRecordIO.Write is a PTransform over PCollection&lt;byte[]&gt;, which is why applying it to the PCollection&lt;Example&gt; above fails to compile.)

static class Foo extends DoFn<Map<String, String>, byte[]> {

    public static Feature stringToFeature(String value) {
        ByteString byteString = ByteString.copyFrom(value.getBytes(StandardCharsets.UTF_8));
        BytesList bytesList = BytesList.newBuilder().addValue(byteString).build();
        return Feature.newBuilder().setBytesList(bytesList).build();
    }

    @ProcessElement
    public void processElement(@Element Map<String, String> element, OutputReceiver<byte[]> receiver) {

        Features features = Features.newBuilder()
                .putFeature("foo", stringToFeature(element.get("foo")))
                .putFeature("bar", stringToFeature(element.get("bar")))
                .build();

        Example example = Example
                .newBuilder()
                .setFeatures(features)
                .build();

        receiver.output(example.toByteArray());
    }

}
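With Foo now emitting byte[], the explicit ProtoCoder in main is no longer needed; Beam's default coder registry already maps byte[] to ByteArrayCoder. A minimal sketch of the rewired pipeline, keeping everything else from the original main unchanged:

public static void main(String[] args) {

    List<Map<String, String>> records = new ArrayList<>();
    for (int i = 0; i < 10; i++) {
        records.add(generateRecord());
    }

    Pipeline p = Pipeline.create();

    p.apply("Input creation", Create.of(records))
            // Foo outputs byte[], so no setCoder(...) call is needed:
            // the default ByteArrayCoder is inferred automatically.
            .apply("Encode to Example", ParDo.of(new Foo()))
            .apply("Write to disk",
                    TFRecordIO.write()
                            .to("output")
                            .withNumShards(2)
                            .withSuffix(".tfrecord"));

    // waitUntilFinish() blocks until all shards have been written.
    p.run().waitUntilFinish();
}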

You need to convert the input to TFRecordIO to byte[].

You can do it by using a transform like:

static class StringToByteArray extends DoFn<String, byte[]> {
    @ProcessElement
    public void processElement(ProcessContext c) {
        // Charsets.UTF_8 comes from Guava (com.google.common.base.Charsets);
        // java.nio.charset.StandardCharsets.UTF_8 works equally well.
        c.output(c.element().getBytes(Charsets.UTF_8));
    }
}
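For illustration, here is a hypothetical wiring of this transform in front of TFRecordIO.write(); the Create.of(...) input strings and the output path are placeholders:

// Hypothetical usage sketch; input values and output path are placeholders.
Pipeline p = Pipeline.create();
p.apply(Create.of("first record", "second record"))
        .apply("String to bytes", ParDo.of(new StringToByteArray()))
        .apply(TFRecordIO.write()
                .to("output")
                .withSuffix(".tfrecord"));
p.run();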
