
Write tfrecords from beam pipeline?

I have some data as a Map and I want to convert it to tfrecords using a Beam pipeline. Here is my attempt at writing the code. I have done this in Python, where it works, but I need to implement it in Java because some business logic lives there that I can't port to Python. The corresponding working Python implementation can be found in this question.

import com.google.protobuf.ByteString;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.protobuf.ProtoCoder;
import org.apache.beam.sdk.io.TFRecordIO;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.commons.lang3.RandomStringUtils;
import org.tensorflow.example.BytesList;
import org.tensorflow.example.Example;
import org.tensorflow.example.Feature;
import org.tensorflow.example.Features;

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class Sample {

    static class Foo extends DoFn<Map<String, String>, Example> {

        public static Feature stringToFeature(String value) {
            ByteString byteString = ByteString.copyFrom(value.getBytes(StandardCharsets.UTF_8));
            BytesList bytesList = BytesList.newBuilder().addValue(byteString).build();
            return Feature.newBuilder().setBytesList(bytesList).build();
        }

        @ProcessElement
        public void processElement(@Element Map<String, String> element, OutputReceiver<Example> receiver) {

            Features features = Features.newBuilder()
                    .putFeature("foo", stringToFeature(element.get("foo")))
                    .putFeature("bar", stringToFeature(element.get("bar")))
                    .build();

            Example example = Example
                    .newBuilder()
                    .setFeatures(features)
                    .build();

            receiver.output(example);
        }

    }

    private static Map<String, String> generateRecord() {
        String[] keys = {"foo", "bar"};
        return IntStream.range(0,keys.length)
                .boxed()
                .collect(Collectors
                        .toMap(i -> keys[i],
                                i -> RandomStringUtils.randomAlphabetic(8)));
    }

    public static void main(String[] args) {

        List<Map<String, String>> records = new ArrayList<>();
        for (int i=0; i<10; i++) {
            records.add(generateRecord());
        }

        System.out.println(records);
        Pipeline p = Pipeline.create();

        p.apply("Input creation", Create.of(records))
                .apply("Encode to Exampple", ParDo.of(new Foo())).setCoder(ProtoCoder.of(Example.class))
                .apply("Write to disk",
                        TFRecordIO.write()
                                .to("output")
                                .withNumShards(2)
                                .withSuffix(".tfrecord"));

        p.run();


    }
}

For the above code I am getting the following error at compile time:

Error:(70, 17) java: no suitable method found for apply(java.lang.String,org.apache.beam.sdk.io.TFRecordIO.Write)
    method org.apache.beam.sdk.values.PCollection.<OutputT>apply(org.apache.beam.sdk.transforms.PTransform<? super org.apache.beam.sdk.values.PCollection<org.tensorflow.example.Example>,OutputT>) is not applicable
      (cannot infer type-variable(s) OutputT
        (actual and formal argument lists differ in length))
    method org.apache.beam.sdk.values.PCollection.<OutputT>apply(java.lang.String,org.apache.beam.sdk.transforms.PTransform<? super org.apache.beam.sdk.values.PCollection<org.tensorflow.example.Example>,OutputT>) is not applicable
      (cannot infer type-variable(s) OutputT
        (argument mismatch; org.apache.beam.sdk.io.TFRecordIO.Write cannot be converted to org.apache.beam.sdk.transforms.PTransform<? super org.apache.beam.sdk.values.PCollection<org.tensorflow.example.Example>,OutputT>))

The input to TFRecordIO.write() should be byte[], so making the following changes worked for me. (TFRecordIO.Write is a PTransform over PCollection&lt;byte[]&gt;, which is why applying it to the PCollection&lt;Example&gt; above fails to compile.)

static class Foo extends DoFn<Map<String, String>, byte[]> {

    public static Feature stringToFeature(String value) {
        ByteString byteString = ByteString.copyFrom(value.getBytes(StandardCharsets.UTF_8));
        BytesList bytesList = BytesList.newBuilder().addValue(byteString).build();
        return Feature.newBuilder().setBytesList(bytesList).build();
    }

    @ProcessElement
    public void processElement(@Element Map<String, String> element, OutputReceiver<byte[]> receiver) {

        Features features = Features.newBuilder()
                .putFeature("foo", stringToFeature(element.get("foo")))
                .putFeature("bar", stringToFeature(element.get("bar")))
                .build();

        Example example = Example
                .newBuilder()
                .setFeatures(features)
                .build();

        receiver.output(example.toByteArray());
    }

}
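With Foo now emitting byte[], the explicit ProtoCoder in main is no longer needed; Beam's default coder registry already maps byte[] to ByteArrayCoder. A minimal sketch of the rewired pipeline, keeping everything else from the original main unchanged:

public static void main(String[] args) {

    List<Map<String, String>> records = new ArrayList<>();
    for (int i = 0; i < 10; i++) {
        records.add(generateRecord());
    }

    Pipeline p = Pipeline.create();

    p.apply("Input creation", Create.of(records))
            // Foo outputs byte[], so no setCoder(...) call is needed:
            // the default ByteArrayCoder is inferred automatically.
            .apply("Encode to Example", ParDo.of(new Foo()))
            .apply("Write to disk",
                    TFRecordIO.write()
                            .to("output")
                            .withNumShards(2)
                            .withSuffix(".tfrecord"));

    // waitUntilFinish() blocks until all shards have been written.
    p.run().waitUntilFinish();
}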

You need to convert the input to TFRecordIO to byte[].

You can do it by using a transform like:

static class StringToByteArray extends DoFn<String, byte[]> {
    @ProcessElement
    public void processElement(ProcessContext c) {
        // Charsets.UTF_8 comes from Guava (com.google.common.base.Charsets);
        // java.nio.charset.StandardCharsets.UTF_8 works equally well.
        c.output(c.element().getBytes(Charsets.UTF_8));
    }
}
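For illustration, here is a hypothetical wiring of this transform in front of TFRecordIO.write(); the Create.of(...) input strings and the output path are placeholders:

// Hypothetical usage sketch; input values and output path are placeholders.
Pipeline p = Pipeline.create();
p.apply(Create.of("first record", "second record"))
        .apply("String to bytes", ParDo.of(new StringToByteArray()))
        .apply(TFRecordIO.write()
                .to("output")
                .withSuffix(".tfrecord"));
p.run();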
