
How to convert JSON to Parquet in Apache Beam using Java

I am trying to convert JSON data

{"col1":"sample-val-1", "col2":1.0}
{"col1":"sample-val-2", "col2":2.0}
{"col1":"sample-val-3", "col2":3.0}
{"col1":"sample-val-4", "col2":4.0}
{"col1":"sample-val-5", "col2":5.0}
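
For records like these, the Avro schema the converter needs would look roughly like this (field names taken from the samples; `col2` assumed to be a `double`, and `RecordName` is a placeholder):

```json
{
  "type": "record",
  "name": "RecordName",
  "fields": [
    {"name": "col1", "type": "string"},
    {"name": "col2", "type": "double"}
  ]
}
```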

and I need it converted to Parquet,

so I wrote some code in Apache Beam:

package org.apache.beam.examples;

import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
import com.google.gson.JsonObject;
import org.apache.avro.Schema;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.kitesdk.data.spi.JsonUtil;
import tech.allegro.schema.json2avro.converter.JsonAvroConverter;

import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;

public class Main {

    public static void main(String[] args) throws IOException {

        Pipeline pipeLine = Pipeline.create();
        PCollection<String> lines = pipeLine.apply("ReadMyFile", TextIO.read().from("path-to-file"));

        File initialFile = new File("path-to-file");
        InputStream targetStream = Files.newInputStream(initialFile.toPath());
        Schema jsonSchema = JsonUtil.inferSchema(targetStream, "RecordName", 20);
        System.out.println(jsonSchema.getDoc());
        PCollection<String> words = lines.apply(ParDo.of(new DoFn<String, String>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                Gson gson = new GsonBuilder().create();
                JsonObject parsedMap = gson.fromJson(c.element(), JsonObject.class);
//                out.output(parsedMap);
//                System.out.println(Arrays.toString(parsedMap.toString().getBytes(StandardCharsets.UTF_8)));
                JsonAvroConverter avroConverter = new JsonAvroConverter();
//                GenericRecord record =  avroConverter.convertToGenericDataRecord(parsedMap.toString().getBytes(), jsonSchema);

//                context.output(record);
            }
        }));
        pipeLine.run();
        //
//        pgr.apply(FileIO.<GenericRecord>write().via(ParquetIO.sink(schema)).to("path/to/save"));
        
    }
}

I am able to read the JSON line by line, but I am unable to convert it to Parquet. The code above throws an error if you try to convert the JSON using

GenericRecord record =  avroConverter.convertToGenericDataRecord(parsedMap.toString().getBytes(), jsonSchema);

The error is caused by that line:

Caused by: java.io.NotSerializableException: org.apache.avro.Schema$RecordSchema
    at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1185)
    at java.base/java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1553)
    at java.base/java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1510)
    at java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1433)
    at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1179)
    at java.base/java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1553)
    at java.base/java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1510)
    at java.base/java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1433)
    at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1179)
    at java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349)
    at org.apache.beam.sdk.util.SerializableUtils.serializeToByteArray(SerializableUtils.java:55)
    ... 25 more
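
The root cause is that Beam serializes the `DoFn` (including every field it captures) with Java serialization in order to ship it to workers, and Avro's `Schema` does not implement `Serializable`. The usual workaround is to keep only the schema's `String` form in the `DoFn` and rebuild the `Schema` lazily on the worker. A minimal, Beam-free sketch of that pattern using only the JDK (`SchemaHolder` is a hypothetical stand-in for the `DoFn`, and the string parsing stands in for `new Schema.Parser().parse(...)`):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SchemaHolder implements Serializable {

    private final String schemaString;     // serializable representation
    private transient Object parsedSchema; // skipped by serialization, rebuilt lazily

    public SchemaHolder(String schemaString) {
        this.schemaString = schemaString;
    }

    // In a real DoFn this is what @Setup does:
    // jsonSchema = new Schema.Parser().parse(schemaString);
    public Object getParsedSchema() {
        if (parsedSchema == null) {
            parsedSchema = "parsed:" + schemaString; // stand-in for actual parsing
        }
        return parsedSchema;
    }

    public String getSchemaString() {
        return schemaString;
    }

    // Round-trip through Java serialization, as Beam does with a DoFn.
    public static SchemaHolder roundTrip(SchemaHolder in) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(in);
        }
        try (ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()))) {
            return (SchemaHolder) ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        SchemaHolder copy = roundTrip(new SchemaHolder("{\"type\":\"record\"}"));
        System.out.println(copy.getParsedSchema());
    }
}
```

Because the `transient` field is simply dropped during serialization, the round-trip succeeds where a direct `Schema` field would throw `NotSerializableException`.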

I solved it by creating a new DoFn class that takes the schema as a constructor parameter, stores it as a string (since Avro's Schema is not Serializable), and re-creates the Schema in @Setup:

Schema jsonSchema = new Schema.Parser().parse(schemaString);
pipeLine.apply("ReadMyFile", TextIO.read().from(options.getInput()))
        .apply("Convert Json To General Record", ParDo.of(new JsonToGeneralRecord(jsonSchema)))
        .setCoder(AvroCoder.of(GenericRecord.class, jsonSchema));

static class JsonToGeneralRecord extends DoFn<String, GenericRecord> {

    private static final Logger logger = LogManager.getLogger(JsonToGeneralRecord.class);

    private final String schemaString;
    private transient Schema jsonSchema;

    // constructor: keep only the String form, because Schema is not Serializable
    JsonToGeneralRecord(Schema schema) {
        schemaString = schema.toString();
    }

    @Setup
    public void setup() {
        jsonSchema = new Schema.Parser().parse(schemaString);
    }

    @ProcessElement
    public void processElement(ProcessContext c) {

        Gson gson = new GsonBuilder().create();
        JsonObject parsedMap = gson.fromJson(c.element(), JsonObject.class);
        logger.info("successful: " + parsedMap);

        JsonAvroConverter avroConverter = new JsonAvroConverter();
        try {
            GenericRecord record = avroConverter.convertToGenericDataRecord(parsedMap.toString().getBytes(), jsonSchema);
            c.output(record);
        } catch (Exception e) {
            logger.error("error: " + e.getMessage() + parsedMap);
            e.printStackTrace();
        }
    }
}
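
With a `PCollection<GenericRecord>` in hand, the remaining step is the Parquet write that was commented out in the original code. A sketch, assuming the `beam-sdks-java-io-parquet` module is on the classpath and that `records` names the PCollection produced by `JsonToGeneralRecord` ("path/to/save" is a placeholder):

```java
// records is the PCollection<GenericRecord> from the ParDo above.
records.apply("WriteParquet",
        FileIO.<GenericRecord>write()
                .via(ParquetIO.sink(jsonSchema)) // same Schema used for the coder
                .to("path/to/save")
                .withSuffix(".parquet"));
```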
