
How to convert a Parquet file to Protobuf and save it to HDFS/AWS S3

I have a file in Parquet format. I want to read it and save it to HDFS or AWS S3 in Protobuf format, using Spark with Scala. I don't know of any way to do this. I have searched many blogs but could not make sense of them. Can anyone help?

You can use ProtoParquetReader, which is a ParquetReader that uses ProtoReadSupport.

Something like:

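        // model was undeclared in the original snippet; read() returns protobuf messages (or builders)
        MessageOrBuilder model;
        try (ParquetReader<MessageOrBuilder> reader =
                 ProtoParquetReader.<MessageOrBuilder>builder(path).build()) {
            while ((model = reader.read()) != null) {
                System.out.println("check model -- " + model);
...
            }
        } catch (IOException e) {
            e.printStackTrace();
        }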

To read Avro records from a Parquet file, you can use the following code:

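// Record here is org.apache.avro.generic.GenericData.Record
public List<Record> read(Path path) throws IOException {
    List<Record> records = new ArrayList<>();
    // try-with-resources closes the reader even if read() throws
    try (ParquetReader<Record> reader = AvroParquetReader.<Record>builder(path)
            .withConf(new Configuration())
            .build()) {
        for (Record value = reader.read(); value != null; value = reader.read()) {
            records.add(value);
        }
    }
    return records;
}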

Writing the records back out would look something like this. It produces a Parquet file rather than a protobuf file, but it might help you get started. Keep in mind that you may run into issues if you end up using Spark Streaming with protobuf v2.6 or later.

public void write(List<Record> records, String location) throws IOException {
    Path filePath = new Path(location);

    // getSchema() and getConf() are helper methods that are not shown here.
    // Mode is org.apache.parquet.hadoop.ParquetFileWriter.Mode.
    try (ParquetWriter<Record> writer = AvroParquetWriter.<Record>builder(filePath)
            .withSchema(getSchema())
            .withConf(getConf())
            .withCompressionCodec(CompressionCodecName.SNAPPY)
            .withWriteMode(Mode.CREATE)
            .build()) {
        for (Record record : records) {
            writer.write(record);
        }
    }
    // I/O errors now propagate to the caller instead of being printed and swallowed.
}
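
Neither snippet above produces a protobuf file. Protobuf does not define a container file format of its own, so a common approach is to write each serialized message preceded by its length as a varint, which is the framing that protobuf's `writeDelimitedTo` and `parseDelimitedFrom` use. The following is a minimal JDK-only sketch of that framing, not a definitive implementation. The class name `DelimitedFraming` and its methods are hypothetical, and the payloads are plain byte arrays standing in for what `message.toByteArray()` would return.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper illustrating varint length-prefixed record framing.
final class DelimitedFraming {
    private DelimitedFraming() {}

    // Writes one record as varint(length) followed by the payload bytes.
    static void writeDelimited(OutputStream out, byte[] payload) throws IOException {
        int length = payload.length;
        // Base-128 varint: 7 bits per byte, high bit set while more bytes follow.
        while ((length & ~0x7F) != 0) {
            out.write((length & 0x7F) | 0x80);
            length >>>= 7;
        }
        out.write(length);
        out.write(payload);
    }

    // Reads the next record, or returns null at a clean end of stream.
    static byte[] readDelimited(InputStream in) throws IOException {
        int b = in.read();
        if (b == -1) {
            return null;
        }
        int length = b & 0x7F;
        int shift = 7;
        while ((b & 0x80) != 0) {
            if (shift > 28) {
                throw new IOException("length prefix too long");
            }
            b = in.read();
            if (b == -1) {
                throw new EOFException("truncated length prefix");
            }
            length |= (b & 0x7F) << shift;
            shift += 7;
        }
        if (length < 0) {
            throw new IOException("negative record length");
        }
        byte[] payload = new byte[length];
        // readFully throws EOFException if the stream ends mid-record.
        new DataInputStream(in).readFully(payload);
        return payload;
    }

    // Convenience wrapper: frames all records into one in-memory buffer.
    static byte[] frame(List<byte[]> records) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            for (byte[] record : records) {
                writeDelimited(out, record);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Convenience wrapper: splits a framed buffer back into its records.
    static List<byte[]> unframe(byte[] data) {
        List<byte[]> records = new ArrayList<>();
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        try {
            for (byte[] r = readDelimited(in); r != null; r = readDelimited(in)) {
                records.add(r);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] framed = frame(Arrays.asList(new byte[] {1, 2, 3}, new byte[300]));
        System.out.println(unframe(framed).size()); // prints 2
    }
}
```

With the real protobuf library you would not hand-roll this: calling `message.writeDelimitedTo(out)` for each message read from the Parquet file should give the same layout, and `out` can be any `OutputStream`, for example one opened through Hadoop's `FileSystem.create(path)` for an HDFS or S3 location.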
