Q: Converting Avro to Parquet in Memory
I'm receiving Avro records from Kafka, and I want to convert these records into Parquet files. I'm following this blog post: http://blog.cloudera.com/blog/2014/05/how-to-convert-existing-data-into-parquet/

The code so far looks roughly like this:
```java
final String fileName;
SinkRecord record;
final AvroData avroData;

final Schema avroSchema = avroData.fromConnectSchema(record.valueSchema());
CompressionCodecName compressionCodecName = CompressionCodecName.SNAPPY;
int blockSize = 256 * 1024 * 1024;  // 256 MB
int pageSize = 64 * 1024;           // 64 KB

Path path = new Path(fileName);
writer = new AvroParquetWriter<>(path, avroSchema, compressionCodecName, blockSize, pageSize);
```
Now, this does perform the Avro-to-Parquet conversion, but it writes the Parquet file to disk. I'd like to know if there is an easier way to just keep the file in memory, so that I don't have to manage temporary files on disk. Thanks.
"but it will write the Parquet file to the disk"
"if there was an easier way to just keep the file in memory"
From your question I understand that you don't want to write partial files as Parquet. If you want the complete file written to disk in Parquet format, with the temporary data kept in memory, you can combine a memory-mapped file with the Parquet format.

Write your data to the memory-mapped file, and once the writes are complete, convert the bytes to Parquet format and store them on disk.

Take a look at MappedByteBuffer.
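The memory-mapped staging step can be sketched with plain JDK classes (a minimal illustration, not tied to Parquet: the temp-file name, region size, and payload are arbitrary, and a real staging file would be sized for the actual data):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedWriteDemo {
    // Write the given bytes through a memory-mapped region, then read them back.
    static byte[] roundTrip(byte[] data) throws IOException {
        Path tmp = Files.createTempFile("staging", ".bin");
        try (FileChannel ch = FileChannel.open(tmp,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // Mapping in READ_WRITE mode grows the file to the mapped size;
            // puts land in memory and the OS flushes them lazily.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, data.length);
            buf.put(data);
            buf.force(); // explicitly flush the mapped region to disk
        }
        byte[] back = Files.readAllBytes(tmp);
        Files.deleteIfExists(tmp);
        return back;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(new String(roundTrip("hello parquet".getBytes())));
    }
}
```

Note that a `MappedByteBuffer` is still backed by a file, so this buys you buffered, memory-speed writes rather than a purely in-memory file; the answer below shows a fully in-memory approach.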
Please take a look at my blog post, https://yanbin.blog/convert-apache-avro-to-parquet-format-in-java/ (it's in Chinese; translate it to English if needed).
```java
package yanbin.blog;

import org.apache.parquet.io.DelegatingPositionOutputStream;
import org.apache.parquet.io.OutputFile;
import org.apache.parquet.io.PositionOutputStream;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class InMemoryOutputFile implements OutputFile {
    private final ByteArrayOutputStream baos = new ByteArrayOutputStream();

    @Override
    public PositionOutputStream create(long blockSizeHint) throws IOException { // Mode.CREATE calls this method
        return new InMemoryPositionOutputStream(baos);
    }

    @Override
    public PositionOutputStream createOrOverwrite(long blockSizeHint) throws IOException {
        baos.reset(); // discard any previously written bytes, then behave like create()
        return new InMemoryPositionOutputStream(baos);
    }

    @Override
    public boolean supportsBlockSize() {
        return false;
    }

    @Override
    public long defaultBlockSize() {
        return 0;
    }

    public byte[] toArray() {
        return baos.toByteArray();
    }

    private static class InMemoryPositionOutputStream extends DelegatingPositionOutputStream {
        public InMemoryPositionOutputStream(OutputStream outputStream) {
            super(outputStream);
        }

        @Override
        public long getPos() throws IOException {
            return ((ByteArrayOutputStream) this.getStream()).size();
        }
    }
}
```
```java
import org.apache.avro.Schema;
import org.apache.avro.data.TimeConversions;
import org.apache.avro.generic.GenericData;
import org.apache.avro.specific.SpecificRecordBase;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public static <T extends SpecificRecordBase> void writeToParquet(List<T> avroObjects) throws IOException {
    Schema avroSchema = avroObjects.get(0).getSchema();
    GenericData genericData = GenericData.get();
    genericData.addLogicalTypeConversion(new TimeConversions.DateConversion());

    InMemoryOutputFile outputFile = new InMemoryOutputFile();
    try (ParquetWriter<Object> writer = AvroParquetWriter.builder(outputFile)
            .withDataModel(genericData)
            .withSchema(avroSchema)
            .withCompressionCodec(CompressionCodecName.SNAPPY)
            .withWriteMode(ParquetFileWriter.Mode.CREATE)
            .build()) {
        avroObjects.forEach(r -> {
            try {
                writer.write(r);
            } catch (IOException ex) {
                throw new UncheckedIOException(ex);
            }
        });
    }

    // dump the in-memory data to a file for testing
    Files.write(Paths.get("./users-memory.parquet"), outputFile.toArray());
}
```
Inspect the data written from memory:

```
$ parquet-tools cat --json users-memory.parquet
$ parquet-tools schema users-memory.parquet
```