How to create a REPEATED type in a Parquet file schema with Avro?
We are creating a Dataflow pipeline that reads data from Postgres and writes it to a Parquet file. ParquetIO.Sink allows you to write a PCollection of GenericRecord into a Parquet file (from https://beam.apache.org/releases/javadoc/2.5.0/org/apache/beam/sdk/io/parquet/ParquetIO.html). But the Parquet file schema is not what I expected.
Here is my schema:
schema = new org.apache.avro.Schema.Parser().parse("{\n" +
        "  \"type\": \"record\",\n" +
        "  \"namespace\": \"com.example\",\n" +
        "  \"name\": \"Patterns\",\n" +
        "  \"fields\": [\n" +
        "    { \"name\": \"id\", \"type\": \"string\" },\n" +
        "    { \"name\": \"name\", \"type\": \"string\" },\n" +
        "    { \"name\": \"createdAt\", \"type\": {\"type\":\"string\",\"logicalType\":\"timestamp-millis\"} },\n" +
        "    { \"name\": \"updatedAt\", \"type\": {\"type\":\"string\",\"logicalType\":\"timestamp-millis\"} },\n" +
        "    { \"name\": \"steps\", \"type\": [\"null\",{\"type\":\"array\",\"items\":{\"type\":\"string\",\"name\":\"json\"}}] }\n" +
        "  ]\n" +
        "}");
This is my code so far:
Pipeline p = Pipeline.create(
        PipelineOptionsFactory.fromArgs(args).withValidation().create());
p.apply(JdbcIO.<GenericRecord>read()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                "org.postgresql.Driver", "jdbc:postgresql://localhost:port/database")
                .withUsername("username")
                .withPassword("password"))
        .withQuery("select * from table limit(10)")
        .withCoder(AvroCoder.of(schema))
        .withRowMapper((JdbcIO.RowMapper<GenericRecord>) resultSet -> {
            GenericRecord record = new GenericData.Record(schema);
            ResultSetMetaData metadata = resultSet.getMetaData();
            int columnsNumber = metadata.getColumnCount();
            for (int i = 0; i < columnsNumber; i++) {
                Object columnValue = resultSet.getObject(i + 1);
                if (columnValue instanceof UUID) columnValue = columnValue.toString();
                if (columnValue instanceof Timestamp) columnValue = columnValue.toString();
                if (columnValue instanceof PgArray) {
                    // Unwrap the Postgres array into a list of JSON strings
                    Object[] array = (Object[]) ((PgArray) columnValue).getArray();
                    List<String> list = new ArrayList<>();
                    for (Object d : array) {
                        if (d instanceof PGobject) {
                            list.add(((PGobject) d).getValue());
                        }
                    }
                    columnValue = list;
                }
                record.put(i, columnValue);
            }
            return record;
        }))
 .apply(FileIO.<GenericRecord>write()
        .via(ParquetIO.sink(schema).withCompressionCodec(CompressionCodecName.SNAPPY))
        .to("something.parquet"));
p.run();
This is what I get:
message com.example.table {
  required binary id (UTF8);
  required binary name (UTF8);
  required binary createdAt (UTF8);
  required binary updatedAt (UTF8);
  optional group someArray (LIST) {
    repeated binary array (UTF8);
  }
}
This is what I expected:
message com.example.table {
  required binary id (UTF8);
  required binary name (UTF8);
  required binary createdAt (UTF8);
  required binary updatedAt (UTF8);
  optional repeated binary someArray (UTF8);
}
Please help.
Is it a protobuf message you used to describe the expected schema? I think what you got is correctly generated from the specified JSON schema. "optional repeated" does not make sense in the protobuf language specification: https://developers.google.com/protocol-buffers/docs/reference/proto2-spec

You can remove the null branch and the square brackets to generate a plain repeated field, and that is semantically equivalent to optional repeated (since repeated means zero or more times).
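For example, the "steps" field from the question's schema could be declared without the nullable union, as the suggestion above describes (a sketch; whether the downstream conversion then emits the exact Parquet shape you want still depends on parquet-mr):

```json
{ "name": "steps", "type": {"type": "array", "items": "string"} }
```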
I did not find a way to create a repeated element from Avro that is not inside a GroupType.

The ParquetIO in Beam uses the "standard" Avro conversion defined in the parquet-mr project, which is implemented here. It appears that there are two ways to turn an Avro ARRAY field into a Parquet message, but neither of them creates what you are looking for.

Currently, the Avro conversion is the only way to interact with ParquetIO. I saw this JIRA, "Use Beam schema in ParquetIO", which would extend this to Beam Rows and might permit a different Parquet message strategy. Alternatively, you could create a JIRA feature request for ParquetIO to support Thrift structures, which should allow finer control over the Parquet structure.
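As a quick way to see which Parquet message a given Avro schema maps to, without running the pipeline, you can call parquet-mr's AvroSchemaConverter directly (a sketch; it assumes the parquet-avro artifact is on the classpath, and uses a cut-down version of the question's schema for brevity):

```java
import org.apache.avro.Schema;
import org.apache.parquet.avro.AvroSchemaConverter;
import org.apache.parquet.schema.MessageType;

public class InspectParquetSchema {
    public static void main(String[] args) {
        // Reduced schema: one scalar field plus a non-nullable string array
        Schema avroSchema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"namespace\":\"com.example\",\"name\":\"Patterns\",\"fields\":["
                + "{\"name\":\"id\",\"type\":\"string\"},"
                + "{\"name\":\"steps\",\"type\":{\"type\":\"array\",\"items\":\"string\"}}"
                + "]}");
        // The same conversion ParquetIO applies internally via parquet-mr
        MessageType parquetSchema = new AvroSchemaConverter().convert(avroSchema);
        System.out.println(parquetSchema);
    }
}
```

Printing the converted MessageType this way makes it easy to experiment with schema variants until the generated structure is as close as the converter allows.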