BigQuery Streaming Insert Error - Repeated record added outside of an array
I'm facing a weird problem while using Dataflow Streaming Insert. I have a JSON with a lot of records and arrays. I set up the pipeline with the Streaming Insert method and a DeadLetters class to handle the errors.
formattedWaiting.apply("Insert Bigquery",
        BigQueryIO.<KV<TableRow, String>>write()
            .to(customOptions.getOutputTable())
            .withFormatFunction(kv -> kv.getKey())
            .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
            .withSchemaFromView(schema)
            .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(WriteDisposition.WRITE_APPEND)
            .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
            .withoutValidation()
            .withTimePartitioning(new TimePartitioning().setField(customOptions.getPartitionField().get()))
            .withClustering(clusteringFieldsList)
            .withExtendedErrorInfo())
    .getFailedInsertsWithErr()
    .apply("Taking 1 element insertion", Sample.<BigQueryInsertError>any(1))
    .apply("Insertion errors", ParDo.of(new DeadLettersHandler()));
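For context, since `.withExtendedErrorInfo()` is set, the dead-letter step receives `BigQueryInsertError` elements carrying both the failed row and the insert error. A minimal sketch of such a handler could look like this (the real DeadLettersHandler is not shown in the question, so the body here is an assumption; only the `getRow()`/`getError()` accessors are actual Beam API):

```java
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryInsertError;
import org.apache.beam.sdk.transforms.DoFn;

// Hypothetical sketch of the error handler referenced in the pipeline above;
// the original DeadLettersHandler implementation is not included in the question.
public class DeadLettersHandler extends DoFn<BigQueryInsertError, Void> {
  @ProcessElement
  public void processElement(@Element BigQueryInsertError error) {
    // BigQueryInsertError exposes the failed TableRow and the insert errors,
    // so the "Repeated record ... added outside of an array" message shows up here.
    System.err.println("Failed row: " + error.getRow());
    System.err.println("Insert error: " + error.getError());
  }
}
```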
The problem is that when I use the streaming insert method, some rows are not inserted into the table and I receive the error:

Repeated record with name: XXXX added outside of an array.

I double-checked the JSON that has the problem and everything seems fine. The weird part is that when I comment out the withMethod line, the row inserts with no issue at all. I don't know why the pipeline behaves that way.
The JSON looks like this:
{
  "parameters": {
    "parameter": [
      {
        "subParameter": [
          {
            "value": "T",
            "key": "C"
          },
          {
            "value": "1",
            "key": "SEQUENCE_NUMBER"
          },
          {
            "value": "1",
            "key": "SEQUENCE_NUMBER"
          }
        ],
        "value": "C",
        "key": "C"
      },
      {
        "subParameter": [
          {
            "value": "T",
            "key": "C"
          },
          {
            "value": "1",
            "key": "SEQUENCE_NUMBER"
          },
          {
            "value": "2",
            "key": "SEQUENCE_NUMBER"
          }
        ],
        "value": "C",
        "key": "C"
      }
    ]
  }
}
The BigQuery schema is fine, because I can insert data when I comment out the streaming insert line in the BigQueryIO.

Any ideas, fellows?

Thanks in advance!
Just an update to this question.

The problem was with the schema declaration and the JSON itself.
We defined the parameters column as RECORD REPEATED, but parameters is an object in the JSON example. So we have two options here:

1. Change RECORD REPEATED to RECORD NULLABLE.
2. Add brackets to the parameters object; for this option you will have to transform the JSON and add the brackets to treat the object as an array. Example:
{
  "parameters": [
    {
      "parameter": [
        {
          "subParameter": [
            {
              "value": "T",
              "key": "C"
            },
            {
              "value": "1",
              "key": "SEQUENCE_NUMBER"
            },
            {
              "value": "1",
              "key": "SEQUENCE_NUMBER"
            }
          ],
          "value": "C",
          "key": "C"
        }
      ]
    }
  ]
}
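For option 1, the matching table schema (in bq-style JSON) would look something like the sketch below. This is an assumption reconstructed from the example payload; the actual table may have more fields:

```json
[
  {
    "name": "parameters",
    "type": "RECORD",
    "mode": "NULLABLE",
    "fields": [
      {
        "name": "parameter",
        "type": "RECORD",
        "mode": "REPEATED",
        "fields": [
          {"name": "key", "type": "STRING", "mode": "NULLABLE"},
          {"name": "value", "type": "STRING", "mode": "NULLABLE"},
          {
            "name": "subParameter",
            "type": "RECORD",
            "mode": "REPEATED",
            "fields": [
              {"name": "key", "type": "STRING", "mode": "NULLABLE"},
              {"name": "value", "type": "STRING", "mode": "NULLABLE"}
            ]
          }
        ]
      }
    ]
  }
]
```

With mode NULLABLE on parameters, streaming inserts accept the JSON object as-is; with mode REPEATED, BigQuery expects a JSON array, which is exactly what the "added outside of an array" error is complaining about.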