
BigQuery Streaming Insert Error - Repeated record added outside of an array

I'm facing a weird problem while using Dataflow streaming inserts. I have a JSON with a lot of records and arrays. I set up the pipeline with the streaming insert method and a DeadLetters class to handle the errors.

formattedWaiting.apply("Insert BigQuery",
                BigQueryIO.<KV<TableRow, String>>write()
                .to(customOptions.getOutputTable())
                .withFormatFunction(kv -> kv.getKey())
                .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
                .withSchemaFromView(schema)
                .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
                .withWriteDisposition(WriteDisposition.WRITE_APPEND)
                .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
                .withoutValidation()
                .withTimePartitioning(new TimePartitioning().setField(customOptions.getPartitionField().get()))
                .withClustering(clusteringFieldsList)
                .withExtendedErrorInfo())
                .getFailedInsertsWithErr()
                .apply("Taking 1 element insertion", Sample.<BigQueryInsertError>any(1))
                .apply("Insertion errors", ParDo.of(new DeadLettersHandler()));

The problem is that when I use the streaming insert method, some rows are not inserted into the table and I receive the error:

Repeated record with name: XXXX added outside of an array.

I double-checked the JSON that has the problem and everything seems fine. The weird part is that when I comment out the withMethod line, the row inserts with no issue at all.

I don't know why the pipeline behaves this way.

The JSON looks like this:

{
   "parameters":{
      "parameter":[
         {
            "subParameter":[
               {
                  "value":"T",
                  "key":"C"
               },
               {
                  "value":"1",
                  "key":"SEQUENCE_NUMBER"
               },
               {
                  "value":"1",
                  "key":"SEQUENCE_NUMBER"
               }
            ],
            "value":"C",
            "key":"C"
         },
         {
            "subParameter":[
               {
                  "value":"T",
                  "key":"C"
               },
               {
                  "value":"1",
                  "key":"SEQUENCE_NUMBER"
               },
               {
                  "value":"2",
                  "key":"SEQUENCE_NUMBER"
               }
            ],
            "value":"C",
            "key":"C"
         }
      ]
   }
}

The BigQuery schema is fine, because I can insert the data when I comment out the streaming insert line in the BigQueryIO setup.

Any ideas, fellows?

Thanks in advance!

Just an update to this question.

The problem was with the schema declaration and the JSON itself.

We defined the parameters column as RECORD REPEATED, but parameters is an object in the JSON example, not an array.
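For illustration, the mismatch looks like this in BigQuery's JSON schema notation (only the top-level field is shown; the nested fields are omitted here). A REPEATED mode means BigQuery expects the value of parameters to be a JSON array of objects, not a single object:

```json
{
  "name": "parameters",
  "type": "RECORD",
  "mode": "REPEATED"
}
```

With `"mode": "NULLABLE"` instead, a single `"parameters": { ... }` object is accepted, which is option 1 below.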

So we have two options here:

  1. Change the BigQuery schema from RECORD REPEATED to RECORD NULLABLE.
  2. Add brackets [] around the parameters object. For this option you will have to transform the JSON and add the brackets to treat the object as an array.

Example:

{
   "parameters":[
      {
         "parameter":[
            {
               "subParameter":[
                  {
                     "value":"T",
                     "key":"C"
                  },
                  {
                     "value":"1",
                     "key":"SEQUENCE_NUMBER"
                  },
                  {
                     "value":"1",
                     "key":"SEQUENCE_NUMBER"
                  }
               ],
               "value":"C",
               "key":"C"
            }
         ]
      }
   ]
}
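Option 2 can be sketched as a small transform that wraps the top-level parameters object in an array before the row is formatted. The helper below is hypothetical and deliberately naive string surgery (it assumes parameters is the only top-level key in the payload); a real pipeline would do this with a JSON library such as Gson inside the format function:

```java
public class WrapParameters {

    // Hypothetical helper: turns {"parameters":{...}} into {"parameters":[{...}]}
    // so the value matches a RECORD REPEATED column.
    // Assumption: "parameters" is the only top-level key in the payload.
    static String wrapParametersInArray(String json) {
        String key = "\"parameters\":";
        int start = json.indexOf(key) + key.length(); // start of the object value
        int end = json.lastIndexOf('}');              // closing brace of the root object
        String inner = json.substring(start, end).trim();
        return json.substring(0, start) + "[" + inner + "]" + json.substring(end);
    }

    public static void main(String[] args) {
        String in = "{\"parameters\":{\"parameter\":[]}}";
        System.out.println(wrapParametersInArray(in));
        // {"parameters":[{"parameter":[]}]}
    }
}
```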

