
BigQuery Streaming Insert Error - Repeated record added outside of an array

I'm facing a weird problem while using Dataflow streaming inserts. I have a JSON with a lot of records and arrays. I set up the pipeline with the streaming insert method and a DeadLetters class to handle the errors.

formattedWaiting.apply("Insert BigQuery",
                BigQueryIO.<KV<TableRow, String>>write()
                .to(customOptions.getOutputTable())
                .withFormatFunction(kv -> kv.getKey())
                .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
                .withSchemaFromView(schema)
                .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
                .withWriteDisposition(WriteDisposition.WRITE_APPEND)
                .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
                .withoutValidation()
                .withTimePartitioning(new TimePartitioning().setField(customOptions.getPartitionField().get()))
                .withClustering(clusteringFieldsList)
                .withExtendedErrorInfo())
                .getFailedInsertsWithErr()
                .apply("Taking 1 element insertion", Sample.<BigQueryInsertError>any(1))
                .apply("Insertion errors", ParDo.of(new DeadLettersHandler()));

The problem is that when I use the streaming insert method, some rows are not inserted into the table and I receive the error:

Repeated record with name: XXXX added outside of an array.

I double-checked the JSON that has the problem and everything seems fine. The weird part is that when I comment out the withMethod line, the row inserts with no issue at all.

I don't know why the pipeline behaves this way.

The JSON looks like this:

{
   "parameters":{
      "parameter":[
         {
            "subParameter":[
               {
                  "value":"T",
                  "key":"C"
               },
               {
                  "value":"1",
                  "key":"SEQUENCE_NUMBER"
               },
               {
                  "value":"1",
                  "key":"SEQUENCE_NUMBER"
               }
            ],
            "value":"C",
            "key":"C"
         },
         {
            "subParameter":[
               {
                  "value":"T",
                  "key":"C"
               },
               {
                  "value":"1",
                  "key":"SEQUENCE_NUMBER"
               },
               {
                  "value":"2",
                  "key":"SEQUENCE_NUMBER"
               }
            ],
            "value":"C",
            "key":"C"
         }
      ]
   }
}

The BigQuery schema is fine, because I can insert the data when I comment out the streaming insert line in the BigQueryIO setup.

Any ideas, fellows?

Thanks in advance!

Just an update to this question.

The problem was with the schema declaration and the JSON itself.

We defined the parameters column as RECORD REPEATED, but parameters is an object in the JSON example, not an array.
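For illustration, the mismatch looks like this in BigQuery's JSON schema notation (only the top-level field is shown; the nested fields are omitted here). A REPEATED mode means BigQuery expects the value of parameters to be a JSON array of objects, not a single object:

```json
{
  "name": "parameters",
  "type": "RECORD",
  "mode": "REPEATED"
}
```

With `"mode": "NULLABLE"` instead, a single `"parameters": { ... }` object is accepted, which is option 1 below.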

So we have two options here:

  1. Change the BigQuery schema from RECORD REPEATED to RECORD NULLABLE.
  2. Add brackets [] around the parameters object. For this option you will have to transform the JSON and add the brackets to treat the object as an array.

Example:

{
   "parameters":[
      {
         "parameter":[
            {
               "subParameter":[
                  {
                     "value":"T",
                     "key":"C"
                  },
                  {
                     "value":"1",
                     "key":"SEQUENCE_NUMBER"
                  },
                  {
                     "value":"1",
                     "key":"SEQUENCE_NUMBER"
                  }
               ],
               "value":"C",
               "key":"C"
            }
         ]
      }
   ]
}
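Option 2 can be sketched as a small transform that wraps the top-level parameters object in an array before the row is formatted. The helper below is hypothetical and deliberately naive string surgery (it assumes parameters is the only top-level key in the payload); a real pipeline would do this with a JSON library such as Gson inside the format function:

```java
public class WrapParameters {

    // Hypothetical helper: turns {"parameters":{...}} into {"parameters":[{...}]}
    // so the value matches a RECORD REPEATED column.
    // Assumption: "parameters" is the only top-level key in the payload.
    static String wrapParametersInArray(String json) {
        String key = "\"parameters\":";
        int start = json.indexOf(key) + key.length(); // start of the object value
        int end = json.lastIndexOf('}');              // closing brace of the root object
        String inner = json.substring(start, end).trim();
        return json.substring(0, start) + "[" + inner + "]" + json.substring(end);
    }

    public static void main(String[] args) {
        String in = "{\"parameters\":{\"parameter\":[]}}";
        System.out.println(wrapParametersInArray(in));
        // {"parameters":[{"parameter":[]}]}
    }
}
```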

