JSON load error when using Apache Beam on Vertex AI

Question

I am trying to load data from Datastore into BigQuery using Apache Beam in a Vertex AI notebook. This is the part of the code where the load happens:

import apache_beam as beam
from apache_beam.io.gcp.bigquery_file_loads import BigQueryBatchFileLoads
from apache_beam.io.gcp.datastore.v1new.datastoreio import ReadFromDatastore

# p (the Pipeline), myquery, ent_to_json_func and table_schema are defined elsewhere.
table_row = (
    p
    | 'DatastoreGetData' >> ReadFromDatastore(query=myquery)
    | 'EntityConversion' >> beam.Map(ent_to_json_func, table_schema)
    | 'Final' >> BigQueryBatchFileLoads(
        destination=lambda row: 'myproject:dataset.mytable',
        custom_gcs_temp_location='gs://myproject/beam',
        write_disposition='WRITE_TRUNCATE',
        schema=table_schema,
    )
)

table_schema is the JSON version of the BigQuery table schema (see the column mapping picture below). ent_to_json_func converts each field coming from Datastore to the corresponding BigQuery field in the correct format.

I am trying to load just one row from Datastore, and it fails with an error. The data looks like this:

{ "key": { "namespace": null, "app": null, "path": "Table/12345678", "kind": "Mykind", "name": null, "id": 12345678 }, "col1": false, "col2": { "namespace": null,
"app": null, "path": "abc/12345", "kind": "abc", "name": null, "id": 12345 }, "col3": "6835218432", "col4": {
"namespace": null, "app": null, "path": null, "kind": null, "name": null, "id": null }, "col5": false,
"col6": null, "col7": "https://www.somewebsite.com/poi/",
"col8": "0.00", "col9": "2022-03-12 03:44:17.732193+00:00",
"col10": "{"someid":"NAME","col7":"https://www.somewebsite.com/poi/", "provided":"Yes","someid2":"SDFTYI1090"}", "col11": ""0.00"", "col12": "{}", "col13": [] }

The column mapping is here

The error is as follows:

--------------------------------------------------------------------------- 
RuntimeError                              Traceback (most recent call last) 
~/apache-beam-2.41.0/packages/beam/sdks/python/apache_beam/runners/common.py in process(self, windowed_value)
   1416     try:
-> 1417       return self.do_fn_invoker.invoke_process(windowed_value)
   1418     except BaseException as exn:

~/apache-beam-2.41.0/packages/beam/sdks/python/apache_beam/runners/common.py in invoke_process(self, windowed_value, restriction, watermark_estimator_state, additional_args, additional_kwargs)
    837       self._invoke_process_per_window(
--> 838           windowed_value, additional_args, additional_kwargs)
    839     return residuals

~/apache-beam-2.41.0/packages/beam/sdks/python/apache_beam/runners/common.py in _invoke_process_per_window(self, windowed_value, additional_args, additional_kwargs)
    982         windowed_value,
--> 983         self.process_method(*args_for_process, **kwargs_for_process),
    984         self.threadsafe_watermark_estimator)

~/apache-beam-2.41.0/packages/beam/sdks/python/apache_beam/io/gcp/bigquery_file_loads.py in process(self, element, dest_ids_list)
    753       # max_retries to 0.
--> 754       self.bq_wrapper.wait_for_bq_job(ref, sleep_duration_sec=10, max_retries=0)
    755 

~/apache-beam-2.41.0/packages/beam/sdks/python/apache_beam/io/gcp/bigquery_tools.py in wait_for_bq_job(self, job_reference, sleep_duration_sec, max_retries)
    637             'BigQuery job {} failed. Error Result: {}'.format(
--> 638                 job_reference.jobId, job.status.errorResult))
    639       elif job.status.state == 'DONE':

RuntimeError: BigQuery job beam_bq_job_LOAD_AUTOMATIC_JOB_NAME_LOAD_STEP_187_4cab298bbd73af86496c64ca35602a05_a5309204fb004ae0ba8007ac2169e079 failed. 
Error Result: <ErrorProto  
location: 'gs://myproject/beam/bq_load/63e94c1a210742aabab09f96/myproject.dataset.mytable/aed960fb-0ae6-489a-9cb8-e012eee0d9c8'  
message: 'Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details. 
File: gs://myproject/beam/bq_load/63e94c1a210742aabab09f96/myproject.dataset.mytable/aed960fb-0ae6-489a-9cb8-e012eee0d9c8'  
reason: 'invalid'>

Please let me know what to do. How can I determine the exact cause of the error? I also ran this JSON through a JSON validator, and it reported no issues.

UPDATE:

I found out that the issue is caused by a BYTES column. Datastore returns a bytes value, which I convert to a string using decode and save in the JSON. When I upload that JSON into BigQuery, it gives an error.

How should I proceed in this case?

Answer 1

To solve your issue, you have to set the STRING type for col10, col11, col12 and col13 in the BigQuery table.

The final dict in your PCollection needs to match the schema of the BigQuery table exactly.

In your JSON these columns appear as strings, so your schema also needs to declare the STRING type for them.
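As a rough illustration of what "match exactly" means, the sketch below compares one row dict against a schema before it ever reaches BigQuery. The schema contents, the modes, and the helper name schema_mismatches are all made up for this example (your real schema is in the column mapping), and it only checks STRING fields, so treat it as a starting point rather than a full validator.

```python
# Hypothetical schema in the {"fields": [...]} shape; names, types and modes
# are illustrative and not taken from the real table.
table_schema = {
    "fields": [
        {"name": "col8", "type": "STRING", "mode": "NULLABLE"},
        {"name": "col10", "type": "STRING", "mode": "NULLABLE"},
        {"name": "col13", "type": "STRING", "mode": "REPEATED"},
    ]
}


def schema_mismatches(row, schema):
    """Return human-readable problems found when comparing a row to a schema."""
    problems = []
    fields = {field["name"]: field for field in schema["fields"]}

    # Keys in the row that the table does not know about.
    for name in row:
        if name not in fields:
            problems.append(f"{name}: not in schema")

    for name, field in fields.items():
        value = row.get(name)
        mode = field.get("mode", "NULLABLE")
        if value is None:
            if mode == "REQUIRED":
                problems.append(f"{name}: REQUIRED but missing")
            continue
        if mode == "REPEATED":
            if not isinstance(value, list):
                problems.append(f"{name}: REPEATED but not a list")
                continue
            values = value
        else:
            values = [value]
        # Only STRING is checked in this sketch.
        if field["type"] == "STRING":
            for item in values:
                if not isinstance(item, str):
                    problems.append(
                        f"{name}: expected str for STRING, got {type(item).__name__}"
                    )
    return problems
```

Running something like this inside a beam.Map step on a handful of rows may point at the offending column faster than reading the load job error.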

Answer 2

The error message is accurate. The file you attempted to load into BigQuery is not valid JSON; see "{"someid":"NAME","col7":"https://www.somewebsite.com/poi/", which contains quotes that are not escaped.
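A minimal sketch of the difference, using only the standard json module. The variable names and the helper invalid_lines are mine, not part of Beam; the idea is that a nested value meant for a STRING column should itself go through json.dumps, so that serialising the whole row escapes the inner quotes. The helper can also be pointed at the lines of the temporary file to find which line fails to parse.

```python
import json

nested = {"someid": "NAME", "col7": "https://www.somewebsite.com/poi/", "provided": "Yes"}

# Building the line by hand leaves the inner quotes unescaped (invalid JSON).
broken_line = '{"col10": "' + json.dumps(nested) + '"}'

# Serialising the inner value first, then the whole row, escapes them.
row = {"col10": json.dumps(nested), "col12": json.dumps({})}
good_line = json.dumps(row)


def invalid_lines(lines):
    """Return (line_number, error_message) for each line that is not valid JSON."""
    bad = []
    for number, line in enumerate(lines, start=1):
        try:
            json.loads(line)
        except json.JSONDecodeError as exc:
            bad.append((number, str(exc)))
    return bad


for number, message in invalid_lines([good_line, broken_line]):
    print(f"line {number}: {message}")
```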

Loading into a BYTES column should actually be fine.
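One thing to check for the BYTES column: as far as I know, BigQuery expects BYTES values in a JSON load file to be base64-encoded text, so calling decode on the raw bytes would not produce what the load job wants. A small sketch, assuming a hypothetical BYTES column called col6 and a made-up payload:

```python
import base64
import json


def bytes_for_bigquery(value):
    """Encode raw bytes as base64 text for a BYTES column in a JSON load."""
    return base64.b64encode(value).decode("ascii")


raw = b"\x00\xffbinary payload"          # stand-in for a Datastore bytes property
row = {"col6": bytes_for_bigquery(raw)}  # col6 is a hypothetical BYTES column
line = json.dumps(row)
```

The same conversion would go inside ent_to_json_func for any field whose schema type is BYTES.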
