JSON loading error using Apache Beam on Vertex AI
I am trying to load data from Datastore to BigQuery using Apache Beam in a Vertex AI notebook. This is the part of the code where the loading happens:
import apache_beam as beam
from apache_beam.io.gcp.bigquery_file_loads import BigQueryBatchFileLoads
from apache_beam.io.gcp.datastore.v1new.datastoreio import ReadFromDatastore

table_row = (
    p
    | 'DatastoreGetData' >> ReadFromDatastore(query=myquery)
    | 'EntityConversion' >> beam.Map(ent_to_json_func, table_schema)
    | 'Final' >> BigQueryBatchFileLoads(
        destination=lambda row: "myproject:dataset.mytable",
        custom_gcs_temp_location='gs://myproject/beam',
        write_disposition='WRITE_TRUNCATE',
        schema=table_schema,
    )
)
table_schema is the JSON version of the BigQuery table schema (column mapping picture attached below). ent_to_json_func converts the fields coming from Datastore to the corresponding BigQuery fields in the correct format.
I am trying to load just one row from Datastore, and it gives an error. The data looks like this:
{ " key ": { "namespace": null, "app": null, "path": "Table/12345678", "kind": "Mykind", "name": null, "id": 12345678 }, "col1": false, "col2": { "namespace": null,
"app": null, "path": "abc/12345", "kind": "abc", "name": null, "id": 12345 }, "col3": "6835218432", "col4": {
"namespace": null, "app": null, "path": null, "kind": null, "name": null, "id": null }, "col5": false,
"col6": null, "col7": "https://www.somewebsite.com/poi/",
"col8": "0.00", "col9": "2022-03-12 03:44:17.732193+00:00",
"col10": "{"someid":"NAME","col7":"https://www.somewebsite.com/poi/", "provided":"Yes","someid2":"SDFTYI1090"}", "col11": ""0.00"", "col12": "{}", "col13": [] }
The column mapping is here
The error is as follows:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
~/apache-beam-2.41.0/packages/beam/sdks/python/apache_beam/runners/common.py in process(self, windowed_value)
1416 try:
-> 1417 return self.do_fn_invoker.invoke_process(windowed_value)
1418 except BaseException as exn:
~/apache-beam-2.41.0/packages/beam/sdks/python/apache_beam/runners/common.py in invoke_process(self, windowed_value, restriction, watermark_estimator_state, additional_args, additional_kwargs)
837 self._invoke_process_per_window(
--> 838 windowed_value, additional_args, additional_kwargs)
839 return residuals
~/apache-beam-2.41.0/packages/beam/sdks/python/apache_beam/runners/common.py in _invoke_process_per_window(self, windowed_value, additional_args, additional_kwargs)
982 windowed_value,
--> 983 self.process_method(*args_for_process, **kwargs_for_process),
984 self.threadsafe_watermark_estimator)
~/apache-beam-2.41.0/packages/beam/sdks/python/apache_beam/io/gcp/bigquery_file_loads.py in process(self, element, dest_ids_list)
753 # max_retries to 0.
--> 754 self.bq_wrapper.wait_for_bq_job(ref, sleep_duration_sec=10, max_retries=0)
755
~/apache-beam-2.41.0/packages/beam/sdks/python/apache_beam/io/gcp/bigquery_tools.py in wait_for_bq_job(self, job_reference, sleep_duration_sec, max_retries)
637 'BigQuery job {} failed. Error Result: {}'.format(
--> 638 job_reference.jobId, job.status.errorResult))
639 elif job.status.state == 'DONE':
RuntimeError: BigQuery job beam_bq_job_LOAD_AUTOMATIC_JOB_NAME_LOAD_STEP_187_4cab298bbd73af86496c64ca35602a05_a5309204fb004ae0ba8007ac2169e079 failed.
Error Result: <ErrorProto
location: 'gs://myproject/beam/bq_load/63e94c1a210742aabab09f96/myproject.dataset.mytable/aed960fb-0ae6-489a-9cb8-e012eee0d9c8'
message: 'Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details.
File: gs://myproject/beam/bq_load/63e94c1a210742aabab09f96/myproject.dataset.mytable/aed960fb-0ae6-489a-9cb8-e012eee0d9c8'
reason: 'invalid'>
Please let me know what to do. How can I determine the exact cause of the error? I have also validated this JSON with a JSON validator, and it reports no issue.
I found out that the issue is due to the BYTES column. The value arrives from Datastore as a bytes type, which I convert to a string using decode and save in the JSON. When I upload that JSON into BigQuery, it gives an error.
How should I proceed in this case?
To solve your issue you have to set the STRING type for COL10, COL11, COL12 and COL13 in the BigQuery table.
Your final Dict in the PCollection needs to match the schema of the BigQuery table exactly.
In your JSON I saw these columns as String, so your schema also needs to have the STRING type for these columns.
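One way to keep the dict aligned with STRING columns can be sketched as follows. This is only a minimal illustration, assuming each row is a plain Python dict; coerce_string_columns and the sample row are hypothetical and not part of the original pipeline. Values that are not already strings are serialized with json.dumps, and None is left untouched so that it still loads as NULL.

```python
import json


def coerce_string_columns(row, string_columns):
    """Return a copy of row where values in string_columns are strings.

    Hypothetical helper: non-string, non-None values are serialized to JSON text.
    """
    fixed = dict(row)  # shallow copy so the input row is not modified
    for name in string_columns:
        value = fixed.get(name)
        if value is not None and not isinstance(value, str):
            fixed[name] = json.dumps(value)
    return fixed


# Illustrative row only; the real values come from ent_to_json_func.
row = {"col10": {"someid": "NAME"}, "col11": "0.00", "col12": {}, "col13": []}
fixed = coerce_string_columns(row, ["col10", "col11", "col12", "col13"])
```

A step like this could run at the end of the entity conversion function, before the rows reach the BigQuery load step.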
The error message is accurate. The file that was loaded into BigQuery is not a valid JSON string; see
"{"someid":"NAME","col7":"https://www.somewebsite.com/poi/"
There are quotes that are not escaped.
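To make the escaping point concrete, here is a minimal sketch using only Python's standard json module. It assumes the row is available as a Python dict before it is written out. The sample values and the helper name find_invalid_json_lines are illustrative, not taken from the original pipeline. Serializing with json.dumps escapes the inner quotes, and parsing each line with json.loads shows which lines of a newline-delimited file are not valid JSON.

```python
import json


def find_invalid_json_lines(lines):
    """Return (line_number, error_message) for every line that is not valid JSON."""
    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, str(exc)))
    return problems


# Hypothetical row: col10 holds JSON text that is stored as a plain string.
inner = json.dumps({"someid": "NAME", "provided": "Yes"})
good_line = json.dumps({"col10": inner, "col12": "{}"})

# The same content written by hand, without escaping the inner quotes.
bad_line = '{"col10": "{"someid":"NAME","provided":"Yes"}", "col12": "{}"}'

problems = find_invalid_json_lines([good_line, bad_line])
for number, message in problems:
    print(f"line {number}: {message}")
```

Running the same check over the temporary file that the load job reported (after downloading it from the GCS location in the error) is one way to narrow down which row is malformed.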
Loading into a BYTES column should actually be fine.
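For reference, BigQuery's JSON load format expects values for BYTES columns to be base64-encoded strings, so decoding raw bytes as text generally does not produce a loadable value. A minimal sketch, assuming the Datastore property arrives as a Python bytes object; bytes_to_bq_json is a hypothetical helper name:

```python
import base64


def bytes_to_bq_json(value):
    """Encode raw bytes as base64 text for a BYTES column in a JSON load file.

    Hypothetical helper; None is passed through so it loads as NULL.
    """
    if value is None:
        return None
    return base64.b64encode(value).decode("ascii")


# Arbitrary sample bytes, including values that are not valid UTF-8.
encoded = bytes_to_bq_json(b"\x00\xffraw")
```

Base64 output is plain ASCII, so it contains no quotes or control characters that would need escaping in the JSON file.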