Loading data from GCS to BigQuery using Wildcards and Autodetect
New to posting on StackOverflow.
Using the google.cloud.bigquery Python SDK, I have been trying to work up a solution to load data from GCS to BigQuery without defining a table schema. My LoadJobConfig's autodetect is set to True, and I am using a wildcard (*) in the GCS URI. I have confirmed that autodetect works with wildcards, but the load job fails: the data source I am working with usually gets a specific column autodetected as a float (e.g. 0.30), but sometimes the source adds operator symbols (e.g. < 0.10), so that column needs to be a string.
Can anyone think of a solution without having to define the schema? Here's the LoadJobConfig that I've passed to bigquery.client.Client's load_table_from_uri method:
source_uri = 'gs://%s/%s/%s/*' % (source, report_type, date)
job_config = bigquery.LoadJobConfig()
job_config.create_disposition = 'CREATE_IF_NEEDED'
job_config.skip_leading_rows = 1
job_config.source_format = 'CSV'
job_config.write_disposition = 'WRITE_TRUNCATE'
job_config.autodetect = True
job = bigquery_client.load_table_from_uri(source_uri, table_ref, job_config=job_config)
job.result()
Your data seems to be broken in some part. I would suggest using the --max_bad_records flag, which skips broken records.
For details please look here: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv#bigquery-import-gcs-file-python