BigQuery data load crashing because of bad column names

Question

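I have a bunch of CSV files with bad column names such as "A.B/C", which I copied into a GCP bucket, and I am trying to load them into BQ from the console itself (I can't change the column names at the source). When I create a table by loading the first CSV, BQ renames the column to "A_B_C", which is fine. But when I try to append the second CSV file to the table, it throws the error "Could not parse '2019/08/14' as a date for field A_B_C (position 0) starting at location 77". A_B_C is the first column in the CSV, which is why the error refers to the date. IMO it has nothing to do with the date. The date format in the CSV is YYYY-MM-DD, which complies with BQ requirements. I even changed the schema to make A_B_C a STRING, in case that resolved any issue with the date column, but the result is still the same.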

I also skipped the first row when loading the second CSV so that the column header would not get in the way, but still no luck.

Any suggestions?

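PS - Apparently, loading multiple CSV files at once using *,? always fails, for obvious reasons. I did not mention it in the question to avoid further confusion.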

Answer 1

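You did not share a script, and it seems you are not using any of the client languages. I was able to reproduce your case with a Python script that loads multiple .csv files and appends them to the same table. I used the same column name and date format that you mentioned in the question.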

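To achieve what I described above, I used the blobs from google.cloud.storage, so the script can go through all the .csv files with the same prefix. I also used job_config.autodetect = True and job_config.skip_leading_rows = 1 to autodetect the schema and skip the first row, which is the header. Below is the script: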

from google.cloud import bigquery
from google.cloud import storage

client = bigquery.Client()
dataset_id = 'dataset_name'

client_bucket= storage.Client()
#bucket where the .csv files are stored
bucket = client_bucket.bucket('bucket_name')


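# client.dataset() is deprecated; build the dataset reference explicitly.
dataset_ref = bigquery.DatasetReference(client.project, dataset_id)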
job_config = bigquery.LoadJobConfig()
job_config.autodetect = True
job_config.skip_leading_rows = 1
# The source format defaults to CSV, so the line below is optional.
job_config.source_format = bigquery.SourceFormat.CSV
#append to the table
job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND


# Loop over every .csv file whose name starts with the prefix test_
# and append each one to the table table_output.
for blob in bucket.list_blobs(prefix='test_'):
    file_name = blob.name
    # URI of the .csv file that will be loaded
    uri = "gs://{}/{}".format(bucket.name, file_name)
    load_job = client.load_table_from_uri(
        uri, dataset_ref.table("table_output"), job_config=job_config
    )  # API request

    #checking the uploads 
    print("Starting job {}".format(load_job.job_id))

    load_job.result()  # Waits for table load to complete.
    print("Job finished.")

    # Look up the destination table (table_output), not a table named after the file.
    destination_table = client.get_table(dataset_ref.table("table_output"))
    print("Table now has {} rows.".format(destination_table.num_rows))

As for the files, I used two sample .csv files:

test_1.csv

A.B/C,other_field
2020-06-09,test

test_2.csv

A.B/C,other_field
2020-06-10,test
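The error in the question quotes the value '2019/08/14', which uses slashes rather than the YYYY-MM-DD format of the samples above, so it may be worth checking the second file locally before loading it. The following is only a sketch under that assumption: find_bad_dates is a hypothetical helper, not part of any Google library, and the sample text is made up apart from the value quoted in the error message.

```python
import csv
import io
from datetime import datetime


def find_bad_dates(csv_text, column_index=0, date_format="%Y-%m-%d"):
    """Return (row_number, value) pairs whose value does not parse with date_format."""
    bad_rows = []
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row, like skip_leading_rows = 1
    for row_number, row in enumerate(reader, start=2):
        if not row:
            continue  # ignore blank lines
        value = row[column_index]
        try:
            datetime.strptime(value, date_format)
        except ValueError:
            bad_rows.append((row_number, value))
    return bad_rows


# Hypothetical file content: one well-formed date and one with slashes.
sample = "A.B/C,other_field\n2020-06-09,test\n2019/08/14,test\n"
print(find_bad_dates(sample))  # [(3, '2019/08/14')]
```

If this reports any rows for the file that fails to append, the date values themselves, not the column name, would be the likely cause of the parse error.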

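The script goes through the bucket, finds all the .csv files with the prefix test_, and loads them into the same table. Note that I used the same date format and the same field name, A.B/C, as in your question. After running the script in Cloud Shell, the load succeeded.

In the resulting schema, the field name A.B/C was changed to A_B_C and the field type was autodetected as DATE, without any errors.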
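To predict what a header such as A.B/C will look like after loading, the renaming can be approximated locally. This is a rough sketch that simply replaces every character other than letters, digits, and underscores with an underscore. It is not BigQuery's actual renaming routine, and approximate_bq_column_name is a hypothetical helper name.

```python
import re


def approximate_bq_column_name(raw_name):
    """Replace every character outside [A-Za-z0-9_] with an underscore."""
    return re.sub(r"[^A-Za-z0-9_]", "_", raw_name)


print(approximate_bq_column_name("A.B/C"))        # A_B_C
print(approximate_bq_column_name("other_field"))  # other_field
```

This matches the A_B_C name observed in both the question and my test, but treat it as an approximation for other inputs.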
