BigQuery load data fails with bad column names
I have a bunch of CSV files with bad column names, like "A.B/C", which I copied into a GCP bucket and am trying to load into BQ from the console itself (I cannot change the column names in the source files). When I create the table by loading the first CSV, BQ renames the column to "A_B_C", which is fine, but when I try to append the 2nd CSV file to the table, it throws the error "Could not parse '2019/08/14' as DATE for field A_B_C (position 0) starting at location 77". A_B_C is the first column in the CSV, which is why the error refers to the date. IMO it has nothing to do with the date: the date format in the CSV is YYYY-MM-DD, which is in line with BQ requirements. I even changed the schema to make A_B_C a STRING, so that any issue with the date column would be resolved, but the error stays the same. I also skip the 1st row in the 2nd CSV load so it does not trip over the column headers, but still no luck.

Any suggestions?

PS - obviously, using wildcards (*, ?) to load multiple CSV files at once always fails for the same reason; I did not mention it above to avoid further confusion.
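In client-library terms, the append load I am attempting is roughly equivalent to the following (a sketch assuming the google-cloud-bigquery Python client; the bucket, dataset, table, and second column names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.CSV
job_config.skip_leading_rows = 1  # skip the header row of the 2nd CSV
job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND
# explicit schema, with A_B_C forced to STRING instead of DATE
# ("other_field" is a placeholder for the remaining columns)
job_config.schema = [
    bigquery.SchemaField("A_B_C", "STRING"),
    bigquery.SchemaField("other_field", "STRING"),
]

load_job = client.load_table_from_uri(
    "gs://my_bucket/second_file.csv",  # placeholder URI
    client.dataset("my_dataset").table("my_table"),
    job_config=job_config,
)
load_job.result()  # raises if BQ still reports the DATE parse error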
While you did not share the script you are using, nor the language, I was able to reproduce your case, loading more than one .csv file and appending them to the same table with a Python script. I used the same column name and format type you mentioned in the question.

To achieve this, I used the Blob class from the google.cloud.storage package, so the script could go through all the .csv files with the same prefix. I also set job_config.autodetect = True and job_config.skip_leading_rows = 1 to autodetect the schema and skip the first row, which is the header. Below is the script:
from google.cloud import bigquery
from google.cloud.storage import Blob
from google.cloud import storage

client = bigquery.Client()
dataset_id = 'dataset_name'

client_bucket = storage.Client()
# bucket where the .csv files are stored
bucket = client_bucket.bucket('bucket_name')

dataset_ref = client.dataset(dataset_id)
job_config = bigquery.LoadJobConfig()

job_config.autodetect = True
job_config.skip_leading_rows = 1
# The source format defaults to CSV, so the line below is optional.
job_config.source_format = bigquery.SourceFormat.CSV
# append to the table instead of overwriting it
job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND

# loop over all the .csv files with the same prefix (test_)
# and load them into the destination table table_output
for blob in bucket.list_blobs(prefix='test_'):
    file_name = blob.name
    # uri location of the .csv file which will be loaded
    uri = "gs://bucket_name/" + file_name

    load_job = client.load_table_from_uri(
        uri, dataset_ref.table("table_output"), job_config=job_config)  # API request

    # checking the uploads
    print("Starting job {}".format(load_job.job_id))
    load_job.result()  # Waits for the table load to complete.
    print("Job finished.")

    destination_table = client.get_table(dataset_ref.table("table_output"))
    print("Loaded {} rows.".format(destination_table.num_rows))
As for the data, I used two sample .csv files:

test_1.csv

A.B/C,other_field
2020-06-09,test

test_2.csv

A.B/C,other_field
2020-06-10,test
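In case you want to stage those two files in the bucket from the same script, something along these lines should work (reusing the bucket object from the script above; the local file paths are assumptions):

for local_file in ("test_1.csv", "test_2.csv"):
    # upload each local sample file under the same name,
    # so that list_blobs(prefix='test_') can find it
    bucket.blob(local_file).upload_from_filename(local_file)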
Therefore, the script goes through the bucket, finds all the .csv files with the prefix test_, and uploads them to the same table. Notice that I used the same format for the date field as well as the field name A.B/C. After running the script within Cloud Shell, the load was successful:

[screenshot: load job output]

and the schema:

[screenshot: resulting table schema]

As shown above, the field name A.B/C was changed to A_B_C and the field type was automatically detected as DATE, without any errors.
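If you want to double-check the detected schema without the console UI, a small follow-up like this should do (reusing client and dataset_ref from the script above):

# print the autodetected schema of the destination table
table = client.get_table(dataset_ref.table("table_output"))
for field in table.schema:
    print(field.name, field.field_type)
# expected, per the run above:
# A_B_C DATE
# other_field STRING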