Load csv.gz file from Google Storage to BigQuery using Python
I want to load a csv.gz file from Cloud Storage into BigQuery. Right now I am using the code below, but I am not sure it is an efficient way to load the data into BigQuery.
# -*- coding: utf-8 -*-
from io import BytesIO
from datetime import datetime
import pandas as pd
from google.cloud import storage
import pandas_gbq as gbq

# service_account is the path to the service-account JSON key (defined elsewhere)
client = storage.Client.from_service_account_json(service_account)
bucket = client.get_bucket("bucketname")
blob = bucket.blob("somefile.csv.gz")
content = blob.download_as_string()
df = pd.read_csv(BytesIO(content), delimiter=',', quotechar='"', low_memory=False)
df = df.astype(str)
# strip literal pipe characters from column names
df.columns = df.columns.str.replace("|", "", regex=False)
df["dateinsert"] = datetime.now()  # pd.datetime is deprecated
gbq.to_gbq(df, 'desttable',
           'projectid',
           chunksize=None,
           if_exists='append')
Please assist me in writing this code in an efficient way.
I propose this process:
Add the skip-leading-rows option to skip the header:

job_config.skip_leading_rows = 1

Name your tables like this:

<dataset>.<tableBaseName>_<Datetime>

The datetime must be in a string format compliant with BigQuery table names, for example YYYYMMDDHHMM.
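The steps above can be sketched as a BigQuery load job that ingests the gzipped CSV straight from Cloud Storage, skipping pandas entirely. This is a sketch, not your exact setup: the bucket, project, dataset, and table base name below are placeholders, and `build_table_id` is a helper introduced here for illustration.

```python
from datetime import datetime

def build_table_id(project, dataset, base_name, when=None):
    # BigQuery table names allow letters, digits and underscores,
    # so format the datetime suffix as YYYYMMDDHHMM.
    when = when or datetime.utcnow()
    return f"{project}.{dataset}.{base_name}_{when:%Y%m%d%H%M}"

def load_gz_csv(uri, table_id):
    # Imported here so the helper above stays usable without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # skip the header row
        autodetect=True,      # let BigQuery infer the schema
    )
    # BigQuery decompresses .gz CSV files from GCS automatically.
    load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    load_job.result()  # wait for the load to finish

# Example with placeholder names:
# table_id = build_table_id("projectid", "dataset", "desttable")
# load_gz_csv("gs://bucketname/somefile.csv.gz", table_id)
```

This avoids downloading the file and round-tripping it through a DataFrame; BigQuery reads the object directly from the bucket.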
When you query your data, you can query a subset of the tables and inject the table name into the query result, like this:
SELECT *,(SELECT table_id
FROM `<project>.<dataset>.__TABLES_SUMMARY__`
WHERE table_id LIKE '<tableBaseName>%') FROM `<project>.<dataset>.<tableBaseName>*`
Of course, you can refine the * with the year, month, day, ...
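For example, assuming the tables are suffixed with YYYYMMDDHHMM as above, restricting the wildcard to January 2020 could look like this (a sketch; project, dataset, and table base name are placeholders):

```sql
SELECT *
FROM `<project>.<dataset>.<tableBaseName>*`
WHERE _TABLE_SUFFIX LIKE '202001%'  -- only tables loaded in January 2020
```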
I think this meets all your requirements. Comment if something goes wrong.