Load timestamp from DataFrame to BigQuery dataset
I have a timestamp field loaded_at in my BigQuery table result_data, along with its epoch equivalent loaded_at_epoch. I'm using Python to regularly get new data from an external source, add these two fields to the dataframe, and load the dataframe into my BigQuery table.
loaded_at = datetime.utcnow()
loaded_at_epoch = int((loaded_at - datetime(1970, 1, 1)).total_seconds())
df['loaded_at'] = pd.Series(loaded_at, index=df.index)
df['loaded_at_epoch'] = pd.Series(loaded_at_epoch, index=df.index)
job_config = bigquery.LoadJobConfig()
job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND
job_config.schema_update_options = [bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION]
bq_client.load_table_from_dataframe(df, result_data, location="US", job_config=job_config,)
It used to work, but for the past couple of weeks loaded_at has had wrong values such as 1970-01-19 03:32:09.693 UTC, while loaded_at_epoch has correct timestamp values. It looks like the timestamps are somehow in seconds but are interpreted as milliseconds when loaded from the dataframe.
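To illustrate the arithmetic behind that suspicion, here is a minimal sketch. The epoch value 1567929693 is an assumption: it is back-computed from the wrong timestamp quoted above, not a value taken from the actual data. Reading it as seconds gives a date in September 2019, while reading the same number as milliseconds reproduces the 1970-01-19 03:32:09.693 UTC value. This only shows the unit mix-up itself; it does not confirm where in the loading path it happens.

```python
import pandas as pd

# Assumed sample value: epoch seconds back-computed from the wrong
# timestamp quoted in the question (not a value given by the asker).
loaded_at_epoch = 1567929693

# Interpreting the number as seconds gives the intended instant.
as_seconds = pd.Timestamp(loaded_at_epoch, unit="s", tz="UTC")

# Interpreting the same number as milliseconds lands in January 1970.
as_millis = pd.Timestamp(loaded_at_epoch, unit="ms", tz="UTC")

print(as_seconds)
print(as_millis)
```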
I'm not sure how to make this work. I've tried making loaded_at a string, but then I get an error:
google.api_core.exceptions.BadRequest: 400 Provided Schema does not match Table <project_id>:<dataset_id>.result_data. Field loaded_at has changed type from TIMESTAMP to STRING
I also tried adding job_config.autodetect = False to the job configuration, but that doesn't solve the issue either.
Any idea how I can get the date to load correctly every time?
Thanks!
Can you try hardcoding loaded_at for a sample run into a dummy table and see what happens? Your code looks fine, so I am sure it has something to do with the pandas-based loading.
Alternatively, if you want to avoid loading data into BigQuery through pandas, you can use the bq CLI to do the job for you:
import subprocess
from datetime import datetime
import pandas as pd
# Compose your df in this block
# df = ...
loaded_at = datetime.utcnow()
loaded_at_epoch = int((loaded_at - datetime(1970, 1, 1)).total_seconds())
df['loaded_at'] = pd.Series(loaded_at, index=df.index)
df['loaded_at_epoch'] = pd.Series(loaded_at_epoch, index=df.index)
# Write the file locally, without a header row or index column
df.to_csv('temp-data.csv', sep=',', index=False, header=False)
# Load via the bq CLI; replace col:type,col:type... with your table's schema
cmd = '''bq --location=US load yourdataset.yourtable temp-data.csv col:type,col:type...'''
subprocess.call(cmd, shell=True)
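As a side note, the command can also be built as an argument list instead of a shell string, which avoids quoting problems with shell=True. The sketch below is only an illustration: build_bq_load_command is a hypothetical helper, and the two-column schema string assumes a table containing only loaded_at and loaded_at_epoch, so adjust it to your real columns. It prints the command and leaves the actual call commented out.

```python
import shlex


def build_bq_load_command(table, csv_path, schema, location="US"):
    """Return the bq CLI argument list for loading a headerless CSV file.

    Hypothetical helper for illustration; it only assembles arguments.
    """
    return [
        "bq",
        f"--location={location}",
        "load",
        "--source_format=CSV",
        table,
        csv_path,
        schema,
    ]


# Assumed schema: a table holding just the two columns from the question.
cmd = build_bq_load_command(
    "yourdataset.yourtable",
    "temp-data.csv",
    "loaded_at:TIMESTAMP,loaded_at_epoch:INTEGER",
)
print(shlex.join(cmd))

# To actually run it (requires the bq CLI to be installed and authenticated):
# import subprocess
# subprocess.run(cmd, check=True)
```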
Thanks Khan, this actually helped me figure out how to fix it. I first tried a hardcoded timestamp in string format but got the same issue. Then I tried a hardcoded pandas Timestamp and it worked. The following code now works.
df['loaded_at'] = pd.Series(pd.Timestamp(loaded_at_epoch, unit='s', tz='UTC'), index=df.index)
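For context, here is a small self-contained sketch of how that line fits with the rest of the column setup. The helper name add_load_timestamps and its now parameter are hypothetical, added only so the example can be run and checked without BigQuery; the loaded_at line itself is the one from the fix above. The sketch also uses datetime.now(timezone.utc) instead of datetime.utcnow(), which is my substitution rather than part of the original code.

```python
from datetime import datetime, timezone

import pandas as pd


def add_load_timestamps(df, now=None):
    """Add loaded_at (tz-aware UTC Timestamp) and loaded_at_epoch (int seconds).

    Hypothetical helper; `now` exists only to make the example reproducible.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    loaded_at_epoch = int(now.timestamp())
    out = df.copy()
    # Build the timestamp column from the epoch value with an explicit unit
    # and timezone, as in the fix above.
    out["loaded_at"] = pd.Series(
        pd.Timestamp(loaded_at_epoch, unit="s", tz="UTC"), index=out.index
    )
    out["loaded_at_epoch"] = pd.Series(loaded_at_epoch, index=out.index)
    return out


# Usage with made-up data and a fixed clock value.
df = pd.DataFrame({"value": [1, 2, 3]})
fixed_now = datetime(2019, 9, 8, 8, 1, 33, tzinfo=timezone.utc)
result = add_load_timestamps(df, now=fixed_now)
print(result.dtypes)
```

The resulting loaded_at column is timezone-aware, so the unit and timezone are stated explicitly instead of being left for the loader to infer.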