
Load timestamp from Dataframe to BigQuery dataset

I have a timestamp field loaded_at in my BigQuery table result_data and its epoch equivalent loaded_at_epoch. I'm using Python to regularly fetch new data from an external source, add these two fields to the dataframe, and load the dataframe into my BigQuery table.

    from datetime import datetime

    import pandas as pd
    from google.cloud import bigquery

    loaded_at = datetime.utcnow()
    loaded_at_epoch = int((loaded_at - datetime(1970, 1, 1)).total_seconds())
    df['loaded_at'] = pd.Series(loaded_at, index=df.index)
    df['loaded_at_epoch'] = pd.Series(loaded_at_epoch, index=df.index)

    job_config = bigquery.LoadJobConfig()
    job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND
    job_config.schema_update_options = [bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION]

    bq_client.load_table_from_dataframe(df, result_data, location="US", job_config=job_config)

It used to work, but for the past couple of weeks loaded_at has had wrong values such as 1970-01-19 03:32:09.693 UTC, while loaded_at_epoch has correct timestamp values. It looks like the timestamps are in seconds but are interpreted as milliseconds when loaded from the dataframe.
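As a sanity check (not from the original thread), you can reproduce that exact symptom in pandas by feeding an epoch value that is in seconds to a constructor that expects milliseconds; the sample epoch below is an arbitrary assumption:

```python
import pandas as pd

epoch_seconds = 1_577_836_800  # 2020-01-01 00:00:00 UTC, a sample value

# Interpreted with the correct unit, seconds:
print(pd.Timestamp(epoch_seconds, unit='s', tz='UTC'))   # 2020-01-01 00:00:00+00:00

# Misinterpreted as milliseconds, the date collapses into January 1970:
print(pd.Timestamp(epoch_seconds, unit='ms', tz='UTC'))  # 1970-01-19 06:17:16.800000+00:00
```

Any recent epoch value read with a unit that is off by a factor of 1000 lands in mid-January 1970, which matches the dates seen in the table.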

I'm not sure how to make this work. I've tried passing loaded_at as a string, but then I get an error: google.api_core.exceptions.BadRequest: 400 Provided Schema does not match Table <project_id>:<dataset_id>.result_data. Field loaded_at has changed type from TIMESTAMP to STRING

I also tried adding job_config.autodetect = False to the job configuration, but that doesn't solve the issue either.
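If autodetection keeps mis-typing the column, one option worth trying (a sketch, not from the original answers) is to pin the timestamp fields explicitly in the load job's schema, so the pandas dtype cannot drive the type inference; the remaining columns can still be detected from the dataframe:

```python
from google.cloud import bigquery

job_config = bigquery.LoadJobConfig()
job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND
# Declare the two timestamp-related fields explicitly; field names
# mirror the question's table.
job_config.schema = [
    bigquery.SchemaField("loaded_at", "TIMESTAMP"),
    bigquery.SchemaField("loaded_at_epoch", "INTEGER"),
]
```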

Any idea how I can get the date to always load correctly?

Thanks!

Can you try hardcoding loaded_at for a sample run into a dummy table and see what happens? Your code looks fine, so I suspect it is something to do with pandas-based loading.

Alternatively, if you want to avoid pandas loading data into BigQuery, you can use the bq CLI to do the job for you:

import subprocess
from datetime import datetime

import pandas as pd

#--compose your df in this block
# df = ...

loaded_at = datetime.utcnow()
loaded_at_epoch = int((loaded_at - datetime(1970, 1, 1)).total_seconds())
df['loaded_at'] = pd.Series(loaded_at, index=df.index)
df['loaded_at_epoch'] = pd.Series(loaded_at_epoch, index=df.index)

#--write the file locally
df.to_csv('temp-data.csv', sep=',', index=False, header=False)

#--load via the bq CLI
cmd = '''bq --location=US load yourdataset.yourtable temp-data.csv col:type,col:type...'''
subprocess.call(cmd, shell=True)

Thanks Khan, this actually helped me figure out how to fix it. I first tried a hardcoded timestamp in string format but got the same issue. Then I tried a hardcoded pandas Timestamp and it worked. The following code now works:

df['loaded_at'] = pd.Series(pd.Timestamp(loaded_at_epoch, unit='s', tz='UTC'), index=df.index)
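Equivalently (a sketch under the assumption that the rest of the load pipeline is unchanged), any assignment that leaves the column with a timezone-aware datetime64[ns, UTC] dtype should load cleanly, since the unit is then unambiguous to the BigQuery client:

```python
import pandas as pd

df = pd.DataFrame({'value': [1, 2, 3]})  # stand-in for the real dataframe

# A timezone-aware pandas Timestamp gives the column a proper
# datetime64[ns, UTC] dtype instead of a plain object/integer column.
df['loaded_at'] = pd.Timestamp.now(tz='UTC')

print(df.dtypes['loaded_at'])  # datetime64[ns, UTC]
```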

