
Load timestamp from Dataframe to BigQuery dataset

I have a timestamp field loaded_at in my BigQuery table result_data and its epoch equivalent loaded_at_epoch. I'm using Python to regularly fetch new data from an external source, add these two fields to the dataframe, and load the dataframe into my BigQuery table.

    from datetime import datetime

    import pandas as pd
    from google.cloud import bigquery

    loaded_at = datetime.utcnow()
    loaded_at_epoch = int((loaded_at - datetime(1970, 1, 1)).total_seconds())
    df['loaded_at'] = pd.Series(loaded_at, index=df.index)
    df['loaded_at_epoch'] = pd.Series(loaded_at_epoch, index=df.index)

    job_config = bigquery.LoadJobConfig()
    job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND
    job_config.schema_update_options = [bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION]

    bq_client.load_table_from_dataframe(df, result_data, location="US", job_config=job_config)

It used to work, but for the past couple of weeks loaded_at has had wrong values such as 1970-01-19 03:32:09.693 UTC, while loaded_at_epoch has correct timestamp values. It looks like the timestamps are in seconds but are interpreted as milliseconds when loaded from the dataframe.
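As a sanity check (not from the original thread), you can reproduce that exact symptom in pandas by feeding an epoch value that is in seconds to a constructor that expects milliseconds; the sample epoch below is an arbitrary assumption:

```python
import pandas as pd

epoch_seconds = 1_577_836_800  # 2020-01-01 00:00:00 UTC, a sample value

# Interpreted with the correct unit, seconds:
print(pd.Timestamp(epoch_seconds, unit='s', tz='UTC'))   # 2020-01-01 00:00:00+00:00

# Misinterpreted as milliseconds, the date collapses into January 1970:
print(pd.Timestamp(epoch_seconds, unit='ms', tz='UTC'))  # 1970-01-19 06:17:16.800000+00:00
```

Any recent epoch value read with a unit that is off by a factor of 1000 lands in mid-January 1970, which matches the dates seen in the table.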

I'm not sure how to make this work. I've tried passing loaded_at as a string, but then I get an error: google.api_core.exceptions.BadRequest: 400 Provided Schema does not match Table <project_id>:<dataset_id>.result_data. Field loaded_at has changed type from TIMESTAMP to STRING

I also tried adding job_config.autodetect = False to the job configuration, but that doesn't solve the issue either.
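If autodetection keeps mis-typing the column, one option worth trying (a sketch, not from the original answers) is to pin the timestamp fields explicitly in the load job's schema, so the pandas dtype cannot drive the type inference; the remaining columns can still be detected from the dataframe:

```python
from google.cloud import bigquery

job_config = bigquery.LoadJobConfig()
job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND
# Declare the two timestamp-related fields explicitly; field names
# mirror the question's table.
job_config.schema = [
    bigquery.SchemaField("loaded_at", "TIMESTAMP"),
    bigquery.SchemaField("loaded_at_epoch", "INTEGER"),
]
```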

Any idea how I can get the date to always load correctly?

Thanks!

Can you try hardcoding loaded_at for a sample run into a dummy table and see what happens? Your code looks fine, so I suspect it is something to do with pandas-based loading.

Alternatively, if you want to avoid pandas loading data into BigQuery, you can use the bq CLI to do the job for you:

import subprocess
from datetime import datetime

import pandas as pd

#--compose your df in this block
# df = ...

loaded_at = datetime.utcnow()
loaded_at_epoch = int((loaded_at - datetime(1970, 1, 1)).total_seconds())
df['loaded_at'] = pd.Series(loaded_at, index=df.index)
df['loaded_at_epoch'] = pd.Series(loaded_at_epoch, index=df.index)

#--write the file locally
df.to_csv('temp-data.csv', sep=',', index=False, header=False)

#--load via the bq CLI
cmd = '''bq --location=US load yourdataset.yourtable temp-data.csv col:type,col:type...'''
subprocess.call(cmd, shell=True)

Thanks Khan, this actually helped me figure out how to fix it. I first tried a hardcoded timestamp in string format but got the same issue. Then I tried a hardcoded pandas Timestamp and it worked. The following code now works:

df['loaded_at'] = pd.Series(pd.Timestamp(loaded_at_epoch, unit='s', tz='UTC'), index=df.index)
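Equivalently (a sketch under the assumption that the rest of the load pipeline is unchanged), any assignment that leaves the column with a timezone-aware datetime64[ns, UTC] dtype should load cleanly, since the unit is then unambiguous to the BigQuery client:

```python
import pandas as pd

df = pd.DataFrame({'value': [1, 2, 3]})  # stand-in for the real dataframe

# A timezone-aware pandas Timestamp gives the column a proper
# datetime64[ns, UTC] dtype instead of a plain object/integer column.
df['loaded_at'] = pd.Timestamp.now(tz='UTC')

print(df.dtypes['loaded_at'])  # datetime64[ns, UTC]
```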

