How to insert rows into a specified partition of an ingestion-time partitioned BigQuery table in Python
How can I insert rows into a specific partition of an ingestion-time partitioned table from Python?
I have found that the following is possible when inserting with SQL:
https://cloud.google.com/bigquery/docs/using-dml-with-partitioned-tables
but I don't know how to do the same thing in Python. I am thinking of using client.load_table_from_dataframe from the google-cloud-bigquery module:
https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.client.Client.html#google.cloud.bigquery.client.Client.load_table_from_dataframe
I found the following sample, but when I use the column name _PARTITIONTIME I get the error below:
https://cloud.google.com/bigquery/docs/samples/bigquery-load-table-partitioned#bigquery_load_table_partitioned-python
google.api_core.exceptions.BadRequest: 400 POST https://bigquery.googleapis.com/upload/bigquery/v2/projects/aaa/jobs?uploadType=multipart: Invalid field name "_PARTITIONTIME". Field names are not allowed to start with the (case-insensitive) prefixes _PARTITION, _TABLE_, _FILE_, _ROW_TIMESTAMP, __ROOT__ and _COLIDENTIFIER
CREATE TABLE IF NOT EXISTS `aaa.bbb.ccc` (
  c1 INTEGER,
  c2 STRING
)
PARTITION BY _PARTITIONDATE;
INSERT INTO `aaa.bbb.ccc` (c1, c2, _PARTITIONTIME) VALUES (99, "zz", TIMESTAMP("2000-01-02"));
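One way to run that same DML from Python (a sketch, not from the question; it assumes the table above exists and default credentials are configured) is to submit the statement through client.query instead of a load job, since DML may reference the _PARTITIONTIME pseudocolumn:

```python
def build_partition_insert_sql(table: str) -> str:
    # Parameterized DML targeting the _PARTITIONTIME pseudocolumn.
    return (
        f"INSERT INTO `{table}` (c1, c2, _PARTITIONTIME) "
        "VALUES (@c1, @c2, TIMESTAMP(@ts))"
    )


def insert_into_partition(project: str, table: str, c1: int, c2: str, ts: str):
    # Deferred import so the helper above stays usable without the package installed.
    from google.cloud import bigquery

    client = bigquery.Client(project=project)
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("c1", "INT64", c1),
            bigquery.ScalarQueryParameter("c2", "STRING", c2),
            bigquery.ScalarQueryParameter("ts", "STRING", ts),
        ]
    )
    return client.query(build_partition_insert_sql(table), job_config=job_config).result()
```

For example, insert_into_partition("aaa", "aaa.bbb.ccc", 99, "zz", "2000-01-02 00:00:00") would issue the INSERT shown above.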
import pandas as pd
from datetime import datetime

from google.cloud import bigquery
from google.cloud.bigquery.enums import SqlTypeNames
from google.cloud.bigquery.job import WriteDisposition

client = bigquery.Client(project="aaa")

job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("c1", SqlTypeNames.INTEGER),
        bigquery.SchemaField("c2", SqlTypeNames.STRING),
        bigquery.SchemaField("_PARTITIONTIME", SqlTypeNames.TIMESTAMP),
    ],
    write_disposition=WriteDisposition.WRITE_APPEND,
    time_partitioning=bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY,
        field="_PARTITIONTIME",  # Name of the column to use for partitioning.
        expiration_ms=7776000000,  # 90 days.
    ),
)

df = pd.DataFrame(
    [
        [1, "a", datetime.strptime("2100-11-12", "%Y-%m-%d")],
        [2, "b", datetime.strptime("2101-12-13", "%Y-%m-%d")],
    ],
    columns=["c1", "c2", "_PARTITIONTIME"],
)

job = client.load_table_from_dataframe(df, "aaa.bbb.ccc", job_config=job_config)  # raises the 400 error above
result = job.result()
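For completeness, load jobs can also target a single partition of an ingestion-time partitioned table directly via a partition decorator (table$YYYYMMDD), in which case the DataFrame carries only the real columns and no pseudocolumn is needed. This is a sketch under the same assumptions as above (existing table, configured credentials); the helper names are mine, not from the question:

```python
from datetime import date


def partition_decorator(table_id: str, day: date) -> str:
    # "aaa.bbb.ccc" + date(2000, 1, 2) -> "aaa.bbb.ccc$20000102"
    return f"{table_id}${day:%Y%m%d}"


def load_into_partition(df, table_id: str, day: date, project: str = "aaa"):
    # Deferred import so partition_decorator works without the package installed.
    from google.cloud import bigquery

    client = bigquery.Client(project=project)
    job_config = bigquery.LoadJobConfig(
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    # The decorator routes every row of df into the given day's partition.
    job = client.load_table_from_dataframe(
        df, partition_decorator(table_id, day), job_config=job_config
    )
    return job.result()
```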
I have also asked this question here: https://ja.stackoverflow.com/questions/90760
You can simply rename _PARTITIONTIME to something else, since _PARTITION is one of the reserved (case-insensitive) prefixes. The code below worked:
import pandas as pd
from datetime import datetime

from google.cloud import bigquery
from google.cloud.bigquery.enums import SqlTypeNames
from google.cloud.bigquery.job import WriteDisposition

client = bigquery.Client(project="<your-project>")

job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("c1", SqlTypeNames.INTEGER),
        bigquery.SchemaField("c2", SqlTypeNames.STRING),
        bigquery.SchemaField("_P1", SqlTypeNames.TIMESTAMP),
    ],
    write_disposition=WriteDisposition.WRITE_APPEND,
    time_partitioning=bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY,
        field="_P1",  # Name of the column to use for partitioning.
        expiration_ms=7776000000,  # 90 days.
    ),
)

df = pd.DataFrame(
    [
        [1, "a", datetime.strptime("2100-11-12", "%Y-%m-%d")],
        [2, "b", datetime.strptime("2101-12-13", "%Y-%m-%d")],
    ],
    columns=["c1", "c2", "_P1"],
)

job = client.load_table_from_dataframe(df, "<your-project>.<your-dataset>.ccc", job_config=job_config)
result = job.result()
Output: (screenshot of the loaded rows omitted)
As for the INSERT query you want to run:
INSERT INTO `<your-project>.<your-dataset>.ccc` (c1, c2, _P1) VALUES (99, "zz", TIMESTAMP("2000-01-02"));
This is not possible, as explained in this SO post answered by a Googler. Since expiration_ms sets the partition expiration to 90 days, only partition dates within 90 days before the current day (the day the Python script is executed) are valid; anything older than that has already expired. This query will work:
INSERT INTO `<your-project>.<your-dataset>.ccc` (c1, c2, _P1) VALUES (99, "zz", TIMESTAMP("2022-06-01"));
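To make the 90-day window concrete, here is a small sketch of the arithmetic (the helper name is mine, for illustration): with expiration_ms=7776000000, a partition whose date falls more than 90 days before today expires immediately, which is why the 2000-01-02 INSERT fails while a recent date succeeds:

```python
from datetime import date, timedelta

EXPIRATION_MS = 7776000000  # the value used in the job config above; exactly 90 days


def partition_is_live(partition_day: date, today: date, expiration_ms: int = EXPIRATION_MS) -> bool:
    # A partition survives only while its date is newer than today minus the expiration window.
    window = timedelta(milliseconds=expiration_ms)
    return partition_day > today - window


# With today = 2022-06-10: 2000-01-02 is far outside the 90-day window (expired),
# while 2022-06-01 is inside it, so only the second INSERT can land.
```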