
How to insert into a specific partition of a BigQuery ingestion-time partitioned table in Python

Summary

How can I specify the target partition when loading data into an ingestion-time partitioned table from Python?

What we tried

I have found that this is possible when inserting with SQL DML: https://cloud.google.com/bigquery/docs/using-dml-with-partitioned-tables

However, I don't know how to express this in Python. I am thinking of using client.load_table_from_dataframe from the google-cloud-bigquery module: https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.client.Client.html#google.cloud.bigquery.client.Client.load_table_from_dataframe

I found the following sample, but when I use the column name _PARTITIONTIME I get the error below: https://cloud.google.com/bigquery/docs/samples/bigquery-load-table-partitioned#bigquery_load_table_partitioned-python

google.api_core.exceptions.BadRequest: 400 POST https://bigquery.googleapis.com/upload/bigquery/v2/projects/aaa/jobs?uploadType=multipart: Invalid field name "_PARTITIONTIME". Field names are not allowed to start with the (case-insensitive) prefixes _PARTITION, _TABLE_, _FILE_, _ROW_TIMESTAMP, __ROOT__ and _COLIDENTIFIER

Execution environment

  • python: 3.8.10
  • google-cloud-bigquery: 3.2.0
  • pandas: 1.4.3
  • About authentication
    • If a partition is not specified, data can be inserted without problems, so authentication itself does not appear to be the issue (see the sketch below).
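For reference, a minimal sketch of that working baseline, assuming the same project/table names as in this question: the DataFrame has no partition column, so BigQuery assigns the rows to the current day's ingestion-time partition.

import pandas as pd
from google.cloud import bigquery
from google.cloud.bigquery.enums import SqlTypeNames
from google.cloud.bigquery.job import WriteDisposition

client = bigquery.Client(project="aaa")
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("c1", SqlTypeNames.INTEGER),
        bigquery.SchemaField("c2", SqlTypeNames.STRING),
    ],
    write_disposition=WriteDisposition.WRITE_APPEND,
)
df = pd.DataFrame([[1, "a"], [2, "b"]], columns=["c1", "c2"])
# Succeeds: no partition is specified, so the rows land in today's partition.
client.load_table_from_dataframe(df, "aaa.bbb.ccc", job_config=job_config).result()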

Table

CREATE TABLE IF NOT EXISTS `aaa.bbb.ccc`(
  c1 INTEGER,
  c2 STRING
)
PARTITION BY _PARTITIONDATE;
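For completeness, a hedged Python sketch that would create the same ingestion-time partitioned table (equivalent to the DDL above); leaving out the field argument of TimePartitioning is what makes the table partitioned on _PARTITIONTIME/_PARTITIONDATE rather than on a regular column.

from google.cloud import bigquery
from google.cloud.bigquery.enums import SqlTypeNames

client = bigquery.Client(project="aaa")
table = bigquery.Table(
    "aaa.bbb.ccc",
    schema=[
        bigquery.SchemaField("c1", SqlTypeNames.INTEGER),
        bigquery.SchemaField("c2", SqlTypeNames.STRING),
    ],
)
# No "field" argument: daily ingestion-time partitioning (_PARTITIONTIME / _PARTITIONDATE).
table.time_partitioning = bigquery.TimePartitioning(type_=bigquery.TimePartitioningType.DAY)
client.create_table(table, exists_ok=True)  # behaves like CREATE TABLE IF NOT EXISTS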

What I want to do

SQL

INSERT INTO `aaa.bbb.ccc` (c1, c2, _PARTITIONTIME) VALUES (99, "zz", TIMESTAMP("2000-01-02"));

Python (the code I tried)

import pandas as pd
from google.cloud import bigquery
from google.cloud.bigquery.enums import SqlTypeNames
from google.cloud.bigquery.job import WriteDisposition
from datetime import datetime

client = bigquery.Client(project="aaa")
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("c1", SqlTypeNames.INTEGER),
        bigquery.SchemaField("c2", SqlTypeNames.STRING),
        bigquery.SchemaField("_PARTITIONTIME", SqlTypeNames.TIMESTAMP),
    ],
    write_disposition=WriteDisposition.WRITE_APPEND,
    time_partitioning=bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY,
        field="_PARTITIONTIME",  # Name of the column to use for partitioning.
        expiration_ms=7776000000,  # 90 days.
    ),
)
df = pd.DataFrame(
    [
        [1, "a", datetime.strptime("2100-11-12", "%Y-%m-%d")],
        [2, "b", datetime.strptime("2101-12-13", "%Y-%m-%d")],
    ],
    columns=["c1", "c2", "_PARTITIONTIME"],
)
job = client.load_table_from_dataframe(df, "aaa.bbb.ccc", job_config=job_config)  # raises BadRequest: Invalid field name "_PARTITIONTIME"
result = job.result()

Cross-post

I have also asked this question (in Japanese) at: https://ja.stackoverflow.com/questions/90760

Answer

You can simply change the column name _PARTITIONTIME to another name, since _PARTITION is one of the reserved (case-insensitive) prefixes. The code below worked:

import pandas as pd
from google.cloud import bigquery
from google.cloud.bigquery.enums import SqlTypeNames
from google.cloud.bigquery.job import WriteDisposition
from datetime import datetime

client = bigquery.Client(project="<your-project>")
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("c1", SqlTypeNames.INTEGER),
        bigquery.SchemaField("c2", SqlTypeNames.STRING),
        bigquery.SchemaField("_P1", SqlTypeNames.TIMESTAMP),
    ],
    write_disposition=WriteDisposition.WRITE_APPEND,
    time_partitioning=bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY,
        field="_P1",  # Name of the column to use for partitioning.
        expiration_ms=7776000000,  # 90 days.
    ),
)
df = pd.DataFrame(
    [
        [1, "a", datetime.strptime("2100-11-12", "%Y-%m-%d")],
        [2, "b", datetime.strptime("2101-12-13", "%Y-%m-%d")],
    ],
    columns=["c1", "c2", "_P1"],
)
job = client.load_table_from_dataframe(df, "<your-project>.<your-dataset>.ccc", job_config=job_config)
result = job.result()

Output: [screenshot of the loaded rows omitted]
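As a hedged way to check where the rows actually landed (instead of the screenshot above), you could query the dataset's INFORMATION_SCHEMA.PARTITIONS view; whether that view is enabled and visible to your account is an assumption here.

# Hedged verification sketch: list which partitions of table "ccc" now contain rows.
from google.cloud import bigquery

client = bigquery.Client(project="<your-project>")
sql = """
    SELECT partition_id, total_rows
    FROM `<your-project>.<your-dataset>.INFORMATION_SCHEMA.PARTITIONS`
    WHERE table_name = 'ccc'
"""
for row in client.query(sql).result():
    print(row.partition_id, row.total_rows)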

As for the query you want to run:

INSERT INTO `<your-project>.<your-dataset>.ccc` (c1, c2, _P1) VALUES (99, "zz", TIMESTAMP("2000-01-02"));

This is not possible, as explained in this SO post answered by a Googler. Because expiration_ms sets a 90-day partition expiration, only dates within the 90 days before the current day (the day the Python script is executed) are valid; anything older than that has already expired. This query will work:

INSERT INTO `<your-project>.<your-dataset>.ccc` (c1, c2, _P1) VALUES (99, "zz", TIMESTAMP("2022-06-01"));

Output: [screenshot of the query result omitted]
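A hedged sketch of building that INSERT from Python with a partition date that is guaranteed to fall inside the 90-day expiration window (the one-day offset is only an example):

# Hedged sketch: pick a date within the last 90 days so the target partition is not expired.
from datetime import datetime, timedelta, timezone
from google.cloud import bigquery

client = bigquery.Client(project="<your-project>")
partition_day = (datetime.now(timezone.utc) - timedelta(days=1)).strftime("%Y-%m-%d")
sql = f"""
    INSERT INTO `<your-project>.<your-dataset>.ccc` (c1, c2, _P1)
    VALUES (99, "zz", TIMESTAMP("{partition_day}"))
"""
client.query(sql).result()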
