
有关使用 Big Query insert_rows_from_dataframe 的问题

[英]Issues regarding insert_rows_from_dataframe using Big Query

I want to insert a dataframe into a table in GCP; say the name of the table is table_id. I want to use the following:

insert_rows_from_dataframe(table: Union[google.cloud.bigquery.table.Table, google.cloud.bigquery.table.TableReference, str], dataframe, selected_fields: Optional[Sequence[google.cloud.bigquery.schema.SchemaField]] = None, chunk_size: int = 500, **kwargs: Dict) → Sequence[Sequence[dict]][source]

I got it from the documentation: https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.client.Client.html#google.cloud.bigquery.client.Client.insert_rows_from_dataframe

I am getting errors, probably because I'm not calling it the proper way. The errors relate to the "schema" name, and a schema name is given for the table_id I am using. Kindly provide me a sample example using insert_rows_from_dataframe, particularly for:

selected_fields: Optional[Sequence[google.cloud.bigquery.schema.SchemaField]] = None, chunk_size: int = 500, **kwargs: Dict

The documentation you cited explains which parameters are optional, what types they take, and so on. This documentation is auto-generated with Sphinx; I recommend you try to understand how it works, as it will be useful.

If you click on the source code of the method, you can see what it actually expects:

def insert_rows_from_dataframe(
    self,
    table: Union[Table, TableReference, str],
    dataframe,
    selected_fields: Sequence[SchemaField] = None,
    chunk_size: int = 500,
    **kwargs: Dict,
) -> Sequence[Sequence[dict]]:
    """Insert rows into a table from a dataframe via the streaming API.

    Args:
        table (Union[ \
            google.cloud.bigquery.table.Table, \
            google.cloud.bigquery.table.TableReference, \
            str, \
        ]):
            The destination table for the row data, or a reference to it.
        dataframe (pandas.DataFrame):
            A :class:`~pandas.DataFrame` containing the data to load. Any
            ``NaN`` values present in the dataframe are omitted from the
            streaming API request(s).
        selected_fields (Sequence[google.cloud.bigquery.schema.SchemaField]):
            The fields to return. Required if ``table`` is a
            :class:`~google.cloud.bigquery.table.TableReference`.
        chunk_size (int):
            The number of rows to stream in a single chunk. Must be positive.
        kwargs (Dict):
            Keyword arguments to
            :meth:`~google.cloud.bigquery.client.Client.insert_rows_json`.

    Returns:
        Sequence[Sequence[Mappings]]:
            A list with insert errors for each insert chunk. Each element
            is a list containing one mapping per row with insert errors:
            the "index" key identifies the row, and the "errors" key
            contains a list of the mappings describing one or more problems
            with the row.

    Raises:
        ValueError: if table's schema is not set
    """
    insert_results = []

    chunk_count = int(math.ceil(len(dataframe) / chunk_size))
    rows_iter = _pandas_helpers.dataframe_to_json_generator(dataframe)

    for _ in range(chunk_count):
        rows_chunk = itertools.islice(rows_iter, chunk_size)
        result = self.insert_rows(table, rows_chunk, selected_fields, **kwargs)
        insert_results.append(result)

    return insert_results
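The chunking in the source above works because `itertools.islice` repeatedly consumes the *same* iterator, so each pass of the loop picks up where the previous one left off. A minimal stdlib-only sketch of that pattern, with plain dicts standing in for the JSON rows (the data here is made up for illustration):

```python
import itertools
import math

def chunk_rows(rows, chunk_size=500):
    """Split rows into lists of at most chunk_size items, mirroring
    the islice loop used by insert_rows_from_dataframe."""
    rows_iter = iter(rows)
    chunk_count = math.ceil(len(rows) / chunk_size)
    chunks = []
    for _ in range(chunk_count):
        # islice advances the shared iterator, so each call
        # yields the next chunk_size rows, not the same ones.
        chunks.append(list(itertools.islice(rows_iter, chunk_size)))
    return chunks

rows = [{"id": i} for i in range(7)]
chunks = chunk_rows(rows, chunk_size=3)
print([len(c) for c in chunks])  # chunk sizes: [3, 3, 1]
```

This is why the real method streams a large dataframe as several `insert_rows` calls instead of one oversized request.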

So an example of using this method would be:

from google.cloud import bigquery
bq_client = bigquery.Client()
table = bq_client.get_table("{}.{}.{}".format(PROJECT, DATASET, TABLE))
dataframe = yourDataFrame

bq_client.insert_rows_from_dataframe(table, dataframe)
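Per the docstring above, the call returns one list of error mappings per chunk, so it is worth inspecting the result rather than discarding it. A hedged sketch of that check, using a hand-built sample shaped like the documented return value (no BigQuery client needed to illustrate the shape):

```python
def report_insert_errors(insert_results):
    """Flatten the per-chunk error lists returned by
    insert_rows_from_dataframe into (row_index, errors) pairs."""
    problems = []
    for chunk_errors in insert_results:
        for entry in chunk_errors:
            # Each entry has an "index" key identifying the row and
            # an "errors" key listing what went wrong with it.
            problems.append((entry["index"], entry["errors"]))
    return problems

# Sample result: chunk 0 succeeded (empty list),
# chunk 1 had one row rejected.
sample = [[], [{"index": 2, "errors": [{"reason": "invalid"}]}]]
print(report_insert_errors(sample))  # [(2, [{'reason': 'invalid'}])]
```

An empty list from `report_insert_errors` means every row in every chunk was accepted by the streaming API.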

In order to pass the schema to the insert_rows_from_dataframe() function, you need to pass it in the selected_fields parameter. As you mentioned, it is of type Sequence[google.cloud.bigquery.schema.SchemaField].

So first you have to import this class:

from google.cloud.bigquery.schema import SchemaField

Then, for the actual call:

client.insert_rows_from_dataframe(table=table_id, dataframe=df, selected_fields=schema)

Where:

  • table_id is <dataset_name>.<table_name>
  • df is the dataframe with columns (and datatypes) matching those of the table
  • schema is a list of SchemaField objects

An example of a schema:

[SchemaField(name="birth_date", field_type="DATE", mode="REQUIRED"),
 SchemaField(name="user_name", field_type="STRING", mode="REQUIRED"),
 SchemaField(name="score", field_type="INTEGER", mode="NULLABLE"),
 ...]
