
Issues regarding insert_rows_from_dataframe using Big Query

I want to insert a dataframe into a table in GCP; say the name of the table is table_id. I want to use the following method:

insert_rows_from_dataframe(table: Union[google.cloud.bigquery.table.Table, google.cloud.bigquery.table.TableReference, str], dataframe, selected_fields: Optional[Sequence[google.cloud.bigquery.schema.SchemaField]] = None, chunk_size: int = 500, **kwargs: Dict) → Sequence[Sequence[dict]]

I got it from the documentation: https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.client.Client.html#google.cloud.bigquery.client.Client.insert_rows_from_dataframe

I am getting errors, probably because I am not calling it properly. The errors mention the "schema" name, but the schema is already set on table_id, which I am using. Could you kindly provide a sample example of using insert_rows_from_dataframe, particularly these parameters:

selected_fields: Optional[Sequence[google.cloud.bigquery.schema.SchemaField]] = None, chunk_size: int = 500, **kwargs: Dict

The documentation that you linked explains the parameter types and which parameters are optional. This documentation is auto-generated with Sphinx; I recommend learning how to read it, as it will be useful.

If you click on the source code of the method you can see what the method is actually expecting:

def insert_rows_from_dataframe(
    self,
    table: Union[Table, TableReference, str],
    dataframe,
    selected_fields: Sequence[SchemaField] = None,
    chunk_size: int = 500,
    **kwargs: Dict,
) -> Sequence[Sequence[dict]]:
    """Insert rows into a table from a dataframe via the streaming API.

    Args:
        table (Union[ \
            google.cloud.bigquery.table.Table, \
            google.cloud.bigquery.table.TableReference, \
            str, \
        ]):
            The destination table for the row data, or a reference to it.
        dataframe (pandas.DataFrame):
            A :class:`~pandas.DataFrame` containing the data to load. Any
            ``NaN`` values present in the dataframe are omitted from the
            streaming API request(s).
        selected_fields (Sequence[google.cloud.bigquery.schema.SchemaField]):
            The fields to return. Required if ``table`` is a
            :class:`~google.cloud.bigquery.table.TableReference`.
        chunk_size (int):
            The number of rows to stream in a single chunk. Must be positive.
        kwargs (Dict):
            Keyword arguments to
            :meth:`~google.cloud.bigquery.client.Client.insert_rows_json`.

    Returns:
        Sequence[Sequence[Mappings]]:
            A list with insert errors for each insert chunk. Each element
            is a list containing one mapping per row with insert errors:
            the "index" key identifies the row, and the "errors" key
            contains a list of the mappings describing one or more problems
            with the row.

    Raises:
        ValueError: if table's schema is not set
    """
    insert_results = []

    chunk_count = int(math.ceil(len(dataframe) / chunk_size))
    rows_iter = _pandas_helpers.dataframe_to_json_generator(dataframe)

    for _ in range(chunk_count):
        rows_chunk = itertools.islice(rows_iter, chunk_size)
        result = self.insert_rows(table, rows_chunk, selected_fields, **kwargs)
        insert_results.append(result)

    return insert_results
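The chunking logic in the method body (math.ceil plus itertools.islice) can be sketched standalone. The helper name chunk_rows below is hypothetical, for illustration only:

```python
import itertools
import math

def chunk_rows(rows, chunk_size=500):
    """Split a sequence of row dicts into lists of at most chunk_size rows,
    using the same math.ceil / itertools.islice pattern as the method above."""
    rows_iter = iter(rows)
    chunk_count = math.ceil(len(rows) / chunk_size)
    return [list(itertools.islice(rows_iter, chunk_size)) for _ in range(chunk_count)]

# A dataframe with 5 rows and chunk_size=2 yields chunks of 2, 2 and 1 rows.
chunks = chunk_rows([{"id": i} for i in range(5)], chunk_size=2)
print([len(c) for c in chunks])  # [2, 2, 1]
```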

So an example of using this method would be:

from google.cloud import bigquery
bq_client = bigquery.Client()
table = bq_client.get_table("{}.{}.{}".format(PROJECT, DATASET, TABLE))
dataframe = yourDataFrame

bq_client.insert_rows_from_dataframe(table, dataframe)
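As the docstring above notes, the method returns one list of error mappings per chunk; empty lists mean all rows were inserted. The return shape can be inspected without a BigQuery call. The helper name flatten_insert_errors below is hypothetical, for illustration only:

```python
def flatten_insert_errors(insert_results):
    """Collapse the per-chunk error lists returned by
    insert_rows_from_dataframe into one flat list of row-error mappings."""
    return [row_error for chunk in insert_results for row_error in chunk]

# Simulated return value: two chunks, one failed row in the second chunk.
results = [[], [{"index": 3, "errors": [{"reason": "invalid"}]}]]
for err in flatten_insert_errors(results):
    print("row", err["index"], "failed:", err["errors"])
```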

In order to pass the schema to the insert_rows_from_dataframe() function, you need to pass it in the selected_fields parameter. As you mentioned, it is of type Sequence[google.cloud.bigquery.schema.SchemaField].

So first you have to import this class:

from google.cloud.bigquery.schema import SchemaField

Then, for the actual call:

client.insert_rows_from_dataframe(table=table_id, dataframe=df, selected_fields=schema)

Where

  • table_id is <dataset_name>.<table_name>
  • df is the dataframe with columns (and datatypes) matching those of the table
  • schema is a list of SchemaField objects

An example for a schema:

[SchemaField(name="birth_date", field_type="DATE", mode="REQUIRED"),
 SchemaField(name="user_name", field_type="STRING", mode="REQUIRED"),
 SchemaField(name="score", field_type="INTEGER", mode="NULLABLE"),
 ...]
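Since the dataframe's column datatypes must match the table's field types, it can help to derive the field types from the dataframe itself. The helper below is hypothetical (not part of the BigQuery client), and its dtype-to-BigQuery-type mapping is a simplified assumption:

```python
# Simplified, assumed mapping from pandas dtype strings to BigQuery types.
_DTYPE_TO_BQ = {
    "int64": "INTEGER",
    "float64": "FLOAT",
    "bool": "BOOLEAN",
    "datetime64[ns]": "TIMESTAMP",
    "object": "STRING",
}

def bq_field_types(dtypes):
    """dtypes: mapping of column name -> dtype string
    (e.g. built from df.dtypes.astype(str).to_dict()).
    Returns (name, field_type) pairs to build SchemaField objects from."""
    return [(name, _DTYPE_TO_BQ.get(str(dtype), "STRING")) for name, dtype in dtypes.items()]

print(bq_field_types({"user_name": "object", "score": "int64"}))
# [('user_name', 'STRING'), ('score', 'INTEGER')]
```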
