
Issues regarding insert_rows_from_dataframe using Big Query

I want to insert a dataframe into a table in GCP; say the name of the table is table_id. I want to use the following method:

insert_rows_from_dataframe(table: Union[google.cloud.bigquery.table.Table, google.cloud.bigquery.table.TableReference, str], dataframe, selected_fields: Optional[Sequence[google.cloud.bigquery.schema.SchemaField]] = None, chunk_size: int = 500, **kwargs: Dict) → Sequence[Sequence[dict]]

I got it from the documentation: https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.client.Client.html#google.cloud.bigquery.client.Client.insert_rows_from_dataframe

I am getting errors, probably because I am not calling it properly. The errors mention the "schema" name, but the schema is already set on table_id, which I am using. Could you kindly provide a sample example of using insert_rows_from_dataframe, particularly these parameters:

selected_fields: Optional[Sequence[google.cloud.bigquery.schema.SchemaField]] = None, chunk_size: int = 500, **kwargs: Dict

The documentation that you linked explains the parameter types and which parameters are optional. This documentation is auto-generated with Sphinx; I recommend learning how to read it, as it will be useful.

If you click on the source code of the method you can see what the method is actually expecting:

def insert_rows_from_dataframe(
    self,
    table: Union[Table, TableReference, str],
    dataframe,
    selected_fields: Sequence[SchemaField] = None,
    chunk_size: int = 500,
    **kwargs: Dict,
) -> Sequence[Sequence[dict]]:
    """Insert rows into a table from a dataframe via the streaming API.

    Args:
        table (Union[ \
            google.cloud.bigquery.table.Table, \
            google.cloud.bigquery.table.TableReference, \
            str, \
        ]):
            The destination table for the row data, or a reference to it.
        dataframe (pandas.DataFrame):
            A :class:`~pandas.DataFrame` containing the data to load. Any
            ``NaN`` values present in the dataframe are omitted from the
            streaming API request(s).
        selected_fields (Sequence[google.cloud.bigquery.schema.SchemaField]):
            The fields to return. Required if ``table`` is a
            :class:`~google.cloud.bigquery.table.TableReference`.
        chunk_size (int):
            The number of rows to stream in a single chunk. Must be positive.
        kwargs (Dict):
            Keyword arguments to
            :meth:`~google.cloud.bigquery.client.Client.insert_rows_json`.

    Returns:
        Sequence[Sequence[Mappings]]:
            A list with insert errors for each insert chunk. Each element
            is a list containing one mapping per row with insert errors:
            the "index" key identifies the row, and the "errors" key
            contains a list of the mappings describing one or more problems
            with the row.

    Raises:
        ValueError: if table's schema is not set
    """
    insert_results = []

    chunk_count = int(math.ceil(len(dataframe) / chunk_size))
    rows_iter = _pandas_helpers.dataframe_to_json_generator(dataframe)

    for _ in range(chunk_count):
        rows_chunk = itertools.islice(rows_iter, chunk_size)
        result = self.insert_rows(table, rows_chunk, selected_fields, **kwargs)
        insert_results.append(result)

    return insert_results
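The chunking logic in the method body (math.ceil plus itertools.islice) can be sketched standalone. The helper name chunk_rows below is hypothetical, for illustration only:

```python
import itertools
import math

def chunk_rows(rows, chunk_size=500):
    """Split a sequence of row dicts into lists of at most chunk_size rows,
    using the same math.ceil / itertools.islice pattern as the method above."""
    rows_iter = iter(rows)
    chunk_count = math.ceil(len(rows) / chunk_size)
    return [list(itertools.islice(rows_iter, chunk_size)) for _ in range(chunk_count)]

# A dataframe with 5 rows and chunk_size=2 yields chunks of 2, 2 and 1 rows.
chunks = chunk_rows([{"id": i} for i in range(5)], chunk_size=2)
print([len(c) for c in chunks])  # [2, 2, 1]
```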

So an example of using this method would be:

from google.cloud import bigquery
bq_client = bigquery.Client()
table = bq_client.get_table("{}.{}.{}".format(PROJECT, DATASET, TABLE))
dataframe = yourDataFrame

bq_client.insert_rows_from_dataframe(table, dataframe)
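As the docstring above notes, the method returns one list of error mappings per chunk; empty lists mean all rows were inserted. The return shape can be inspected without a BigQuery call. The helper name flatten_insert_errors below is hypothetical, for illustration only:

```python
def flatten_insert_errors(insert_results):
    """Collapse the per-chunk error lists returned by
    insert_rows_from_dataframe into one flat list of row-error mappings."""
    return [row_error for chunk in insert_results for row_error in chunk]

# Simulated return value: two chunks, one failed row in the second chunk.
results = [[], [{"index": 3, "errors": [{"reason": "invalid"}]}]]
for err in flatten_insert_errors(results):
    print("row", err["index"], "failed:", err["errors"])
```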

In order to pass the schema to the insert_rows_from_dataframe() function, you need to pass it in the selected_fields parameter. As you mentioned, it is of type Sequence[google.cloud.bigquery.schema.SchemaField].

So first you have to import this class:

from google.cloud.bigquery.schema import SchemaField

Then, for the actual call:

client.insert_rows_from_dataframe(table=table_id, dataframe=df, selected_fields=schema)

Where

  • table_id is <dataset_name>.<table_name>
  • df is the dataframe with columns (and datatypes) matching those of the table
  • schema is a list of SchemaField objects

An example for a schema:

[SchemaField(name="birth_date", field_type="DATE", mode="REQUIRED"),
 SchemaField(name="user_name", field_type="STRING", mode="REQUIRED"),
 SchemaField(name="score", field_type="INTEGER", mode="NULLABLE"),
 ...]
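Since the dataframe's column datatypes must match the table's field types, it can help to derive the field types from the dataframe itself. The helper below is hypothetical (not part of the BigQuery client), and its dtype-to-BigQuery-type mapping is a simplified assumption:

```python
# Simplified, assumed mapping from pandas dtype strings to BigQuery types.
_DTYPE_TO_BQ = {
    "int64": "INTEGER",
    "float64": "FLOAT",
    "bool": "BOOLEAN",
    "datetime64[ns]": "TIMESTAMP",
    "object": "STRING",
}

def bq_field_types(dtypes):
    """dtypes: mapping of column name -> dtype string
    (e.g. built from df.dtypes.astype(str).to_dict()).
    Returns (name, field_type) pairs to build SchemaField objects from."""
    return [(name, _DTYPE_TO_BQ.get(str(dtype), "STRING")) for name, dtype in dtypes.items()]

print(bq_field_types({"user_name": "object", "score": "int64"}))
# [('user_name', 'STRING'), ('score', 'INTEGER')]
```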
