400 Error while reading data using pandas to_gbq to create a table in Google BigQuery

I'm trying to query data from a MySQL server and write it to Google BigQuery using the pandas to_gbq API.

import pandas as pd

# con is an existing connection to the production MySQL server
def production_to_gbq(table_name_prod, prefix, table_name_gbq, dataset, project):
    # Extract data from production
    q = """
        SELECT *
        FROM
            {}
        """.format(table_name_prod)

    df = pd.read_sql(q, con)

    # Write to GBQ
    df.to_gbq(dataset + table_name_gbq, project, chunksize=1000, verbose=True,
              reauth=False, if_exists='replace', private_key=None)

    return df

I keep getting a 400 error indicating invalid input.

Load is 100.0% Complete
---------------------------------------------------------------------------
BadRequest                                Traceback (most recent call last)
/usr/local/lib/python3.6/site-packages/pandas_gbq/gbq.py in load_data(self, dataframe, dataset_id, table_id, chunksize, schema)
    569                     self.client, dataframe, dataset_id, table_id,
--> 570                     chunksize=chunksize):
    571                 self._print("\rLoad is {0}% Complete".format(

/usr/local/lib/python3.6/site-packages/pandas_gbq/_load.py in load_chunks(client, dataframe, dataset_id, table_id, chunksize, schema)
     73             destination_table,
---> 74             job_config=job_config).result()

/usr/local/lib/python3.6/site-packages/google/cloud/bigquery/job.py in result(self, timeout)
    527         # TODO: modify PollingFuture so it can pass a retry argument to done().
--> 528         return super(_AsyncJob, self).result(timeout=timeout)
    529 

/usr/local/lib/python3.6/site-packages/google/api_core/future/polling.py in result(self, timeout)
    110             # Pylint doesn't recognize that this is valid in this case.
--> 111             raise self._exception
    112 

BadRequest: 400 Error while reading data, error message: CSV table encountered too many errors, giving up. Rows: 10; errors: 1. Please look into the error stream for more details.

During handling of the above exception, another exception occurred:

GenericGBQException                       Traceback (most recent call last)
<ipython-input-73-ef9c7cec0104> in <module>()
----> 1 departments.to_gbq(dataset + table_name_gbq, project, chunksize=1000, verbose=True, reauth=False, if_exists='replace', private_key=None)
      2 

/usr/local/lib/python3.6/site-packages/pandas/core/frame.py in to_gbq(self, destination_table, project_id, chunksize, verbose, reauth, if_exists, private_key)
   1058         return gbq.to_gbq(self, destination_table, project_id=project_id,
   1059                           chunksize=chunksize, verbose=verbose, reauth=reauth,
-> 1060                           if_exists=if_exists, private_key=private_key)
   1061 
   1062     @classmethod

/usr/local/lib/python3.6/site-packages/pandas/io/gbq.py in to_gbq(dataframe, destination_table, project_id, chunksize, verbose, reauth, if_exists, private_key)
    107                       chunksize=chunksize,
    108                       verbose=verbose, reauth=reauth,
--> 109                       if_exists=if_exists, private_key=private_key)

/usr/local/lib/python3.6/site-packages/pandas_gbq/gbq.py in to_gbq(dataframe, destination_table, project_id, chunksize, verbose, reauth, if_exists, private_key, auth_local_webserver, table_schema)
    980     connector.load_data(
    981         dataframe, dataset_id, table_id, chunksize=chunksize,
--> 982         schema=table_schema)
    983 
    984 

/usr/local/lib/python3.6/site-packages/pandas_gbq/gbq.py in load_data(self, dataframe, dataset_id, table_id, chunksize, schema)
    572                     ((total_rows - remaining_rows) * 100) / total_rows))
    573         except self.http_error as ex:
--> 574             self.process_http_error(ex)
    575 
    576         self._print("\n")

/usr/local/lib/python3.6/site-packages/pandas_gbq/gbq.py in process_http_error(ex)
    453         # <https://cloud.google.com/bigquery/troubleshooting-errors>`__
    454 
--> 455         raise GenericGBQException("Reason: {0}".format(ex))
    456 
    457     def run_query(self, query, **kwargs):

GenericGBQException: Reason: 400 Error while reading data, error message: CSV table encountered too many errors, giving up. Rows: 10; errors: 1. Please look into the error stream for more details.

I've investigated the table schema,

id  INTEGER NULLABLE    
name    STRING  NULLABLE    
description STRING  NULLABLE    
created_at  INTEGER NULLABLE    
modified_at FLOAT   NULLABLE    

and it matches the dataframe's dtypes:

id                        int64
name                     object
description              object
created_at                int64
modified_at             float64

The table is created in GBQ but remains empty.

I read around a little but am not finding much on the pandas to_gbq API, except for this question, which seemed relevant but has no reply:

bigquery table is empty when using pandas to_gbq

I found one potential solution about numbers in object-dtype columns being passed into the GBQ table without quotes, which was fixed by setting the column datatypes to string:

I use to_gbq on pandas for updating Google BigQuery and get GenericGBQException

I tried the fix:

for col in df.columns:
    if df[col].dtypes == object:
        df[col] = df[col].fillna('')
        df[col] = df[col].astype(str)

Unfortunately I still get the same error. Similarly, formatting the missing data and setting dtypes for the int and float columns also gives the same error.

Is there something I'm missing?
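
For reference, the row-level messages behind the "error stream" hint can be surfaced by running one load directly with the google-cloud-bigquery client and printing job.errors once it fails; a rough sketch (assuming a reasonably recent google-cloud-bigquery with pyarrow installed, and the same df / dataset / table_name_gbq / project names as above):

from google.cloud import bigquery

client = bigquery.Client(project=project)

# run the same load with the client library so the failed job object stays accessible
job = client.load_table_from_dataframe(df, dataset + table_name_gbq)
try:
    job.result()  # wait for the load job to finish
except Exception:
    # job.errors holds the per-row messages the 400 response refers to
    for err in job.errors or []:
        print(err)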

Found that BigQuery cannot properly handle \r (sometimes \n too). Had the same issue, localized the problem, and I was really surprised that just replacing \r with a space fixed it:

for col in list(df.columns):
    # in Python 3 there is no separate unicode type, so checking str is enough
    df[col] = df[col].apply(lambda x: x.replace(u'\r', u' ') if isinstance(x, str) else x)
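
Since the same failure can reportedly be triggered by \n as well, a slightly extended variant of the same fix (Python 3, same dataframe) is:

for col in df.columns:
    # strip carriage returns and newlines from string cells only; other types pass through untouched
    df[col] = df[col].apply(lambda x: x.replace('\r', ' ').replace('\n', ' ') if isinstance(x, str) else x)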

I have ended up here several times when I've had a similar issue importing to BigQuery from Parquet files on Cloud Storage. However, each time I've forgotten the way to solve it, so I hope it's not too much of a breach of protocol to leave my findings here!

What I realised was that I have columns that are all NULL, which look like they have a data type in pandas, but if you use pyarrow.parquet.read_schema(parquet_file), you will see that the data type is null.

After removing those columns, the upload works!
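
A quick way to spot those all-NULL columns before uploading is to read only the Parquet schema; a minimal sketch (the file path is illustrative):

import pyarrow as pa
import pyarrow.parquet as pq

# read_schema looks at the file metadata only; all-NULL columns show up with type "null"
schema = pq.read_schema("table.parquet")
null_cols = [field.name for field in schema if pa.types.is_null(field.type)]

# drop those columns from the dataframe before loading it into BigQuery
df = df.drop(columns=null_cols)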

I had some invalid characters in string columns (object dtype in pandas). I used @Echochi's approach and it worked:

import re

for col in list(parsed_data.select_dtypes(include='object').columns):
    parsed_data[col] = parsed_data[col].apply(lambda x: re.sub('[^A-Za-z0-9]+', '', str(x)))

That was a little too restrictive with the accepted characters, so I used a more general approach, given BigQuery's compatibility with UTF-8 (see the BigQuery docs):

for col in list(parsed_data.select_dtypes(include='object').columns):
    parsed_data[col] = parsed_data[col].apply(lambda x: re.sub(r"[^\u0900-\u097F]+", '?', str(x)))

with r"[^\u0900-\u097F]+" you will accept all the UTF-8 compatible charset

I had a similar issue because of an unwanted column 'Unnamed 0' present in the dataset. I removed it and the problem was solved. Check the shape and head to see whether any null or unwanted columns exist in the dataset.
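
A minimal sketch of that check and cleanup (assuming the leftover column is the usual pandas 'Unnamed: 0' index artifact):

print(df.shape)
print(df.head())

# drop leftover index columns such as 'Unnamed: 0' and any columns that are entirely null
df = df.loc[:, ~df.columns.str.startswith('Unnamed')]
df = df.dropna(axis=1, how='all')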
