400 Error while reading data using pandas to_gbq to create a table in Google BigQuery
I am trying to query data from a MySQL server and write it to Google BigQuery using the pandas.to_gbq API:
import pandas as pd

def production_to_gbq(table_name_prod, prefix, table_name_gbq, dataset, project):
    # Extract data from production (con is an existing MySQL connection)
    q = """
        SELECT *
        FROM
            {}
        """.format(table_name_prod)
    df = pd.read_sql(q, con)
    # Write to GBQ
    df.to_gbq(dataset + table_name_gbq, project, chunksize=1000, verbose=True,
              reauth=False, if_exists='replace', private_key=None)
    return df
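For context, a minimal sketch of how this function could be wired up; the connection string and all names here are hypothetical, not from the original post:

from sqlalchemy import create_engine

# Hypothetical MySQL connection consumed by pd.read_sql inside the function
con = create_engine("mysql+pymysql://user:password@host:3306/production_db")

# dataset ends with a dot so that dataset + table_name_gbq forms "dataset.table"
df = production_to_gbq("departments", "prod_", "departments",
                       "my_dataset.", "my-project")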
I keep getting a 400 error indicating invalid input.
Load is 100.0% Complete
---------------------------------------------------------------------------
BadRequest Traceback (most recent call last)
/usr/local/lib/python3.6/site-packages/pandas_gbq/gbq.py in load_data(self, dataframe, dataset_id, table_id, chunksize, schema)
569 self.client, dataframe, dataset_id, table_id,
--> 570 chunksize=chunksize):
571 self._print("\rLoad is {0}% Complete".format(
/usr/local/lib/python3.6/site-packages/pandas_gbq/_load.py in load_chunks(client, dataframe, dataset_id, table_id, chunksize, schema)
73 destination_table,
---> 74 job_config=job_config).result()
/usr/local/lib/python3.6/site-packages/google/cloud/bigquery/job.py in result(self, timeout)
527 # TODO: modify PollingFuture so it can pass a retry argument to done().
--> 528 return super(_AsyncJob, self).result(timeout=timeout)
529
/usr/local/lib/python3.6/site-packages/google/api_core/future/polling.py in result(self, timeout)
110 # Pylint doesn't recognize that this is valid in this case.
--> 111 raise self._exception
112
BadRequest: 400 Error while reading data, error message: CSV table encountered too many errors, giving up. Rows: 10; errors: 1. Please look into the error stream for more details.
During handling of the above exception, another exception occurred:
GenericGBQException Traceback (most recent call last)
<ipython-input-73-ef9c7cec0104> in <module>()
----> 1 departments.to_gbq(dataset + table_name_gbq, project, chunksize=1000, verbose=True, reauth=False, if_exists='replace', private_key=None)
2
/usr/local/lib/python3.6/site-packages/pandas/core/frame.py in to_gbq(self, destination_table, project_id, chunksize, verbose, reauth, if_exists, private_key)
1058 return gbq.to_gbq(self, destination_table, project_id=project_id,
1059 chunksize=chunksize, verbose=verbose, reauth=reauth,
-> 1060 if_exists=if_exists, private_key=private_key)
1061
1062 @classmethod
/usr/local/lib/python3.6/site-packages/pandas/io/gbq.py in to_gbq(dataframe, destination_table, project_id, chunksize, verbose, reauth, if_exists, private_key)
107 chunksize=chunksize,
108 verbose=verbose, reauth=reauth,
--> 109 if_exists=if_exists, private_key=private_key)
/usr/local/lib/python3.6/site-packages/pandas_gbq/gbq.py in to_gbq(dataframe, destination_table, project_id, chunksize, verbose, reauth, if_exists, private_key, auth_local_webserver, table_schema)
980 connector.load_data(
981 dataframe, dataset_id, table_id, chunksize=chunksize,
--> 982 schema=table_schema)
983
984
/usr/local/lib/python3.6/site-packages/pandas_gbq/gbq.py in load_data(self, dataframe, dataset_id, table_id, chunksize, schema)
572 ((total_rows - remaining_rows) * 100) / total_rows))
573 except self.http_error as ex:
--> 574 self.process_http_error(ex)
575
576 self._print("\n")
/usr/local/lib/python3.6/site-packages/pandas_gbq/gbq.py in process_http_error(ex)
453 # <https://cloud.google.com/bigquery/troubleshooting-errors>`__
454
--> 455 raise GenericGBQException("Reason: {0}".format(ex))
456
457 def run_query(self, query, **kwargs):
GenericGBQException: Reason: 400 Error while reading data, error message: CSV table encountered too many errors, giving up. Rows: 10; errors: 1. Please look into the error stream for more details.
I looked into the table schema:
id INTEGER NULLABLE
name STRING NULLABLE
description STRING NULLABLE
created_at INTEGER NULLABLE
modified_at FLOAT NULLABLE
It is identical to the dataframe's dtypes:
id int64
name object
description object
created_at int64
modified_at float64
The table is created in GBQ but remains empty.
I read around a bit but didn't find much on the pandas.to_gbq API, apart from this question, which seemed relevant but has no replies:
bigquery table is empty when using pandas to_gbq
I found one potential solution concerning numbers in object-dtype columns being passed into the GBQ table without quotes, which is fixed by setting the column dtype to string:
using to_gbq on pandas for updating Google BigQuery and getting GenericGBQException
I tried the fix:
for col in df.columns:
    if df[col].dtype == object:
        df[col] = df[col].fillna('')
        df[col] = df[col].astype(str)
Unfortunately I still get the same error. Likewise, trying to format the missing data and setting the dtypes for the int and float columns produces the same error.
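For reference, the pandas_gbq.to_gbq signature visible in the traceback above also accepts a table_schema argument, so the BigQuery field types can be pinned explicitly. A minimal sketch mirroring the schema listed above (this is an assumption, not part of the original attempt):

import pandas_gbq

schema = [
    {'name': 'id',          'type': 'INTEGER'},
    {'name': 'name',        'type': 'STRING'},
    {'name': 'description', 'type': 'STRING'},
    {'name': 'created_at',  'type': 'INTEGER'},
    {'name': 'modified_at', 'type': 'FLOAT'},
]

# Call pandas_gbq directly, since this pandas version's DataFrame.to_gbq
# wrapper does not pass table_schema through
pandas_gbq.to_gbq(df, dataset + table_name_gbq, project,
                  chunksize=1000, if_exists='replace', table_schema=schema)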
Is there something I am missing?
Had the same problem. I localized the issue and finally found out that BigQuery could not properly handle \r\n (and sometimes \n\n). I was really surprised that replacing \r with a space fixed it:
for col in list(df.columns):
    # In Python 3 all text is str; the original also checked the Python 2 unicode type
    df[col] = df[col].apply(lambda x: x.replace(u'\r', u' ') if isinstance(x, str) else x)
I've hit a similar problem several times when importing parquet files from Cloud Storage into BigQuery. Each time, though, I forget how I solved it, so I hope it's not too much of a breach of protocol to leave my findings here!
I realized I had columns that were entirely NULL. They look like they have a datatype in pandas, but if you use pyarrow.parquet.read_schema(parquet_file) you will see that the datatype is null.
After removing those columns the upload works!
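A minimal sketch of that schema check (the parquet path is hypothetical):

import pyarrow.parquet as pq
import pyarrow.types as pat

# Read only the schema, without loading the data itself
schema = pq.read_schema("data.parquet")  # hypothetical path

# Columns whose Arrow type is null carry no real type information
null_cols = [field.name for field in schema if pat.is_null(field.type)]
print(null_cols)

# Drop them from the dataframe before uploading
df = df.drop(columns=null_cols)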
I had some invalid characters in my string columns (pandas object dtype). I used @Echochi's approach and it worked:
import re

for col in list(parsed_data.select_dtypes(include='object').columns):
    parsed_data[col] = parsed_data[col].apply(lambda x: re.sub('[^A-Za-z0-9]+', '', str(x)))
That is a bit restrictive about which characters it accepts, so I used a more general approach, given BigQuery's compatibility with UTF-8 (see the BigQuery docs):
for col in list(parsed_data.select_dtypes(include='object').columns):
    parsed_data[col] = parsed_data[col].apply(lambda x: re.sub(r"[^\u0900-\u097F]+", '?', str(x)))
使用r"[^\ऀ-\ॿ]+"
您将接受所有兼容UTF-8
字符集
I faced a similar problem because of an unwanted "Unnamed: 0" column in the dataset. I removed it and the problem was solved. Try checking the shape and head of the dataframe for any null or unwanted columns, as in the sketch below.
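A minimal sketch of that inspection (the source file name is hypothetical):

import pandas as pd

df = pd.read_csv("departments.csv")  # hypothetical source file

# Look for stray columns, e.g. an index written back as "Unnamed: 0"
print(df.shape)
print(df.head())

# Drop it if present; errors="ignore" makes this a no-op otherwise
df = df.drop(columns=["Unnamed: 0"], errors="ignore")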