
Which characters are allowed in a BigQuery STRING column (getting "UDF out of memory" error)

I have a dataframe containing receipt data. The column text in my dataframe contains the text from the receipt and seems to be the issue when I try to upload the data to BigQuery using df.to_gbq(...), since it produces the error

GenericGBQException: Reason: 400 Resources exceeded during query execution: UDF out of memory.; Failed to read Parquet file /some/file.
This might happen if the file contains a row that is too large,
or if the total size of the pages loaded for the queried columns is too large.

According to the error message it seems to be a memory error, but I have tried converting all characters in each text to an "a" (to see if the strings contained too many characters), and that worked fine, i.e. I doubt that is the problem.
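
For reference, a minimal sketch of that length-only test might look like this (df and the test table name here are assumptions, not part of the original code):

import pandas as pd  # df below is assumed to be an existing pandas DataFrame

# Keep each string's length but replace every character with "a",
# so any unusual characters are removed while sizes stay the same.
df_test = df.copy()
df_test["text"] = df_test["text"].fillna("").map(lambda s: "a" * len(s))
df_test.to_gbq(destination_table="Dataset.my_table_test",
               project_id="my-project",
               if_exists="replace")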

I have tried converting all characters to UTF-8 with

df["text"] = df["text"].str.encode('utf-8') (since according to the docs they should be so) but that failed. df["text"] = df["text"].str.encode('utf-8') (因为根据文档他们应该是这样)但是失败了。 I have tried to replace "\n" with " " but that fails aswell.我试图用“”替换“\n”,但也失败了。

It seems like there are some values in my receipt text that cause trouble, but it is very difficult to figure out what they are (and since I have ~3 million rows, it takes a while to test one row at a time). Are there any values that are not allowed in a BigQuery table?
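
One way to narrow down the offending rows (a hedged sketch; the slice size, table, and project names are placeholders I introduced) is to probe the upload in small slices and record which ones fail:

STEP = 10_000
bad_slices = []
for start in range(0, len(df), STEP):
    part = df.iloc[start:start + STEP]
    try:
        part.to_gbq(destination_table="Dataset.my_table_probe",
                    project_id="my-project",
                    if_exists="append")
    except Exception as exc:  # pandas-gbq wraps API errors, e.g. GenericGBQException
        bad_slices.append((start, start + STEP, exc))
# bad_slices now holds the row ranges (and errors) that need closer inspection.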

It turns out that chunksize in to_gbq does not split the data into chunks the way I thought it did. Manually looping over the dataframe in chunks, like

import math

CHUNKSIZE = 100_000

# Ceiling division so the final partial chunk is not dropped
# (range(0, df.shape[0] // CHUNKSIZE) would silently skip it).
for i in range(math.ceil(df.shape[0] / CHUNKSIZE)):
    print(i)
    df_temp = df.iloc[i * CHUNKSIZE:(i + 1) * CHUNKSIZE]
    df_temp.to_gbq(destination_table="Dataset.my_table",
                   project_id="my-project",
                   if_exists="append",
                   )

did the trick (setting chunksize=100_000 did not work).
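
An equivalent variant (an assumption on my part, not tested against this exact dataset) uses numpy.array_split, which also covers the final partial chunk without any index arithmetic:

import math
import numpy as np

for part in np.array_split(df, math.ceil(len(df) / CHUNKSIZE)):
    part.to_gbq(destination_table="Dataset.my_table",
                project_id="my-project",
                if_exists="append")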
