
Which characters are allowed in a BigQuery STRING column (getting "UDF out of memory" error)

I have a dataframe containing receipt data. The column text in my dataframe contains the text from the receipt and seems to be the issue when I try to upload the data to BigQuery using df.to_gbq(...), since it produces the error

GenericGBQException: Reason: 400 Resources exceeded during query execution: UDF out of memory.; Failed to read Parquet file /some/file.
This might happen if the file contains a row that is too large,
or if the total size of the pages loaded for the queried columns is too large.

According to the error message it seems to be a memory error, but I have tried converting all characters in each text to an "a" (to see if the strings contained too many characters), and that worked fine, i.e. I doubt that is the problem.
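
For reference, a minimal sketch of that length-only test might look like this (df and the test table name here are assumptions, not part of the original code):

import pandas as pd  # df below is assumed to be an existing pandas DataFrame

# Keep each string's length but replace every character with "a",
# so any unusual characters are removed while sizes stay the same.
df_test = df.copy()
df_test["text"] = df_test["text"].fillna("").map(lambda s: "a" * len(s))
df_test.to_gbq(destination_table="Dataset.my_table_test",
               project_id="my-project",
               if_exists="replace")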

I have tried converting all characters to UTF-8 with

df["text"] = df["text"].str.encode('utf-8') (since according to the docs they should be so) but that failed. df["text"] = df["text"].str.encode('utf-8') (因为根据文档他们应该是这样)但是失败了。 I have tried to replace "\n" with " " but that fails aswell.我试图用“”替换“\n”,但也失败了。

It seems like there are some values in my receipt text that cause trouble, but it is very difficult to figure out what they are (and since I have ~3 million rows, it takes a while to test one row at a time). Are there any values that are not allowed in a BigQuery table?
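
One way to narrow down the offending rows (a hedged sketch; the slice size, table, and project names are placeholders I introduced) is to probe the upload in small slices and record which ones fail:

STEP = 10_000
bad_slices = []
for start in range(0, len(df), STEP):
    part = df.iloc[start:start + STEP]
    try:
        part.to_gbq(destination_table="Dataset.my_table_probe",
                    project_id="my-project",
                    if_exists="append")
    except Exception as exc:  # pandas-gbq wraps API errors, e.g. GenericGBQException
        bad_slices.append((start, start + STEP, exc))
# bad_slices now holds the row ranges (and errors) that need closer inspection.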

It turns out that chunksize in to_gbq does not split the data into chunks the way I thought it did. Manually looping over the dataframe in chunks, like

import math

CHUNKSIZE = 100_000

# Ceiling division so the final partial chunk is not dropped
# (range(0, df.shape[0] // CHUNKSIZE) would silently skip it).
for i in range(math.ceil(df.shape[0] / CHUNKSIZE)):
    print(i)
    df_temp = df.iloc[i * CHUNKSIZE:(i + 1) * CHUNKSIZE]
    df_temp.to_gbq(destination_table="Dataset.my_table",
                   project_id="my-project",
                   if_exists="append",
                   )

did the trick (setting chunksize=100_000 did not work).
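
An equivalent variant (an assumption on my part, not tested against this exact dataset) uses numpy.array_split, which also covers the final partial chunk without any index arithmetic:

import math
import numpy as np

for part in np.array_split(df, math.ceil(len(df) / CHUNKSIZE)):
    part.to_gbq(destination_table="Dataset.my_table",
                project_id="my-project",
                if_exists="append")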
