MemoryError when using the read() method to read a large JSON file from Amazon S3
Reading a large file from S3 into a dataframe
When I try to read a file larger than 2GB into a dataframe, I get the following error: OverflowError: signed integer is greater than maximum
This is mentioned in https://bugs.python.org/issue42853
Is there a workaround?
As the bug report suggests, read the file into a buffer in chunks instead of with a single read() call. Keep in mind that you are still loading all the data into RAM, so your system still needs enough memory to hold the whole file; otherwise you will get a MemoryError.
Existing code
import io
import boto3
import pandas as pd

s3_resource = boto3.resource("s3")
s3_client = boto3.client("s3")

s3_obj = s3_resource.Object(bucket_name, filename).get()
# A single .read() on a body larger than 2GB raises the OverflowError
with io.BytesIO(s3_obj["Body"].read()) as file:
    file_as_df = pd.read_csv(file, encoding='latin1', sep='\t')
Modified code
response = s3_client.get_object(Bucket=bucket_name, Key=filename)

# Preallocate a buffer of the full object size, then fill it in 64 MiB
# chunks so that each individual .read() call stays well under the 2GB limit
buf = bytearray(response['ContentLength'])
view = memoryview(buf)
pos = 0
while True:
    chunk = response['Body'].read(67108864)  # 64 MiB per read
    if len(chunk) == 0:
        break
    view[pos:pos + len(chunk)] = chunk
    pos += len(chunk)

file_as_df = pd.read_csv(io.BytesIO(bytes(view)), encoding='latin1', sep='\t')
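The chunked-copy loop above can be factored into a reusable helper and tested without S3, since it only relies on the stream having a `.read(n)` method. This is a minimal sketch; the function name `read_stream_to_buffer` and the in-memory `io.BytesIO` stand-in for the S3 response body are illustrative assumptions, not part of the original answer.

```python
import io

def read_stream_to_buffer(body, total_length, chunk_size=64 * 1024 * 1024):
    """Copy a stream into a preallocated buffer in fixed-size chunks.

    `body` is any object with a .read(n) method (e.g. the boto3 response
    Body); `total_length` is the expected byte count (e.g. the S3
    response's ContentLength). Each .read() stays under the 2GB limit.
    """
    buf = bytearray(total_length)
    view = memoryview(buf)
    pos = 0
    while True:
        chunk = body.read(chunk_size)
        if len(chunk) == 0:
            break
        view[pos:pos + len(chunk)] = chunk
        pos += len(chunk)
    return bytes(buf[:pos])

# Usage with an in-memory stand-in for the S3 body:
data = b"col_a\tcol_b\n1\t2\n"
result = read_stream_to_buffer(io.BytesIO(data), len(data), chunk_size=4)
# result == data
```

A tiny `chunk_size` is used here only to exercise the loop; against S3 you would keep the 64 MiB default so each GET range read is large enough to be efficient.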