Reading a large file from S3 into a dataframe

Question

When I try to read a file larger than 2 GB into a dataframe, I get the following error: OverflowError: signed integer is greater than maximum

This is discussed in https://bugs.python.org/issue42853

Is there a workaround?

Answer 1

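As described in the linked bug report, read the file into a buffer in chunks instead of in a single call. Keep in mind that you are still loading all of the data into RAM, so your system still needs enough memory to hold it. Otherwise you will get a MemoryError.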

Existing code

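import io
import boto3
import pandas as pd

s3_resource = boto3.resource("s3")
s3_client = boto3.client("s3")

# Reading the whole body in one call is what fails for objects over 2 GB
s3_obj = s3_resource.Object(bucket_name, filename).get()
with io.BytesIO(s3_obj["Body"].read()) as file:
    file_as_df = pd.read_csv(file, encoding="latin1", sep="\t")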

Modified code

response = s3_client.get_object(Bucket=bucket_name, Key=filename)
# Preallocate a buffer the size of the object, then fill it chunk by chunk
buf = bytearray(response["ContentLength"])
view = memoryview(buf)
pos = 0
while True:
    chunk = response["Body"].read(67108864)  # 64 MiB per read
    if len(chunk) == 0:
        break
    view[pos:pos + len(chunk)] = chunk
    pos += len(chunk)

# Note: bytes(view) makes a second in-memory copy of the data
file_as_df = pd.read_csv(io.BytesIO(bytes(view)), encoding="latin1", sep="\t")
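
The chunked-read loop can also be pulled out into a small helper so it can be tried without S3. The following is only a sketch: the name read_body_in_chunks and its parameters are hypothetical, and io.BytesIO stands in for response['Body'], on the assumption that the real body exposes the same read(n) method.

```python
import io


def read_body_in_chunks(body, total_size, chunk_size=64 * 1024 * 1024):
    """Fill a preallocated buffer from a file-like object, one chunk at a time.

    Hypothetical helper: `body` is anything with a read(n) method.
    Returns the buffer and the number of bytes actually read.
    """
    buf = bytearray(total_size)
    view = memoryview(buf)
    pos = 0
    while pos < total_size:
        # Never ask for more than the space left in the buffer
        chunk = body.read(min(chunk_size, total_size - pos))
        if not chunk:
            break  # the stream ended before total_size bytes arrived
        view[pos:pos + len(chunk)] = chunk
        pos += len(chunk)
    return buf, pos


# Stand-in for response['Body']: a small tab-separated payload
payload = b"a\tb\n1\t2\n3\t4\n"
fake_body = io.BytesIO(payload)
data, n_read = read_body_in_chunks(fake_body, total_size=len(payload), chunk_size=5)
```

With a real S3 response, the call would presumably look like read_body_in_chunks(response['Body'], response['ContentLength']), and the resulting buffer could be wrapped in io.BytesIO and passed to pd.read_csv as in the modified code above.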
