Read compressed JSON file from s3 in chunks and write each chunk to parquet
I'm trying to fetch a 20GB gzipped JSON file from S3 in chunks, decompress each chunk, convert the chunk to Parquet, and then save it to another bucket.

I have the following code that works fine with smaller files. However, when I try this with the 20GB file, I get the traceback below. I'm not entirely sure how to fix this.
Traceback (most recent call last):
  File "/Users/samlambert/Desktop/Repos/hdns/src/collection/data_converter.py", line 33, in <module>
    data = gzip.decompress(chunk)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/gzip.py", line 548, in decompress
    return f.read()
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/gzip.py", line 292, in read
    return self._buffer.read(size)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/gzip.py", line 498, in read
    raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached
with S3() as s3:
    # Get the S3 object (20GB gzipped JSON file)
    obj = s3.s3_client.get_object(Bucket=input_bucket, Key=object_key)
    # Separate the file into chunks
    for chunk in obj['Body'].iter_chunks():
        # Decompress and decode
        data = gzip.decompress(chunk)
        text = data.decode('utf-8')
        # At this point the chunk is one string with multiple lines of JSON.
        # We parse each line into its own JSON object, then append to json_data.
        data_in_list = text.splitlines()
        json_data = []
        for data in data_in_list:
            string_to_json = json.loads(data)
            json_data.append(string_to_json)
        # Convert the list of JSON objects into one dataframe
        df = pd.DataFrame(json_data)
        # Convert df to parquet and save to s3
        parquet_filename = 'df.parquet.gzip'
        df.to_parquet(parquet_filename, index=False)
        s3_url = 's3://mybucket/parquet_test/bucket.parquet.gzip'
        df.to_parquet(s3_url, compression='gzip')
Edit:

So, I think I can do this more directly with Pandas:
with S3Connect() as s3:
    obj = s3.s3_client.get_object(Bucket=input_bucket, Key=object_key)
    count = 0
    for df in pd.read_json("s3://path/to/file.json.gz", lines=True, chunksize=50000000):
        count += 1
        parquet_filename = f'df_{count}.parquet.gzip'
        df.to_parquet(parquet_filename, index=False)
        s3_url = f's3://parquet_test/{parquet_filename}'
        df.to_parquet(s3_url, compression='gzip')
        # The file is also being saved locally, which I don't want,
        # so I'm just doing this to remove it
        os.remove(parquet_filename)
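One way to avoid the local-file round trip noted in the comment above is to serialize the Parquet bytes into an in-memory buffer and upload them directly with put_object. This is a hypothetical sketch, not part of the original code: the write_fn callback, client, bucket, and key names are all assumptions. With pandas, write_fn would be something like lambda b: df.to_parquet(b, compression='gzip').

```python
import io

# Hypothetical sketch: serialize into an in-memory buffer instead of a
# local file, then upload the raw bytes. `write_fn` is any callable that
# writes the serialized data into a binary file-like object, e.g.
#   lambda b: df.to_parquet(b, compression='gzip')
def upload_via_buffer(write_fn, s3_client, bucket, key):
    buf = io.BytesIO()
    write_fn(buf)            # serialize into memory
    buf.seek(0)
    # Upload the buffered bytes; no temporary file, no os.remove needed.
    s3_client.put_object(Bucket=bucket, Key=key, Body=buf.getvalue())
```

This keeps each chunk's Parquet output entirely in memory, so the os.remove cleanup step goes away; the trade-off is that each serialized chunk must fit in RAM.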
Gzip needs to know when the file ends, so by passing it one small chunk of data at a time you're effectively telling it that each chunk is a tiny gzip file, and it fails because each one ends early. gzip.open, on the other hand, can be given a file or file-like object, and it uses the return value of read to know when the file ends.

So you can simply pass it the output object from get_object, and let it request data from the S3 object as needed and detect when the gzip stream actually ends.
import gzip
import json

import boto3

s3 = boto3.client('s3')
s3_object = s3.get_object(Bucket=input_bucket, Key=object_key)['Body']
with gzip.open(s3_object, "r") as f:
    for row in f:
        row = json.loads(row)
        # TODO: Handle each row as it comes in...
        print(row)
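To get back to the original goal of writing each chunk to Parquet, the row-by-row loop above can be batched: collect a fixed number of parsed rows, then hand each batch to pd.DataFrame(batch).to_parquet(...). Below is a minimal, locally runnable sketch of that batching logic; the batched_rows name and batch_size parameter are my own, and the in-memory gzip stream merely stands in for the StreamingBody that get_object returns.

```python
import gzip
import io
import json

# Hypothetical sketch: stream newline-delimited JSON out of a gzipped
# file-like object and yield fixed-size batches of parsed rows. In the
# question's setting, `source` would be s3.get_object(...)['Body'], and
# each yielded batch would become pd.DataFrame(batch).to_parquet(...).
def batched_rows(source, batch_size):
    batch = []
    with gzip.open(source, "rt") as f:   # text mode: yields decoded lines
        for line in f:
            batch.append(json.loads(line))
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:                            # flush the final partial batch
        yield batch

# Local demo with an in-memory gzip stream standing in for the S3 object.
payload = gzip.compress(b'{"a": 1}\n{"a": 2}\n{"a": 3}\n')
batches = list(batched_rows(io.BytesIO(payload), batch_size=2))
```

Because gzip.open pulls bytes from the source only as lines are consumed, memory use is bounded by the batch size rather than the 20GB file size.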