Read compressed JSON file from s3 in chunks and write each chunk to parquet

I'm trying to fetch a 20GB gzipped JSON file from s3 in chunks, decompress each chunk, convert it to parquet, and then save it to another bucket.

I have the following code that works fine for smaller files, but when I try it with the 20GB file I get the traceback below. I'm not entirely sure how to fix this.

Traceback (most recent call last):
  File "/Users/samlambert/Desktop/Repos/hdns/src/collection/data_converter.py", line 33, in <module>
    data = gzip.decompress(chunk)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/gzip.py", line 548, in decompress
    return f.read()
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/gzip.py", line 292, in read
    return self._buffer.read(size)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/gzip.py", line 498, in read
    raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached

with S3() as s3:
    # get s3 object (20GB gzipped JSON file)
    obj = s3.s3_client.get_object(Bucket=input_bucket, Key=object_key)

    # Separate the file into chunks
    for chunk in obj['Body'].iter_chunks():
        
        # decompress and decode
        data = gzip.decompress(chunk)
        text = data.decode('utf-8')

        # At this point chunk is one string with multiple lines of JSON
        # We convert each line into its own JSON object, then append it to json_data
        data_in_list = text.splitlines()
        json_data = []
        for data in data_in_list:
            string_to_json = json.loads(data)
            json_data.append(string_to_json)

        # Convert list of JSON objs into one dataframe
        df = pd.DataFrame(json_data)
        
        # convert df to parquet and save to s3
        parquet_filename = 'df.parquet.gzip'
        df.to_parquet(parquet_filename, index=False)
        s3_url = 's3://mybucket/parquet_test/bucket.parquet.gzip'
        df.to_parquet(s3_url, compression='gzip')

Edit:

So I thought I could do this more directly with Pandas:

with S3Connect() as s3:
    obj = s3.s3_client.get_object(Bucket=input_bucket, Key=object_key)
    count = 0
    for df in pd.read_json("s3://path/to/file.json.gz", lines=True, chunksize=50000000):
        count += 1
        parquet_filename = f'df_{str(count)}.parquet.gzip'
        df.to_parquet(parquet_filename, index=False)
        s3_url = f's3://parquet_test/{parquet_filename}'
        df.to_parquet(s3_url, compression='gzip')
        # The file is also being saved locally, which I don't want,
        # so I'm just doing this to remove it (an in-memory alternative
        # is sketched below)
        os.remove(parquet_filename)
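
If the local file exists only as an intermediate step, one alternative (a minimal sketch, assuming pyarrow or fastparquet is installed, and reusing the bucket/prefix names from the question as placeholders) is to serialize each chunk into an in-memory buffer and upload it with boto3, so nothing is written to disk:

import io

import boto3
import pandas as pd

s3_client = boto3.client('s3')

count = 0
for df in pd.read_json("s3://path/to/file.json.gz", lines=True, chunksize=50000000):
    count += 1
    buffer = io.BytesIO()
    # serialize the chunk to parquet in memory (needs pyarrow or fastparquet)
    df.to_parquet(buffer, compression='gzip', index=False)
    # 'mybucket' and the key prefix are placeholders taken from the question
    s3_client.put_object(Bucket='mybucket',
                         Key=f'parquet_test/df_{count}.parquet.gzip',
                         Body=buffer.getvalue())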

Gzip needs to know when the file ends, so by passing it one small chunk of data at a time you are effectively telling it that these are small gzip files, and because they end early, it fails. gzip.open, on the other hand, can be passed a file, or a file-like object, and it uses the return value of read to know when the file ends.

So you can simply pass it the output object from get_object, let it request data from the S3 object as needed, and it will know when the gzip file ends.

s3 = boto3.client('s3')
s3_object = s3.get_object(Bucket=input_bucket, Key=object_key)['Body']
with gzip.open(s3_object, "r") as f:
    for row in f:
        row = json.loads(row)
        # TODO: Handle each row as it comes in...
        print(row)
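
To get all the way back to the original goal (parquet chunks landing in another bucket), the streamed rows can be batched and each batch uploaded from memory. This is only a sketch under assumptions not in the answer: pyarrow (or fastparquet) is installed, output_bucket is a hypothetical destination bucket, and the 100,000-row batch size is arbitrary.

import gzip
import io
import json

import boto3
import pandas as pd

s3 = boto3.client('s3')
s3_object = s3.get_object(Bucket=input_bucket, Key=object_key)['Body']

def flush(batch, part):
    # build a DataFrame from the accumulated rows and upload it as parquet
    df = pd.DataFrame(batch)
    buffer = io.BytesIO()
    df.to_parquet(buffer, index=False)  # needs pyarrow or fastparquet
    s3.put_object(Bucket=output_bucket,  # hypothetical destination bucket
                  Key=f'parquet_test/part_{part}.parquet',
                  Body=buffer.getvalue())

batch, part = [], 0
with gzip.open(s3_object, 'rt') as f:  # 'rt' decodes each line to str
    for line in f:
        batch.append(json.loads(line))
        if len(batch) >= 100000:       # arbitrary batch size; tune as needed
            part += 1
            flush(batch, part)
            batch = []

if batch:                              # flush any remaining rows
    part += 1
    flush(batch, part)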
