Python: Stream gzip files from s3

Question

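I have files in s3 stored as gzip chunks, so I have to read the data sequentially and cannot read from an arbitrary position. I always have to start with the first file.

For example, suppose I have 3 gzip files in s3: f1.gz, f2.gz and f3.gz. If I download them all locally, I can run cat * | gzip -d. If I run cat f2.gz | gzip -d, it fails with gzip: stdin: not in gzip format.

How can I stream this data from s3 using python? I came across smart-open, which is able to decompress gz files: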

from smart_open import open

with open(path, compression='.gz') as f:
    for line in f:
        print(line.strip())

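where path is the path to f1.gz. This works until it reaches the end of the file, where it aborts. The same thing happens locally: if I run cat f1.gz | gzip -d, it fails with gzip: stdin: unexpected end of file when it reaches the end.

Is there a way to make it stream the files continuously using python?

This version does not abort, and can iterate through f1.gz, f2.gz and f3.gz: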

with open(path, 'rb', compression='disable') as f:
    for line in f:
        print(line.strip(), end="")

But the output is just bytes. I thought that running python test.py | gzip -d with the code above would work, but I get the error gzip: stdin: not in gzip format. Is there a way to have python print, using smart-open, something that gzip can read?
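A likely reason the pipe fails: print() writes the text representation of each bytes object (b'...'), and strip() removes whitespace bytes from the compressed data, so gzip never receives the original bytes. Below is a minimal sketch of forwarding the bytes unchanged. The helper name copy_raw and the chunk size are hypothetical, and the source is assumed to be any binary file object, such as the one opened with compression='disable' above.

```python
import gzip
import io
import shutil


def copy_raw(src, dst, chunk_size=64 * 1024):
    """Copy bytes from src to dst unchanged, chunk_size bytes at a time."""
    shutil.copyfileobj(src, dst, chunk_size)


# In-memory demonstration: the copied bytes are still a valid gzip stream.
compressed = gzip.compress(b"1\n2\n3\n")
sink = io.BytesIO()
copy_raw(io.BytesIO(compressed), sink)
roundtrip = gzip.decompress(sink.getvalue())

# In a script, dst would be sys.stdout.buffer (the binary layer of stdout),
# e.g. copy_raw(f, sys.stdout.buffer), so that `python test.py | gzip -d`
# receives the compressed bytes exactly as they are stored.
```

This only helps if the bytes of all the chunks are written in order; a single chunk on its own is still a truncated gzip stream.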

Answer 1

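For example, suppose I have 3 gzip files in s3: f1.gz, f2.gz and f3.gz. If I download them all locally, I can run cat * | gzip -d.

One idea is to write a file object that implements this. The file object reads from one file handle until it is exhausted, then reads from the next file handle until it is exhausted, and so on. This is similar to how cat works internally.

The convenient thing about this is that it has the same effect as concatenating all of the files, without using the memory needed to read all of them at once.

Once you have the combined file object wrapper, you can pass it to Python's gzip module to decompress the files.

Example: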

import gzip

class ConcatFileWrapper:
    def __init__(self, files):
        self.files = iter(files)
        self.current_file = next(self.files)
    def read(self, *args):
        ret = self.current_file.read(*args)
        if len(ret) == 0:
            # EOF
            # Optional: close self.current_file here
            # self.current_file.close()
            # Advance to next file and try again
            try:
                self.current_file = next(self.files)
            except StopIteration:
                # Out of files
                # Return the empty bytes object, which signals EOF
                return ret
            # Recurse and try again
            return self.read(*args)
        return ret
    def write(self):
        raise NotImplementedError()

filenames = ["xaa", "xab", "xac", "xad"]
filehandles = [open(f, "rb") for f in filenames]
wrapper = ConcatFileWrapper(filehandles)

with gzip.open(wrapper) as gf:
    for line in gf:
        print(line)

# Close all files
for f in filehandles:
    f.close()

Here is how I tested it:

I created a file to test this with the following commands.

Create a file containing the numbers 1 to 1000.

$ seq 1 1000 > foo

Compress it.

$ gzip foo

Split the file. This generates four files named xaa through xad.

$ split -b 500 foo.gz

Run the Python file above on those files; it should print the numbers 1 to 1000.
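The same test data can also be built without the shell. The following is a sketch that mirrors the seq, gzip and split steps in memory; the helper name split_bytes is hypothetical, and the number of pieces may differ from the command-line result because the compressed size depends on the gzip implementation and level.

```python
import gzip


def split_bytes(data, size):
    """Split data into consecutive pieces of at most size bytes (like split -b)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


# Same content as `seq 1 1000`: the numbers 1 to 1000, one per line.
original = "".join(f"{n}\n" for n in range(1, 1001)).encode("ascii")
pieces = split_bytes(gzip.compress(original), 500)

# The first piece alone is a truncated stream, which matches the
# "unexpected end of file" error described in the question.
try:
    gzip.decompress(pieces[0])
    first_piece_alone_ok = True
except EOFError:
    first_piece_alone_ok = False

# Joining the pieces in order restores a valid gzip stream.
restored = gzip.decompress(b"".join(pieces))
```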

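Edit: an extra note on opening files lazily

If you have a large number of files, you may want to have only one file open at a time. Here is an example: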

def open_files(filenames):
    for filename in filenames:
        # Note: this will leak file handles unless you uncomment the code above that closes the file handles again.
        yield open(filename, "rb")
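To avoid leaking handles, the lazy opening and the closing can live in the same object. The following is a sketch under assumptions, not a drop-in replacement: the class name LazyConcatReader and the opener parameter are hypothetical. For S3, the opener could plausibly be a function that calls smart_open's open(uri, 'rb', compression='disable') as in the question; here an in-memory stand-in is used so that the example is self-contained.

```python
import gzip
import io


class LazyConcatReader:
    """Read several binary sources back to back, opening one at a time."""

    def __init__(self, names, opener):
        self._names = iter(names)
        self._opener = opener  # callable: name -> binary file object
        self._current = None

    def read(self, size=-1):
        # Note: read(-1) returns only the rest of the current source;
        # the gzip module calls read() with positive sizes, so this suffices.
        if size == 0:
            return b""
        while True:
            if self._current is None:
                name = next(self._names, None)
                if name is None:
                    return b""  # no sources left: real EOF
                self._current = self._opener(name)
            data = self._current.read(size)
            if data:
                return data
            # Current source is exhausted: close it before opening the next.
            self._current.close()
            self._current = None


# Stand-in for S3: one gzip stream cut into 500-byte pieces held in a dict.
payload = b"".join(b"%d\n" % n for n in range(1, 1001))
blob = gzip.compress(payload)
names = [f"part{i}" for i in range((len(blob) + 499) // 500)]
store = {name: blob[i * 500:(i + 1) * 500] for i, name in enumerate(names)}
opened = []


def fake_opener(name):
    handle = io.BytesIO(store[name])
    opened.append(handle)
    return handle


with gzip.open(LazyConcatReader(names, fake_opener)) as gf:
    result = gf.read()
```

Each source is closed as soon as it returns no more data, so at most one handle is open at any time.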

Answer 1: 2, accepted, 2022-03-31 03:52:56
