
Getting a "EOFError: End of stream already reached" error when trying to untar a large file on the fly with Python and smart_open

I'm trying to download and decompress a set of files from a remote Apache server. I provide a list of .tbz (tar.bz2) files to be downloaded and decompressed on the fly. The goal is to stream them from the remote Apache server through the tar decompressor and immediately on to my Amazon AWS S3 bucket. I do this because the files can be as large as 30 GB.

I use the "smart_open" Python library to abstract away HTTPS and S3 management.

The code I provide here works fine for small files. As soon as I try it with a larger file (over 8 MB), I get the following error:

"EOFError: End of stream already reached"

Here's the traceback:

Traceback (most recent call last):
  File "./script.py", line 28, in <module>
    download_file(fileName)
  File "./script.py", line 21, in download_file
    for line in tfext:
  File "/.../lib/python3.7/tarfile.py", line 706, in readinto
    buf = self.read(len(b))
  File "/.../lib/python3.7/tarfile.py", line 695, in read
    b = self.fileobj.read(length)
  File "/.../lib/python3.7/tarfile.py", line 537, in read
    buf = self._read(size)
  File "/.../lib/python3.7/tarfile.py", line 554, in _read
    buf = self.cmp.decompress(buf)
EOFError: End of stream already reached

When I print out the lines I'm writing to the stream, I can see that I get through the first fraction of the file before the error is thrown.

What I've tried so far:

  1. I've tried specifying the same buffer size for both open() and tarfile.open(), without success (a sketch of what that looked like follows this list).

  2. I've also tried introducing a delay between writing each line, to no avail.
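
For reference, attempt 1 presumably looked something like this minimal sketch; BUF_SIZE is a hypothetical value (the question doesn't say which sizes were tried), smart_open's open() accepts a buffering argument, and tarfile.open() accepts bufsize:

from smart_open import open
import tarfile

BUF_SIZE = 128 * 1024  # hypothetical buffer size; not from the question
fileUrl = 'https://someurlpath/name_of_file_to_download.tbz'

with open(fileUrl, 'rb', buffering=BUF_SIZE) as fin:
    with tarfile.open(fileobj=fin, mode='r|bz2', bufsize=BUF_SIZE) as tf:
        member = tf.next()  # first member of the streamed archive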

Here's the full script:

from smart_open import open
import tarfile

baseUrl = 'https://someurlpath/'
filesToDownload = ['name_of_file_to_download']

def download_file(fileName):
    fileUrl = baseUrl + fileName + '.tbz'
    # Stream the remote archive over HTTPS
    with open(fileUrl, 'rb') as fin:
        # 'r|bz2' = non-seekable, streaming bz2 decompression
        with tarfile.open(fileobj=fin, mode='r|bz2') as tf:
            destination = 's3://some_aws_path/' + fileName + '.csv'
            # Write the first member of the archive straight to S3
            with open(destination, 'wb') as fout:
                with tf.extractfile(tf.next()) as tfext:
                    for line in tfext:
                        fout.write(line)


for fileName in filesToDownload:
    download_file(fileName)

I want to be able to process large files exactly the same way I'm able to process small ones.

Extracting from a compressed tar can require seeking within the file, which may not be possible with the virtual file descriptor created by smart_open. An alternative is to download the data to block storage before processing it.

from smart_open import open
import tarfile
import boto3

filenames = ['test.tar.bz2', ]

def download_file(fileName):
    s3 = boto3.resource('s3')
    bucket = s3.Bucket('bucketname')
    obj = bucket.Object(fileName)
    # Download the whole archive to local (block) storage first,
    # so tarfile gets a real, seekable file to work with
    local_filename = '/tmp/{}'.format(fileName)
    obj.download_file(local_filename)
    tf = tarfile.open(local_filename, 'r:bz2')
    for member in tf.getmembers():
        tf.extract(member)
        with open(member.name, 'rb') as fd:
            print(member, len(fd.read()))

if __name__ == '__main__':
    for f in filenames:
        download_file(f)
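
If the goal from the question is still to land the extracted file in S3, the locally extracted member can then be streamed back up with smart_open. A minimal sketch, assuming the placeholder paths from the question; upload_to_s3 is a hypothetical helper and chunk_size is an arbitrary read size:

from smart_open import open

def upload_to_s3(local_path, s3_uri, chunk_size=16 * 1024 * 1024):
    # Copy the local file to S3 in fixed-size chunks so a 30 GB
    # file never has to fit in memory; smart_open performs a
    # multipart S3 upload behind the 's3://' URI
    with open(local_path, 'rb') as fin, open(s3_uri, 'wb') as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            fout.write(chunk)

upload_to_s3('/tmp/test.csv', 's3://some_aws_path/test.csv')  # placeholder paths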
