Getting an “EOFError: End of stream already reached” error when trying to untar a large file on the fly with Python and smart_open
I'm trying to download and decompress a set of files from a remote Apache server. I provide a list of .tbz (tar.bz2) files to be downloaded and decompressed on the fly. The goal is to stream them from the remote Apache server through the tar decompressor and immediately stream them to my Amazon AWS S3 bucket. I do this because the files can be as large as 30 GB.
I use the smart_open Python library to abstract away HTTPS and S3 management.
The code I provide here works fine for small files. As soon as I try this with a larger file (over 8 MB), I get the following error:
"EOFError: End of stream already reached"
Here's the traceback:
Traceback (most recent call last):
  File "./script.py", line 28, in <module>
    download_file(fileName)
  File "./script.py", line 21, in download_file
    for line in tfext:
  File "/.../lib/python3.7/tarfile.py", line 706, in readinto
    buf = self.read(len(b))
  File "/.../lib/python3.7/tarfile.py", line 695, in read
    b = self.fileobj.read(length)
  File "/.../lib/python3.7/tarfile.py", line 537, in read
    buf = self._read(size)
  File "/.../lib/python3.7/tarfile.py", line 554, in _read
    buf = self.cmp.decompress(buf)
EOFError: End of stream already reached
When I print out the lines I'm writing to the stream, I can see that I'm still only through the first fraction of the file before the error is thrown.
What I've tried so far:
I've tried to specify the same buffer size for both open() and tarfile.open(), without success.
I've also tried to introduce some delay between the writing of each line, but to no avail.
from smart_open import open
import tarfile

baseUrl = 'https://someurlpath/'
filesToDownload = ['name_of_file_to_download']

def download_file(fileName):
    fileUrl = baseUrl + fileName + '.tbz'
    with open(fileUrl, 'rb') as fin:
        with tarfile.open(fileobj=fin, mode='r|bz2') as tf:
            destination = 's3://some_aws_path/' + fileName + '.csv'
            with open(destination, 'wb') as fout:
                with tf.extractfile(tf.next()) as tfext:
                    for line in tfext:
                        fout.write(line)

for fileName in filesToDownload:
    download_file(fileName)
I want to be able to process large files exactly the same way I'm able to process small ones.
Extracting from a compressed tar archive requires file seeking, which may not be possible on the virtual file descriptor created by smart_open. An alternative is to download the data to block storage before processing it.
from smart_open import open
import tarfile
import boto3

filenames = ['test.tar.bz2', ]

def download_file(fileName):
    # Download the archive to local block storage first, so that
    # tarfile operates on a real, seekable file.
    s3 = boto3.resource('s3')
    bucket = s3.Bucket('bucketname')
    obj = bucket.Object(fileName)
    local_filename = '/tmp/{}'.format(fileName)
    obj.download_file(local_filename)

    # Random-access mode 'r:bz2' works here because the file is seekable.
    tf = tarfile.open(local_filename, 'r:bz2')
    for member in tf.getmembers():
        tf.extract(member)
        fd = open(member.name, 'rb')
        print(member, len(fd.read()))

if __name__ == '__main__':
    for f in filenames:
        download_file(f)
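To then complete the original goal (writing the extracted file to S3), the member can be copied out in fixed-size chunks rather than line by line. A minimal self-contained sketch of that copy step, using an in-memory archive in place of the downloaded file and a BytesIO in place of the smart_open S3 writer (both are stand-ins, not the real endpoints):

```python
import io
import shutil
import tarfile

# Build a small .tar.bz2 in memory so the sketch is self-contained; this
# stands in for the archive already downloaded to local storage.
payload = b"col1,col2\n1,2\n" * 1000
archive = io.BytesIO()
with tarfile.open(fileobj=archive, mode='w:bz2') as tf:
    info = tarfile.TarInfo(name='data.csv')
    info.size = len(payload)
    tf.addfile(info, io.BytesIO(payload))
archive.seek(0)

# Extract the first member and copy it out in fixed-size chunks. `dst` is
# a BytesIO here, but the same loop works when it is the writable file
# object returned by smart_open's open('s3://...', 'wb').
dst = io.BytesIO()
with tarfile.open(fileobj=archive, mode='r:bz2') as tf:
    src = tf.extractfile(tf.next())
    shutil.copyfileobj(src, dst, length=1024 * 1024)

print(dst.getvalue() == payload)  # True
```

Chunked copying with shutil.copyfileobj keeps memory use bounded regardless of member size, which matters for files of up to 30 GB.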