Extract a bz2 file with a single file in memory

Question

I have a CSV file compressed into a bz2 file that I'm trying to load from a website, decompress, and write to a local CSV file with the following code:

# Get zip file from website
archive = StringIO()
url_data = urllib2.urlopen(url)
archive.write(url_data.read())

# Extract the training data
data = bz2.decompress(archive.read())

# Write to csv
output_file = open('dataset_' + mode + '.csv', 'w')
output_file.write(data)

On the decompress call, I get IOError: invalid data stream. As a note, the CSV file contained in the archive has quite a few characters that could be causing issues. In particular, if I try putting the file contents in unicode, I get an error about not being able to decode 0xfd. I only have the single file within the archive, but I'm wondering whether something could also be going wrong because I'm not extracting a specific file.

Any ideas?

Answer 1

I suspect you are getting this error because the stream you are feeding to the decompress() function is not a valid bz2 stream.
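
One quick way to test that suspicion is to look at the first bytes of the download: a bzip2 stream starts with the signature `BZh` followed by a block-size digit from 1 to 9. A minimal sketch in Python 3 (the question uses Python 2; `looks_like_bz2` is a hypothetical helper name, not part of the `bz2` module, and the sample payloads are made up):

```python
import bz2


def looks_like_bz2(payload):
    """Return True if payload starts with a bzip2 stream header."""
    # Header: magic bytes b"BZh" plus one block-size digit, b"1" to b"9".
    return (
        len(payload) >= 4
        and payload[:3] == b"BZh"
        and payload[3:4].isdigit()
        and payload[3:4] != b"0"
    )


sample = bz2.compress(b"a,b\n1,2\n")
print(looks_like_bz2(sample))               # True
print(looks_like_bz2(b"<html>404</html>"))  # False
```

A check like this only inspects the header, so it can rule out an HTML error page or an empty response, but it cannot prove the rest of the stream is intact.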

You must also "rewind" your StringIO buffer after writing to it; see the comments in the code below. The following code (the same as yours apart from the imports and the seek() fix) works if the URL points to a valid bz2 file.

from StringIO import StringIO
import urllib2
import bz2

# Get bz2 file from website
url = "http://www.7-zip.org/a/7z920.tar.bz2"  # just an example bz2 file

archive = StringIO()

# in case the request fails (e.g. 404, 500), this will raise
# a `urllib2.HTTPError`
url_data = urllib2.urlopen(url)

archive.write(url_data.read())

# will print how much compressed data you have buffered.
print "Length of file:", archive.tell()

# important!... make sure to reset the file descriptor read position
# to the start of the file.
archive.seek(0)

# Extract the training data
data = bz2.decompress(archive.read())

# Write to csv (binary mode, so the decompressed bytes are written unchanged)
with open('output_file', 'wb') as output_file:
    output_file.write(data)
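
For readers on Python 3, where `urllib2` and the `StringIO` module no longer exist, the same flow might look like the sketch below. It is an adaptation, not the code from the question: `fetch_and_decompress` and its `opener` parameter are hypothetical names introduced here so the download step can be swapped out, and the buffer is an `io.BytesIO` because the archive is binary.

```python
import bz2
import io
import urllib.request


def fetch_and_decompress(url, opener=urllib.request.urlopen):
    """Download a bz2 file into memory and return the decompressed bytes."""
    archive = io.BytesIO()
    # opener(url) must return a file-like object; urlopen raises
    # urllib.error.HTTPError for responses such as 404 or 500.
    with opener(url) as response:
        archive.write(response.read())
    # After write() the position is at the end of the buffer, so read()
    # would return b"". Rewind before reading.
    archive.seek(0)
    return bz2.decompress(archive.read())
```

The buffer is kept only to mirror the original code; passing `response.read()` straight to `bz2.decompress()` would avoid the rewind step altogether. Writing the returned bytes with `open(path, 'wb')` keeps them unchanged on disk.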

Re: encoding issues

Generally, character encoding errors will generate UnicodeError (or one of its cousins), but not IOError. IOError suggests something is wrong with the input, such as truncation, or some other error that prevents the decompressor from doing its work completely.
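
The distinction can be checked directly. The sketch below uses Python 3, where `IOError` is an alias of `OSError`; the sample byte strings are made up for illustration and are not taken from the question's data.

```python
import bz2

# Input that is not a bzip2 stream fails inside the decompressor.
try:
    bz2.decompress(b"this is not bzip2 data")
except OSError as exc:
    garbage_error = exc

# Decoding a byte that is invalid in UTF-8 fails with a Unicode error instead.
try:
    b"\xfd".decode("utf-8")
except UnicodeDecodeError as exc:
    decode_error = exc

print(type(garbage_error).__name__, type(decode_error).__name__)
```

So an error raised by the decompress call points at the bytes handed to the decompressor, while the 0xfd decoding error would only appear later, when the decompressed bytes are turned into text.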

You have omitted the imports from your question, and one of the subtle differences between StringIO and cStringIO (according to the docs) is that cStringIO cannot work with unicode strings that cannot be converted to ASCII. That no longer seems to hold (in my tests at least), but it may be at play.

Unlike the StringIO module, this module (cStringIO) is not able to accept Unicode strings that cannot be encoded as plain ASCII strings.
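
Python 3 replaces both modules with `io.StringIO` (text) and `io.BytesIO` (bytes), which makes the split explicit. As a hedged illustration of that Python 3 behaviour, not of `cStringIO` itself, compressed data belongs in a `BytesIO`; the two-byte payload below is a made-up stand-in for binary data:

```python
import io

text_buffer = io.StringIO()
byte_buffer = io.BytesIO()

# A bytes buffer accepts arbitrary binary data, including 0xfd.
byte_buffer.write(b"\xfd\x00")

# A text buffer refuses bytes instead of trying to decode them.
try:
    text_buffer.write(b"\xfd\x00")
except TypeError:
    rejected = True
else:
    rejected = False

print(rejected)  # True
```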

Solution 1
2 · accepted · 2015-11-20 00:14:30