Iterate a large .xz file line by line in Python
I have a large .xz file (a few gigabytes) full of plain text. I want to process the text to create a custom dataset, and I want to read it line by line because it is too big to load into memory at once. Does anyone have an idea how to do this?
I already tried the approach from "How to open and read LZMA file in-memory", but it's not working.
EDIT: I got this error:

    'ascii' codec can't decode byte 0xfd in position 0: ordinal not in range(128)

on this line from the link:

    for line in uncompressed:

EDIT2: My code (using Python 3.5):

    with open(filename) as compressed:
        with lzma.LZMAFile(compressed) as uncompressed:
            for line in uncompressed:
                print(line)

I faced the same problem a few weeks ago. This snippet worked for me:

    import lzma

    with lzma.open('filename.xz', mode='rt') as file:
        for line in file:
            print(line)

This assumes that the text data in the compressed file is encoded in UTF-8 (which was the case for my data) and that UTF-8 is also the default text encoding on your platform. The function lzma.open() has an encoding argument, which lets you set the encoding explicitly if needed.
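
To avoid depending on the platform default, the encoding can be passed explicitly. Here is a minimal sketch of that idea; the helper name iter_xz_lines is hypothetical (it is not part of the lzma module), while mode, encoding and errors are keyword arguments that lzma.open() accepts in text mode.

```python
import lzma


def iter_xz_lines(path, encoding="utf-8", errors="strict"):
    """Yield decoded lines from an .xz file one at a time (hypothetical helper)."""
    # mode="rt" makes lzma.open() return a text stream, so the file is
    # decompressed and decoded incrementally as lines are requested
    # instead of being loaded into memory in one go.
    with lzma.open(path, mode="rt", encoding=encoding, errors=errors) as handle:
        for line in handle:
            # Drop the trailing newline so callers get the bare line content.
            yield line.rstrip("\n")
```

Because this is a generator, a loop such as "for line in iter_xz_lines('filename.xz'):" only holds one line at a time. If the data might contain bytes that are not valid in the chosen encoding, passing errors="replace" is one way to keep iterating instead of raising an exception.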
EDIT (after your own edit): try forcing encoding='utf-8' in lzma.open().
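
A likely reason for the 'ascii' error in your code is that open(filename) opens the compressed file in text mode, so Python tries to decode the raw .xz bytes before lzma ever sees them. The byte 0xfd is the first byte of the .xz header, which matches "position 0" in the message. If you prefer to keep the two-step structure from your code, a sketch along these lines should work; the helper name iter_lines_from_binary_handle is hypothetical, and UTF-8 is assumed for the text inside the archive.

```python
import io
import lzma


def iter_lines_from_binary_handle(path, encoding="utf-8"):
    """Yield text lines using a two-step open (hypothetical helper)."""
    # Open the compressed file in binary mode so that no text decoding is
    # applied to the raw .xz bytes.
    with open(path, "rb") as compressed:
        with lzma.LZMAFile(compressed) as uncompressed:
            # LZMAFile yields bytes; TextIOWrapper decodes them lazily,
            # which is also what lzma.open() does in text mode.
            with io.TextIOWrapper(uncompressed, encoding=encoding) as text:
                for line in text:
                    yield line
```

Without the TextIOWrapper step, iterating over the LZMAFile object directly gives bytes objects (printed as b'...'), which you would then have to decode yourself.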