Iterating over a large .xz file line by line in Python

Question

I have a large .xz file (a few gigabytes) full of plain text. I want to process the text to create a custom dataset, and I want to read it line by line because it is too big. Does anyone have an idea how to do this?

I already tried the approach from "How to open and read LZMA file in-memory", but it's not working.

EDIT: I got this error: 'ascii' codec can't decode byte 0xfd in position 0: ordinal not in range(128)

It is raised on the line "for line in uncompressed:" from the link.

EDIT2: My code (using Python 3.5):

with open(filename) as compressed:
    with lzma.LZMAFile(compressed) as uncompressed:
        for line in uncompressed:
            print(line)

Answer 1

I faced the same question a few weeks ago. This snippet worked for me:

import lzma
with lzma.open('filename.xz', mode='rt') as file:
    for line in file:
        print(line)

This assumes that the text data in the compressed file is encoded in UTF-8 (which was the case for my data). lzma.open() has an encoding argument that lets you set another encoding if needed.
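As a rough sketch of how the encoding argument could be used, the helper below wraps the same lzma.open() call in a generator. The name iter_xz_lines and its defaults are only illustrative, not part of the lzma module, and the right encoding depends on how your data was written.

```python
import lzma


def iter_xz_lines(path, encoding="utf-8", errors="strict"):
    """Yield decoded lines from an .xz file one at a time (illustrative helper)."""
    # mode="rt" makes lzma.open() return a text-mode file object, so each
    # line arrives as str and the data is decompressed as it is read.
    with lzma.open(path, mode="rt", encoding=encoding, errors=errors) as f:
        for line in f:
            # Drop the trailing newline so callers get the bare text.
            yield line.rstrip("\n")
```

For example, for a file written as Latin-1 you would call iter_xz_lines('filename.xz', encoding='latin-1') and loop over the result.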

EDIT (after your own edit): try forcing encoding='utf-8' in lzma.open().
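For what it is worth, 0xfd is the first byte of the .xz file header, so the error at position 0 suggests that the outer open(filename) in your code opened the compressed file in text mode and tried to decode the raw compressed bytes. If you want to keep the two-step structure, a minimal sketch could look like this, assuming the decompressed text is UTF-8. The function name is only illustrative.

```python
import io
import lzma


def iter_lines_two_step(path, encoding="utf-8"):
    """Yield text lines from an .xz file opened as a raw binary stream."""
    # "rb" is the important part: text mode would try to decode the
    # compressed container itself.
    with open(path, "rb") as compressed:
        with lzma.LZMAFile(compressed) as uncompressed:
            # LZMAFile yields bytes; TextIOWrapper decodes them lazily.
            with io.TextIOWrapper(uncompressed, encoding=encoding) as text:
                for line in text:
                    yield line
```

This does by hand roughly what lzma.open(..., mode='rt') does for you, so the shorter snippet is usually the more convenient choice.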

Answer 1
2 · accepted · 2018-03-18 13:04:29

在python中逐行迭代大型.xz文件

问题描述

1 个解决方案

解决方案1 2 已采纳 2018-03-18 13:04:29

解决方案1
2 已采纳 2018-03-18 13:04:29