简体   繁体   English

在 python 中读取 bz2 文件的第一行

[英]Reading first lines of bz2 files in python

I am trying to extract 10'000 first lines from a bz2 file.我正在尝试从 bz2 文件中提取 10'000 首行。

   import bz2       
   file = "file.bz2"
   file_10000 = "file.txt"

   output_file = codecs.open(file_10000,'w+','utf-8')

   source_file = bz2.open(file, "r")
   count = 0
   for line in source_file:
       count += 1
       if count < 10000:
           output_file.writerow(line)

But I get an error "'module' object has no attribute 'open'".但是我收到一个错误“‘module’对象没有属性‘open’”。 Do you have any ideas?你有什么想法吗? Or may be I could save 10'000 first lines to a txt file in some other way?或者我可以以其他方式将 10'000 行保存到 txt 文件中? I am on Windows.我在 Windows 上。

Here is a fully working example that includes writing and reading a test file that is much smaller than your 10000 lines.这是一个完整的示例,其中包括编写和读取一个远小于 10000 行的测试文件。 Its nice to have working examples in questions so we can test easily.很高兴在问题中有工作示例,以便我们可以轻松测试。

import bz2
import itertools
import codecs

file = "file.bz2"
file_10000 = "file.txt"

# write test file with 9 lines
with bz2.BZ2File(file, "w") as fp:
    fp.write('\n'.join('123456789'))

# the original script using BZ2File ... and 3 lines for test
# ...and fixing bugs:
#     1) it only writes 9999 instead of 10000
#     2) files don't do writerow
#     3) close the files

output_file = codecs.open(file_10000,'w+','utf-8')

source_file = bz2.BZ2File(file, "r")
count = 0
for line in source_file:
    count += 1
    if count <= 3:
       output_file.write(line)
source_file.close()
output_file.close()

# show what you got
print('---- Test 1 ----')
print(repr(open(file_10000).read()))   

A more efficient way to do it is to break out of the for loop after reading the lines you want.更有效的方法是在阅读您想要的行后跳出for循环。 you can even leverage iterators to thin up the code like so:你甚至可以利用迭代器来精简代码,如下所示:

# a faster way to read first 3 lines
with bz2.BZ2File(file) as source_file,\
        codecs.open(file_10000,'w+','utf-8') as output_file:
    output_file.writelines(itertools.islice(source_file, 3))

# show what you got
print('---- Test 2 ----')
print(repr(open(file_10000).read()))   

This is definitely a simpler way of doing it than the other answer, but it would be an easy way to do so in both Python2/3.这绝对是比其他答案更简单的方法,但在 Python2/3 中这将是一种简单的方法。 Also, it would short-circuit if you don't have >= 10,000 lines.此外,如果您没有 >= 10,000 行,它会短路。

from bz2 import BZ2File as bzopen

# writing to a file
with bzopen("file.bz2", "w") as bzfout:
    for i in range(123456):
        bzfout.write(b"%i\n" % i)

# reading a bz2 archive
with bzopen("file.bz2", "r") as bzfin:
    """ Handle lines here """
    lines = []
    for i, line in enumerate(bzfin):
        if i == 10000: break
        lines.append(line.rstrip())

print(lines)

Just another variation.只是另一种变体。

import bz2

myfile =  'c:\\my_dir\\random.txt.bz2'
newfile = 'c:\\my_dir\\random_10000.txt'

stream = bz2.BZ2File(myfile)
with open(newfile, 'w') as f:
  for i in range(1,10000):
    f.write(stream.readline())

This worked for me:这对我有用:

sudo apt-get install python-dev
sudo pip install backports.lzma

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM