How to avoid using readlines()?

Question

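I need to process very large txt input files, and I usually use .readlines() to read the whole file first and turn it into a list.

I know this is costly in memory and can be slow, but I also need the list features to work with specific lines, like this: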

#!/usr/bin/python

import os,sys
import glob
import commands
import gzip

path= '/home/xxx/scratch/'
fastqfiles1=glob.glob(path+'*_1.recal.fastq.gz')

for fastqfile1 in fastqfiles1:
    filename = os.path.basename(fastqfile1)
    job_id = filename.split('_')[0]
    fastqfile2 = os.path.join(path+job_id+'_2.recal.fastq.gz') 

    newfastq1 = os.path.join(path+job_id+'_1.fastq.gz') 
    newfastq2 = os.path.join(path+job_id+'_2.fastq.gz') 

    l1= gzip.open(fastqfile1,'r').readlines()
    l2= gzip.open(fastqfile2,'r').readlines()
    f1=[]
    f2=[]
    for i in range(0,len(l1)):
        if i % 4 == 3:
           b1=[ord(x) for x in l1[i]]
           ave1=sum(b1)/float(len(l1[i]))
           b2=[ord(x) for x in str(l2[i])]
           ave2=sum(b2)/float(len(l2[i]))
           if (ave1 >= 20 and ave2>= 20):
              f1.append(l1[i-3])
              f1.append(l1[i-2])
              f1.append(l1[i-1])
              f1.append(l1[i])
              f2.append(l2[i-3])
              f2.append(l2[i-2])
              f2.append(l2[i-1])
              f2.append(l2[i])
    output1=gzip.open(newfastq1,'w')
    output1.writelines(f1)
    output1.close()
    output2=gzip.open(newfastq2,'w')
    output2.writelines(f2)
    output2.close()

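In short, I walk through the text four lines at a time, and if the fourth line meets the required condition, I append those four lines to the output. Can I avoid readlines() and still achieve this? Thanks.

EDIT: Hi, I actually found a better way myself: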

import commands
l1=commands.getoutput('zcat ' + fastqfile1).splitlines(True)
l2=commands.getoutput('zcat ' + fastqfile2).splitlines(True)

I think 'zcat' is super fast... readlines takes about 15 minutes, while zcat takes only 1 minute...

Answer 1

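If you can refactor your code to read the file linearly, you can write for line in file to iterate over each line without reading the whole file into memory at once. However, since your file access looks more complicated, you can use a generator to replace readlines(). One way is to use itertools.izip or itertools.izip_longest: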

import gzip
import itertools

def four_at_a_time(iterable):
    """Returns an iterator that returns a 4-tuple of objects at a time from the
       given iterable"""
    # the same iterator repeated four times, so izip pulls four items per tuple
    args = [iter(iterable)] * 4
    return itertools.izip(*args)
...
l1 = four_at_a_time(gzip.open(fastqfile1, 'r'))
l2 = four_at_a_time(gzip.open(fastqfile2, 'r'))
for i, x in enumerate(itertools.izip(l1, l2)):
    # x is now a 2-tuple of 4-tuples of lines (one 4-tuple of lines from the first file,
    # and one 4-tuple of lines from the second file).  Process accordingly.
    pass
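
For readers on Python 3, where itertools.izip no longer exists and the built-in zip is already lazy, the same grouping idea might look like the sketch below. The in-memory io.StringIO objects and their contents are hypothetical stand-ins for the two open files, not data from the question.

```python
import io

def four_at_a_time(iterable):
    """Yield consecutive 4-tuples of items; a trailing partial group is dropped."""
    it = iter(iterable)
    # zip pulls from the same iterator four times to build each tuple
    return zip(it, it, it, it)

# Hypothetical stand-ins for two open files
file1 = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\n!!!!\n")
file2 = io.StringIO("@r1\nGGCC\n+\nHHHH\n@r2\nAAAA\n+\n####\n")

# Pair up the records of both inputs; nothing is read until zip is consumed
pairs = list(zip(four_at_a_time(file1), four_at_a_time(file2)))
for rec1, rec2 in pairs:
    print(rec1[0].strip(), rec1[3].strip(), rec2[3].strip())
```

In real use you would loop over the zip directly instead of calling list(), so that only one pair of records is held in memory at a time.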

Answer 2

A simple way is:

(pseudocode, may contain errors, for illustration only)

    a=gzip.open()
    b=gzip.open()

    last_four_a_lines=[]
    last_four_b_lines=[]

    idx=0

    new_a=[]
    new_b=[]

    while True:
      la=a.readline()
      lb=b.readline()
      if (not la) or (not lb):
        break

      # remember the current line, keeping only the last four
      last_four_a_lines.append(la)
      del last_four_a_lines[:-4]
      last_four_b_lines.append(lb)
      del last_four_b_lines[:-4]

      if idx % 4==3:
        a_calc=sum([ something ])/len(la)
        b_calc=sum([ something ])/len(lb)
        if a_calc and b_calc:
          for line in last_four_a_lines:
            new_a.append(line)
          for line in last_four_b_lines:
            new_b.append(line)

      idx+=1

    a.close()
    b.close()
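
The "something" placeholder above stands for the per-line average. One possible way to fill it in, mirroring the calculation in the question (raw ord() values over the whole line, trailing newline included), is sketched here in Python 3. The names mean_ord and passes are hypothetical, and scoring an empty line as zero is an assumption of this sketch.

```python
def mean_ord(line):
    """Average character code of a text line, newline included."""
    if not line:
        return 0.0  # assumption: an empty line scores zero instead of raising
    return sum(ord(ch) for ch in line) / len(line)

def passes(line_a, line_b, threshold=20):
    """True when both lines reach the threshold used in the question."""
    return mean_ord(line_a) >= threshold and mean_ord(line_b) >= threshold

print(mean_ord("AB"))
print(passes("II\n", "II\n"))
```

This expects text lines. Lines read from a file opened in binary mode are bytes in Python 3, and iterating over bytes already yields integers, so ord() would not be needed there.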

Answer 3

You can use enumerate to iterate over the lines of a file; on each iteration it returns a count and the line:

with open(file_name) as f:
    for i, line in enumerate(f):
        if i % 4 == 3:
            print i, line
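
If only every fourth line is needed, itertools.islice can do the index arithmetic lazily. This is a Python 3 sketch of that alternative, not part of the original answer; every_fourth_line and the sample data are hypothetical.

```python
import io
from itertools import islice

def every_fourth_line(lines):
    """Lazily yield the lines at 0-based indexes 3, 7, 11, ..."""
    return islice(lines, 3, None, 4)

# Hypothetical nine-line input standing in for an open file
sample = io.StringIO("".join("line%d\n" % n for n in range(1, 10)))
picked = list(every_fourth_line(sample))
print(picked)
```

Unlike the enumerate loop, this discards the three lines before each selected one, so it only fits when those lines are not needed afterwards.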

Answer 4

Here is how to print every line that contains foo, along with the 3 lines before it:

f = open(...)
prevlines = []
for line in f:
  prevlines.append(line)
  del prevlines[:-4]
  if 'foo' in line:
    print prevlines

If you are reading 2 files at a time (with the same number of lines), do it like this:

f1 = open(...)
f2 = open(...)
prevlines1 = []
prevlines2 = []
for line1 in f1:
  prevlines1.append(line1)
  del prevlines1[:-4]
  line2 = f2.readline()
  prevlines2.append(line2)
  del prevlines2[:-4]
  if 'foo' in line1 and 'bar' in line2:
    print prevlines1, prevlines2
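
The append-then-trim pattern above can also be written with collections.deque, which drops the oldest entry automatically once maxlen is reached. A small Python 3 sketch, with lines_with_context and the sample text as hypothetical names and data:

```python
import io
from collections import deque

def lines_with_context(lines, needle, size=4):
    """Yield the last `size` lines each time a line containing `needle` is seen."""
    window = deque(maxlen=size)  # older lines fall off the left end automatically
    for line in lines:
        window.append(line)
        if needle in line:
            yield list(window)

sample = io.StringIO("a\nb\nc\nfoo 1\nd\nfoo 2\n")
hits = list(lines_with_context(sample, "foo"))
print(hits)
```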

Answer 5

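Tricky, because you really have two files to process at the same time.

You can use the fileinput module to parse files efficiently one line at a time. It can also work through a list of files, and the idea here is to call fileinput.nextfile() inside the loop to switch between the files, consuming one line from each at a time. Note, however, that nextfile() closes the current file and moves on to the next one in the list, so it cannot return to an earlier file.

fileinput.filelineno() even gives you the line number within the current file (fileinput.lineno() is the cumulative count across all files). You can use temporary lists in the loop body to keep track of your 4-line blocks.

Completely untested ad hoc code, possibly based on a misunderstanding of what your code does, follows: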

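import fileinput

f1 = []
f2 = []
for line in fileinput.input([filename1, filename2]):
    if fileinput.filename() == filename1:
        f1.append(line)
    else:
        f2.append(line)
        # filelineno() is 1-based, so every 4th line is a multiple of 4
        if fileinput.filelineno() % 4 == 0:
            doMyProcessing()
            f1 = []; f2 = []
    fileinput.nextfile()  # closes the current file and advances to the next one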
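
Because fileinput works through its files one after another, a lockstep read of two files is sketched here with the built-in zip instead, keeping the temporary-list bookkeeping described above. This is a Python 3 illustration under that substitution; paired_blocks and the sample inputs are hypothetical.

```python
import io

def paired_blocks(lines1, lines2, block_size=4):
    """Read two line iterators in lockstep, yielding (block1, block2) pairs."""
    f1, f2 = [], []
    for line1, line2 in zip(lines1, lines2):  # stops at the shorter input
        f1.append(line1)
        f2.append(line2)
        if len(f1) == block_size:
            yield f1, f2
            f1, f2 = [], []  # start fresh lists for the next block

# Hypothetical stand-ins for two open files of equal length
first = io.StringIO("a1\na2\na3\na4\na5\na6\na7\na8\n")
second = io.StringIO("b1\nb2\nb3\nb4\nb5\nb6\nb7\nb8\n")
blocks = list(paired_blocks(first, second))
print(len(blocks))
```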

Answer 6

I don't think improving how l1 and l2 are obtained is enough: you have to improve your code as a whole.

I propose:

#!/usr/bin/python

import os
import sys
import glob
import gzip
from itertools import imap

path= '/home/xxx/scratch/'

def gen(gfa,gfb):
    # yield pairs of 4-line records whose fourth lines both average >= 20
    while True:
        a = (gfa.readline(),gfa.readline(),gfa.readline(),gfa.readline())
        b = (gfb.readline(),gfb.readline(),gfb.readline(),gfb.readline())
        if not a[3] or not b[3]:
            break   # end of either file reached
        if sum(imap(ord,a[3]))/float(len(a[3])) >= 20 \
           and sum(imap(ord,b[3]))/float(len(b[3])) >= 20:
            yield (a,b)

for fastqfile1 in glob.glob(path + '*_1.recal.fastq.gz') :
    pji = path + os.path.basename(fastqfile1).split('_')[0] # pji = path + job_id

    gf1= gzip.open(fastqfile1,'r')
    gf2= gzip.open(os.path.join(pji + '_2.recal.fastq.gz'),'r')

    output1=gzip.open(os.path.join(pji + '_1.fastq.gz'),'w')
    output2=gzip.open(os.path.join(pji + '_2.fastq.gz'),'w')

    for lines1,lines2 in gen(gf1,gf2):
        output1.writelines(lines1)
        output2.writelines(lines2)

    gf1.close()
    gf2.close()
    output1.close()
    output2.close()

It should reduce the execution time by 30%. Pure guess.
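
For comparison, a Python 3 version of the same streaming idea might look like the sketch below. It is only an illustration: filter_pairs and mean_ord are hypothetical names, the temporary files and their contents are made up for the demonstration, and the demo passes threshold=50 purely so that one record pair gets rejected.

```python
import gzip
import os
import tempfile

def mean_ord(line):
    """Average character code of a text line, newline included (as in the question)."""
    return sum(ord(ch) for ch in line) / len(line)

def filter_pairs(in1, in2, out1, out2, threshold=20):
    """Stream two gzipped files four lines at a time; return how many record pairs were kept."""
    kept = 0
    with gzip.open(in1, "rt") as a, gzip.open(in2, "rt") as b, \
         gzip.open(out1, "wt") as o1, gzip.open(out2, "wt") as o2:
        # zip(a, a, a, a) pulls four consecutive lines from the same file object
        for rec1, rec2 in zip(zip(a, a, a, a), zip(b, b, b, b)):
            if mean_ord(rec1[3]) >= threshold and mean_ord(rec2[3]) >= threshold:
                o1.writelines(rec1)
                o2.writelines(rec2)
                kept += 1
    return kept

# Demonstration on small made-up files in a temporary directory
tmp = tempfile.mkdtemp()
paths = {name: os.path.join(tmp, name + ".gz") for name in ("in1", "in2", "out1", "out2")}
with gzip.open(paths["in1"], "wt") as fh:
    fh.write("@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\n!!!!\n")
with gzip.open(paths["in2"], "wt") as fh:
    fh.write("@r1\nGGCC\n+\nHHHH\n@r2\nAAAA\n+\nHHHH\n")

kept = filter_pairs(paths["in1"], paths["in2"], paths["out1"], paths["out2"], threshold=50)
with gzip.open(paths["out1"], "rt") as fh:
    result1 = fh.read()
print(kept, repr(result1))
```

Only one pair of records is held in memory at any time, and the output is written as the input is read, so no list of kept lines builds up.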

PS:

The code

if sum(imap(ord,a[3]))/float(len(a[3])) >= 20 \
   and sum(imap(ord,b[3]))/float(len(b[3])) >= 20:

executes faster than

ave1 = sum(imap(ord,a[3]))/float(len(a[3])) 
ave2 = sum(imap(ord,b[3]))/float(len(b[3]))
if ave1 >= 20 and ave2 >=20:

because if the first average is below 20, the second average is never computed.
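
The short-circuit behaviour of and can be observed with a small experiment. In this sketch, average is a hypothetical helper that records each call so the two forms can be compared, and the input values are made up.

```python
calls = []

def average(label, values):
    calls.append(label)  # record that this average was actually computed
    return sum(values) / float(len(values))

low, high = [10, 12], [30, 40]

# Combined test: the right-hand average is skipped once the left one fails.
combined = average("first", low) >= 20 and average("second", high) >= 20
calls_combined = list(calls)

# Separate assignments: both averages are always computed.
del calls[:]
ave1 = average("first", low)
ave2 = average("second", high)
separate = ave1 >= 20 and ave2 >= 20
calls_separate = list(calls)

print(calls_combined, calls_separate)
```

Both forms give the same result; the saving only appears on records where the first file's line fails the test.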

6 answers

Answer 1: score 6, accepted, 2011-08-24 16:08:22
Answer 2: score 2, 2011-08-24 16:26:47
Answer 3: score 1, 2011-08-24 16:05:36
Answer 4: score 1, 2011-08-24 16:16:31
Answer 5: score 0, 2011-08-24 16:06:33
Answer 6: score 0, 2011-08-24 22:16:26
