Python - How to read a specific line in a text file?

I have a huge text file (12 GB). The lines are tab-delimited and the first column contains an ID. For each ID I want to do something. My plan is therefore to start at the first line and go through the first column line by line until the next ID is reached.

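import linecache

b = 2  # current line number (linecache is 1-based); starting value is assumed
start_line = b
num_lines = 377763316

while b < num_lines:
  # previous line, split into tab-separated columns
  plasmid1 = linecache.getline("Result.txt", b-1)
  plasmid1 = plasmid1.strip("\n")
  plasmid1 = plasmid1.split("\t")

  # current line, split into tab-separated columns
  plasmid2 = linecache.getline("Result.txt", b)
  plasmid2 = plasmid2.strip("\n")
  plasmid2 = plasmid2.split("\t")

  # the ID in the first column changed between the two lines
  if not str(plasmid1[0]) == str(plasmid2[0]):
    end_line = b
    #do something

  b += 1  # advance to the next line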

The code works, but the problem is that linecache seems to reload the text file every time. The code would run for several years if I don't improve the performance.

I would appreciate your help if you have a good idea for solving the issue or know an alternative approach!

Thanks, Philipp

You should open the file just once, and iterate over the lines.

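with open('Result.txt', 'r') as f:
    aline = f.next()  # first line (file.next() is a Python 2 method)
    currentid = aline.split('\t', 1)[0]
    for nextline in f:
        # only the first column is needed, so split at the first tab only
        nextid = nextline.split('\t', 1)[0]
        if nextid != currentid:
            #do stuff
            currentid = nextid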

You get the idea: just use plain Python. Only one line is read in each iteration. The extra 1 argument to split makes it split only at the first tab, which improves performance. You will not get better performance with any specialized library. Only a plain C implementation could beat this approach.

If you get AttributeError: '_io.TextIOWrapper' object has no attribute 'next', it is probably because you are using Python 3.x, where file objects have no next() method (see question io-textiowrapper-object). Try this version instead:

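with open('Result.txt', 'r') as f:
    aline = f.readline()
    currentid = aline.split('\t', 1)[0]
    while aline != '':
        # readline() returns '' at end of file, so nextid becomes '' there
        # and the branch below also fires once for the last ID
        aline = f.readline()
        nextid = aline.split('\t', 1)[0]
        if nextid != currentid:
            #do stuff
            currentid = nextid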
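
Since the question tracks start_line and end_line for each ID, the single-pass idea can also be sketched as a generator that reports those line ranges. This is only an illustration: the function name id_ranges and the in-memory sample data are hypothetical, and it assumes that lines with the same ID are adjacent in the file.

```python
import io


def id_ranges(lines):
    """Yield (id, start_line, end_line) for each run of consecutive
    lines that share the same first column. Line numbers are 1-based
    and inclusive."""
    current_id = None
    start = None
    lineno = 0
    for lineno, line in enumerate(lines, start=1):
        # Split at the first tab only; rstrip covers lines without a tab.
        line_id = line.split('\t', 1)[0].rstrip('\n')
        if line_id != current_id:
            if current_id is not None:
                yield current_id, start, lineno - 1
            current_id, start = line_id, lineno
    # Report the final run, which no ID change ever closes.
    if current_id is not None:
        yield current_id, start, lineno


# Small in-memory stand-in for Result.txt (hypothetical data).
sample = io.StringIO("p1\ta\np1\tb\np2\tc\np3\td\np3\te\n")
ranges = list(id_ranges(sample))
print(ranges)
```

With a real file, the same generator could be fed the open file object, for example id_ranges(f) inside a with open('Result.txt') block, so the file is still read only once.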

I think numpy.loadtxt() is the way to go. It would also be a good idea to pass the usecols argument to specify which columns you actually need from the file. NumPy is a solid library written with high performance in mind.

After calling loadtxt() you will get an ndarray back.
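
A minimal sketch of that suggestion, assuming the IDs are plain strings in the first tab-separated column; the sample data is hypothetical. Note that loadtxt() builds the whole column in memory, so for a file of this size it is worth checking first whether the ID column fits.

```python
import io

import numpy as np

# Small in-memory stand-in for the tab-delimited file (hypothetical data).
sample = io.StringIO("p1\ta\np1\tb\np2\tc\np3\td\np3\te\n")

# Read only the first column (the ID) as strings.
ids = np.loadtxt(sample, delimiter='\t', usecols=(0,), dtype=str)

# 0-based row indices at which a new ID starts: row 0, plus every row
# whose ID differs from the row before it.
changes = np.flatnonzero(ids[1:] != ids[:-1]) + 1
starts = np.concatenate(([0], changes))
print(ids[starts], starts)
```

Each entry of starts marks the first row of a block, so the block for one ID runs from its start up to the next start.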

You can use itertools:

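from itertools import takewhile

class EqualityChecker(object):
   """Callable that is True while a line's first column equals id."""
   def __init__(self, id):
       self.id = id

   def __call__(self, current_line):
       current_id = current_line.split('\t', 1)[0]
       return self.id == current_id


# ids (the IDs in file order) and do_stuff are placeholders
with open('hugefile.txt', 'r') as f:
   for id in ids:
       checker = EqualityChecker(id)
       # iterate over f directly; f.xreadlines() exists only in Python 2
       # note: takewhile also consumes the first non-matching line,
       # so that line is not seen by the next group
       for line in takewhile(checker, f):
           do_stuff(line)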

In the outer loop, id can actually be obtained from the first line whose ID does not match the previous value.
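
A related itertools tool, groupby, takes each ID from the lines themselves, so no separate list of IDs is needed and no line is skipped at a group boundary. This is a swapped-in technique rather than the approach above; the helper first_column and the sample data are hypothetical, and it assumes that lines with the same ID are adjacent.

```python
import io
from itertools import groupby


def first_column(line):
    # Key function: the text before the first tab.
    return line.split('\t', 1)[0].rstrip('\n')


# Small in-memory stand-in for the huge file (hypothetical data).
sample = io.StringIO("p1\ta\np1\tb\np2\tc\np3\td\np3\te\n")

counts = []
for current_id, group in groupby(sample, key=first_column):
    # group is a lazy iterator over the consecutive lines with this ID;
    # here each line is only counted, but any per-line work could go here.
    n = sum(1 for _ in group)
    counts.append((current_id, n))

print(counts)
```

groupby reads the file lazily, one line at a time, so memory use stays small as long as each group is processed before moving on to the next one.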
