
Write specific lines from a file (Python)

I have two files: in one I have a list of loci (Loci.txt) (about 16 million lines, to be exact), and in the second I have a list of line numbers (Pos.txt). What I want to do is write only the lines from Loci.txt that are specified in Pos.txt to a new file. Below is a truncated version of the two files:

Loci.txt

R000001 1
R000001 2
R000001 3
R000001 4
R000001 5
R000001 6
R000001 7
R000001 8
R000001 9
R000001 10

Pos.txt

1
3
5
9
10

Here is the code I have written for the task:

#!/usr/bin/env python

import sys

F1 = sys.argv[1]  # Pos.txt: the line numbers to keep
F2 = sys.argv[2]  # Loci.txt: the file to filter
F3 = sys.argv[3]  # output file

File1 = open(F1).readlines()
File2 = open(F2).readlines()
File3 = open(F3, 'w')
Lines = []

# Collect the wanted line numbers as integers
for line in File1:
    Lines.append(int(line))

# Write out every line of Loci.txt whose 1-based index is wanted
for i, line in enumerate(File2):
    if i+1 in Lines:
        File3.write(line)

The code works exactly like I want it to, and the output looks like this:

OUT.txt

R000001 1
R000001 3
R000001 5
R000001 9
R000001 10

The problem is that when I apply this to my whole data set, where I have to pull some 13 million lines from a file containing 16 million lines, it takes forever to complete. Is there any way I can write this code so that it runs faster?

Your code is slow mostly because you are searching a list to decide whether the current line has to be printed: if i+1 in Lines. Each time, your program scans the full list to find whether the line number is in it or not.
You can replace:

Lines = []

for line in File1:
    Lines.append(int(line))

with:

Lines = {}

# Dictionary membership tests are O(1) instead of O(n)
for line in File1:
    Lines[int(line)] = True
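
A set expresses the same constant-time membership test more idiomatically. This is a sketch of that variant, not part of the original answer:

Lines = set()

# Sketch: set membership is O(1), just like the dictionary above
for line in File1:
    Lines.add(int(line))

# or, more compactly:
# Lines = {int(line) for line in File1}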

You could try something like this:

import sys

F1 = sys.argv[1]  # Loci.txt: the file to filter
F2 = sys.argv[2]  # Pos.txt: the line numbers to keep
F3 = sys.argv[3]  # output file

File1 = open(F1)
File2 = open(F2)
File3 = open(F3, 'w')

lineno = 0
for linenumber in File2:
    wanted = int(linenumber)
    # The inner loop resumes where the previous search stopped,
    # so Pos.txt must be sorted in ascending order
    for line in File1:
        lineno += 1
        if lineno == wanted:
            File3.write(line)
            break

This might look terrible due to the nested for-loops, but since we are iterating over the lines of a file, the script simply continues from where it left off when the last matching line was found. This is because of how file reading works: a pointer is used to keep track of your location in the file. To read from the beginning of the file again, you would have to use the seek function to move the pointer back to the file's start.
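
As a small illustration of that pointer behavior (a sketch; the file name is just an example):

f = open("Loci.txt")
first = f.readline()   # reading advances the file pointer
f.seek(0)              # move the pointer back to the start
again = f.readline()   # reads the same first line again
f.close()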

You can try this code:

#!/usr/bin/env python

# Build a dictionary mapping each position (second column, as int)
# to its locus name (first column)
with open("loci.txt") as File1:
    lociDic = {int(line.split()[1]): line.split()[0] for line in File1}

with open("pos.txt") as File2:
    with open("result.txt", 'w') as File3:
        for line in File2:
            # line still carries its newline, so the join below
            # produces one complete output line
            if int(line) in lociDic:
                File3.write(' '.join([lociDic[int(line)], line]))

Key points in this solution are:

  1. Build the lookup table in a single pass over File1 (a dictionary is used)
  2. Avoid reading the entire File2 into memory at once (the files are iterated line by line inside with statements)

Also, I use the integer codes contained in File1 and File2 as dictionary keys because I suppose there is a possibility of holes in the File1 sequence; other solutions are possible otherwise.
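
For instance, if the positions in Pos.txt refer to physical line numbers rather than to the values in the second column, the lookup could be keyed by line number instead. A sketch of that variant (assuming 1-based positions):

with open("loci.txt") as File1:
    # Sketch: key each full line by its 1-based line number
    lociDic = {i: line for i, line in enumerate(File1, start=1)}

with open("pos.txt") as File2:
    with open("result.txt", 'w') as File3:
        for line in File2:
            pos = int(line)
            if pos in lociDic:
                File3.write(lociDic[pos])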

As others have mentioned, reading the entire file into memory first is what is causing the problem. Here is an alternative approach, which scans the large file and writes out only the lines that match:

with open('search_keys.txt', 'r') as f:
    filtered_keys = [line.rstrip() for line in f]

with open('large_file.txt', 'r') as haystack, open('output.txt', 'w') as results:
    for line in haystack:
        if len(line.strip()):  # skip blank lines
            if line.split()[1] in filtered_keys:
                results.write(line)  # line already ends with a newline

This way you read the big file only one line at a time and write out the results as you go.

Keep in mind that this won't sort the output.

If your search_keys.txt file is very large, converting filtered_keys to a set will improve lookup times.
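
For example, as a one-line change to the first snippet above (a sketch):

with open('search_keys.txt', 'r') as f:
    # A set comprehension gives O(1) membership tests
    filtered_keys = {line.rstrip() for line in f}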
