Write specific lines from a file in Python

I have two files: in one I have a list of loci (Loci.txt, about 16 million of them), and in the second I have a list of line numbers (Pos.txt). What I want to do is write only the lines from Loci.txt that are specified in the Pos.txt file to a new file. Below is a truncated version of the two files:

Loci.txt

R000001 1
R000001 2
R000001 3
R000001 4
R000001 5
R000001 6
R000001 7
R000001 8
R000001 9
R000001 10

Pos.txt

1
3
5
9
10

Here is the code I have written for the task

#!/usr/bin/env python

import sys

F1 = sys.argv[1]  # Pos.txt: the line numbers to keep
F2 = sys.argv[2]  # Loci.txt: the file to filter
F3 = sys.argv[3]  # the output file

File1 = open(F1).readlines()
File2 = open(F2).readlines()
File3 = open(F3, 'w')
Lines = []

for line in File1:
    Lines.append(int(line))

for i, line in enumerate(File2):
    if i+1 in Lines:
        File3.write(line)

The code works exactly like I want it to and the output looks like this

OUT.txt

R000001 1
R000001 3
R000001 5
R000001 9
R000001 10

The problem is that when I apply this to my whole data set, where I have to pull some 13 million lines from a file containing 16 million lines, it takes forever to complete. Is there any way I can rewrite this code so that it will run faster?

Your code is slow mostly because you are searching a list to decide whether a line has to be printed: if i+1 in Lines. Each time, your program scans the full list to check whether the line number is present or not.
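To see how much this matters, here is a rough sketch comparing a list scan with a hash-based lookup (the exact timings are machine-dependent; this snippet is only an illustration, not part of the original answer):

import timeit

numbers_list = list(range(1000000))
numbers_dict = {n: True for n in numbers_list}

# Linear scan: walks the list element by element in the worst case.
print(timeit.timeit(lambda: 999999 in numbers_list, number=100))

# Hash lookup: effectively constant time regardless of size.
print(timeit.timeit(lambda: 999999 in numbers_dict, number=100))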
You can replace:

Lines = []

for line in File1:
    Lines.append(int(line))

By:

Lines = {}

for line in File1:
    Lines[int(line)] = True
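A set gives the same constant-time membership test and is a natural fit when you only care about presence, not a stored value; a minimal equivalent sketch:

Lines = set()

for line in File1:
    Lines.add(int(line))

The rest of the script stays the same: if i+1 in Lines now performs a hash lookup instead of scanning the whole list.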

You could try something like this:

import sys

F1 = sys.argv[1]
F2 = sys.argv[2]
F3 = sys.argv[3]

File1 = open(F1)
File2 = open(F2)
File3 = open(F3, 'w')

for linenumber in File2:
    target = linenumber.strip()          # drop the trailing newline
    for line in File1:
        if line.split()[1] == target:    # exact match on the position field
            File3.write(line)
            break

This might look terrible due to the nested for-loops, but since we are iterating over the lines of a file, the script simply continues from where it left off when the previous line number was found. This is because of how file reading works: the file object keeps an internal pointer that tracks your position in the file. To read from the beginning of the file again, you would have to use the seek function to move the pointer back to the file's start. Note that this approach assumes the line numbers in Pos.txt and the positions in Loci.txt are both in ascending order; otherwise the inner loop can run past a match and never come back to it.
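A minimal sketch of that pointer behaviour (assuming a file named Loci.txt exists in the working directory):

with open("Loci.txt") as fh:
    first = next(fh)    # reads line 1
    second = next(fh)   # continues from where we left off: line 2
    fh.seek(0)          # move the pointer back to the start of the file
    again = next(fh)    # reads line 1 again
    assert first == again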

You can try this code:

#!/usr/bin/env python

with open("loci.txt") as File1:
    lociDic = {int(line.split()[1]): line.split()[0] for line in File1}

with open("pos.txt") as File2:
    with open("result.txt", 'w') as File3:
        for line in File2:
            if int(line) in lociDic:
                File3.write(' '.join([lociDic[int(line)], line]))

Key points in this solution are:

  1. A dictionary keyed on the position is built in the first step, so each lookup in the second step is a constant-time hash lookup rather than a linear scan.
  2. pos.txt is never read into memory all at once; it is processed line by line, and the with statements guarantee the files are closed.

I also key the dictionary on the integers contained in File1 and File2 because I suppose there may be holes in the File1 sequence. If the numbering were guaranteed to be contiguous, other solutions would be possible.
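To illustrate the point about holes, here is a tiny hypothetical example (the data is invented for illustration):

# Loci.txt content with a hole: position 2 is missing.
loci_lines = ["R000001 1\n", "R000001 3\n"]

lociDic = {int(l.split()[1]): l.split()[0] for l in loci_lines}

print(3 in lociDic)  # True: keyed on the position value, the lookup succeeds
print(2 in lociDic)  # False: position 2 simply is not there
# An index based on physical line numbers would instead map line 2 to
# "R000001 3", which may not be what Pos.txt refers to.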

As others have mentioned, reading the entire file into memory first is what causes the problem. Here is an alternative approach, which scans the large file line by line and writes out only those lines that match.

with open('search_keys.txt', 'r') as f:
    filtered_keys = [line.rstrip() for line in f]

with open('large_file.txt', 'r') as haystack, open('output.txt', 'w') as results:
    for line in haystack:
        if line.strip():  # skip blank lines
            if line.split()[1] in filtered_keys:
                results.write(line)  # 'line' already ends with a newline

This way you only read the big file one line at a time and write out the results at the same time.

Keep in mind that this won't sort the output.

If your search_keys.txt file is very large, converting filtered_keys to a set will improve lookup times.
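For example, with the same names as above, that is a one-line change:

with open('search_keys.txt', 'r') as f:
    filtered_keys = {line.rstrip() for line in f}  # set: constant-time average lookups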
