
Is there a better way to read several txt files in Python?

I have several files with approximately 1M rows each.

This is a sample of the file content:

begin(model(tb4)).
...
sequence_length(187).
amino_acid_pair_ratio(a,a,24.8).
amino_acid_pair_ratio(a,c,0.0).
...
tb_to_tb_evalue(tb3671,1.100000e-01).
tb_to_tb_evalue(tb405,4.300000e-01).
tb_to_tb_evalue(tb3225,5.600000e-01).
...
end(model(tb4))
begin(model(tb56)).
......
end(model(tb56))

Given an input like

myarray = (tb4, tb56..)

I need to calculate how many lines of the type "tb_to_tb_evalue" are contained in each model.

In this case, with the sample text, the output should be: tb4 = 3, tb56 = 0

This is what I have so far, but I realized I would have to read every file as many times as len(myarray):

import glob

def readorfs():
    # Path of the folder that stores the files
    path = "data/orfs"
    # Collect the file names
    all_files = glob.glob(path + "/*.txt")
    # Read the files line by line
    for filename in all_files:
        with open(filename) as f:
            lines = f.readlines()  # read the whole file as a list of lines
            for line in lines:
                if line.startswith("begin(model(") and any(name in line for name in myarray):
                    print(line)

Here is my suggestion: first, create a dictionary with all items of myarray as keys and 0 as values. Then create a function that handles a single file: load the entire file as text, split it on 'begin(model(' and count the occurrences of 'tb_to_tb_evalue' in each chunk, adding the results to the dictionary. Finally, run this function for all files. See below:

d = {name: 0 for name in myarray}

def readorfs(file):
    with open(file) as f:
        text = f.read()
    # Each chunk after the split starts with a model name, e.g. "tb4))...."
    chunks = text.split('begin(model(')[1:]
    for chunk in chunks:
        model = chunk[:chunk.find(')')]  # extract the model name
        if model in d:
            d[model] += chunk.count('tb_to_tb_evalue')
        else:
            d[model] = chunk.count('tb_to_tb_evalue')

all_files = glob.glob("data/orfs/*.txt")  # glob is used the same way as in the question's code
for filename in all_files:
    readorfs(filename)
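As a quick check, assuming a single file containing the sample content above and a hypothetical myarray = ('tb4', 'tb56') matching the question, the dictionary would end up as:

print(d)
# {'tb4': 3, 'tb56': 0}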

You can also process all files in one function, as below. In this case, pass myarray as a function parameter:

import glob

def readorfs(myarray):
    d = {name: 0 for name in myarray}
    path = "data/orfs"
    all_files = glob.glob(path + "/*.txt")
    for filename in all_files:
        with open(filename) as f:
            text = f.read()
        # Each chunk after the split starts with a model name
        chunks = text.split('begin(model(')[1:]
        for chunk in chunks:
            model = chunk[:chunk.find(')')]  # extract the model name
            if model in d:
                d[model] += chunk.count('tb_to_tb_evalue')
            else:
                d[model] = chunk.count('tb_to_tb_evalue')
    return d
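A minimal usage sketch, assuming the same hypothetical myarray as above:

myarray = ("tb4", "tb56")  # hypothetical input; the real list comes from the question
counts = readorfs(myarray)
print(counts)  # with the sample content: {'tb4': 3, 'tb56': 0}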

Here is a suggestion using a defaultdict from collections, which uses the model name as the key and the count of tb_to_tb_evalue lines for that model as the value. Because you are reading all the files completely anyway, there is no real extra overhead in counting every model, and it is straightforward at the end to pull out the counts for the specific models in your list.

from collections import defaultdict
import glob
import re

tb_count = defaultdict(int)
# Regular expression to extract the model name from the "begin(model(...))" lines
model_regex = re.compile(r"begin\(model\((.*)\)\)")

all_filenames = glob.glob("data/orfs/*.txt")  # same folder as in the question
for file in all_filenames:
    model = None  # reset for each file; updated whenever a begin(model( line is encountered
    with open(file) as f:
        for line in f:
            if line.startswith("begin(model("):
                # identify the model name
                match = model_regex.search(line)
                if match:
                    model = match.group(1)
            if line.startswith("tb_to_tb_evalue("):
                tb_count[model] += 1  # increase the count for the currently active model

So it iterates through all your files, but only once. At the end, to get the counts for the models in a specific list (e.g. myarray), you can write something like:

models_of_interest = {k: v for k, v in tb_count.items() if k in myarray}
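Note that a model with no tb_to_tb_evalue lines (such as tb56 in the sample) never gets an entry in the defaultdict, so it will be missing from the comprehension above. If you also want those models reported with a count of 0, a small variant (assuming the same hypothetical myarray) is to look each name up with a default:

models_of_interest = {name: tb_count.get(name, 0) for name in myarray}
# with the sample content and myarray = ('tb4', 'tb56'): {'tb4': 3, 'tb56': 0}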
