
Is there a better way to read several txt files in Python?

I have several files with approximately 1M rows each.

This is a sample of the file content:

begin(model(tb4)).
...
sequence_length(187).
amino_acid_pair_ratio(a,a,24.8).
amino_acid_pair_ratio(a,c,0.0).
...
tb_to_tb_evalue(tb3671,1.100000e-01).
tb_to_tb_evalue(tb405,4.300000e-01).
tb_to_tb_evalue(tb3225,5.600000e-01).
...
end(model(tb4))
begin(model(tb56)).
......
end(model(tb56))

Given an input like

myarray = (tb4, tb56..)

I need to count how many lines of the type "tb_to_tb_evalue" each model contains.

In this case, with the sample text, the output should be: tb4 = 3, tb56 = 0

This is what I have so far, but I realized that I would have to read every file as many times as len(myarray):

import glob

def readorfs():
    # Path of the folder that stores the files
    path = "data/orfs"
    # Collect the file names
    all_files = glob.glob(path + "/*.txt")
    # Read the files line by line
    for filename in all_files:
        with open(filename) as f:
            lines = f.readlines()  # Reads the whole file into a list of lines
            for line in lines:
                # i is the index of the model being searched, so the files
                # would have to be re-read once per entry of myarray
                if line.startswith("begin(model(") and (myarray[i]) in line:
                    print(line)

Here is my suggestion: first, create a dictionary with all items of myarray as keys and 0 as every value. Then create a function that handles a single file: load the entire file as text, split it on 'begin(model(' and count the occurrences of 'tb_to_tb_evalue' in each chunk, adding the results to the dictionary. Finally, run this function for all files. See below:

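import glob

all_files = glob.glob("data/orfs/*.txt")
d = {i: 0 for i in myarray}  # one counter per model of interest

def readorfs(file):
    # Read the whole file at once; the with block closes it afterwards
    with open(file) as f:
        t = f.read()
    # Each chunk starts with the model name, e.g. "tb4)).\n..."
    l = t.split(sep='begin(model(')[1:]
    for i in l:
        s = i[:i.find(')')]  # model name is everything before the first ')'
        # Models that are not in myarray are added to d as well
        d[s] = d.get(s, 0) + i.count('tb_to_tb_evalue')

for filename in all_files:
    readorfs(filename)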
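To check the split-and-count idea without touching the disk, here is a minimal sketch that runs it on the sample text from the question. The function name count_evalues and the SAMPLE constant are made up for this illustration, and the elided "..." lines of the sample are simply left out.

```python
# Sample text copied from the question (the elided "..." lines are omitted).
SAMPLE = """begin(model(tb4)).
sequence_length(187).
amino_acid_pair_ratio(a,a,24.8).
tb_to_tb_evalue(tb3671,1.100000e-01).
tb_to_tb_evalue(tb405,4.300000e-01).
tb_to_tb_evalue(tb3225,5.600000e-01).
end(model(tb4))
begin(model(tb56)).
end(model(tb56))
"""

def count_evalues(text, models):
    """Count 'tb_to_tb_evalue' occurrences per model in one block of text."""
    counts = {m: 0 for m in models}
    # Each chunk after the split starts with the model name, e.g. "tb4)).\n..."
    for chunk in text.split("begin(model(")[1:]:
        name = chunk[:chunk.find(")")]
        counts[name] = counts.get(name, 0) + chunk.count("tb_to_tb_evalue")
    return counts

result = count_evalues(SAMPLE, ["tb4", "tb56"])
print(result)  # {'tb4': 3, 'tb56': 0}
```

This matches the output the question asks for. Keep in mind that reading a whole file with read() holds all of it in memory at once.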

You can also process all files in one function, as below. In this case, you must pass myarray as a function parameter:

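import glob

def readorfs(myarray):
    d = {i: 0 for i in myarray}  # one counter per model of interest
    path = "data/orfs"
    all_files = glob.glob(path + "/*.txt")
    for filename in all_files:
        # Read the whole file at once; the with block closes it afterwards
        with open(filename) as f:
            t = f.read()
        # Each chunk starts with the model name, e.g. "tb4)).\n..."
        l = t.split(sep='begin(model(')[1:]
        for i in l:
            s = i[:i.find(')')]  # model name is everything before the first ')'
            # Models that are not in myarray are added to d as well
            d[s] = d.get(s, 0) + i.count('tb_to_tb_evalue')
    return d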
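One thing to be aware of: str.count counts every occurrence of the substring, wherever it appears in the chunk. If the files might mention tb_to_tb_evalue somewhere other than the start of a line, a line-anchored regular expression is a possible alternative. The sketch below is only an illustration; the "note(...)" line is hypothetical and does not come from the question's data.

```python
import re

# Matches only lines that begin with "tb_to_tb_evalue(".
EVALUE_LINE = re.compile(r"^tb_to_tb_evalue\(", flags=re.MULTILINE)

chunk = (
    "tb4)).\n"
    "tb_to_tb_evalue(tb3671,1.100000e-01).\n"
    "note(see tb_to_tb_evalue above).\n"  # hypothetical line, not from the question
)

substring_count = chunk.count("tb_to_tb_evalue")  # counts both mentions
line_count = len(EVALUE_LINE.findall(chunk))      # counts only the real record
print(substring_count, line_count)
```

For the sample shown in the question both counts agree, so this only matters if the real files are messier than the sample.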

Here is a suggestion using a defaultdict from collections, which uses the model name as the key and the count of tb_to_tb_evalue lines for that model as the value. Because you're reading every file completely anyway, there is no real extra overhead in counting for all the models. In the end it is straightforward to get the counts for your specific models from a list.

from collections import defaultdict
import glob
import re

all_filenames = glob.glob("data/orfs/*.txt")  # same folder as in the question
tb_count = defaultdict(int)
# create regular expression to find the model name from the "begin(model(...))" lines
model_regex = re.compile(r"begin\(model\((.*)\)\)")
for file in all_filenames:
    model = None  # reset for each file; set once a begin(model( line is encountered
    with open(file) as f:
        for line in f:  # iterating over the file reads one line at a time
            if line.startswith("begin(model("):
                # identify the model name
                match = model_regex.search(line)
                if match:
                    model = match.group(1)
            elif line.startswith("tb_to_tb_evalue("):
                tb_count[model] += 1  # increase the count for the current active model

This way it iterates through all your files, but only once. At the end, to get the counts for the models in a specific list (e.g. myarray), you can write something like:

# .get(k, 0) also reports models that had no tb_to_tb_evalue lines (e.g. tb56)
models_of_interest = {k: tb_count.get(k, 0) for k in myarray}
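For a quick end-to-end check of the single-pass approach, the sketch below writes the sample from the question to a temporary folder and counts per model while streaming the file line by line. The function name count_per_model and the file name sample.txt are made up for this illustration, and the model name is extracted with plain string slicing rather than the regular expression above. A model with no tb_to_tb_evalue lines never becomes a key of the defaultdict, so the final lookup falls back to 0.

```python
import glob
import os
import tempfile
from collections import defaultdict

SAMPLE = """begin(model(tb4)).
sequence_length(187).
tb_to_tb_evalue(tb3671,1.100000e-01).
tb_to_tb_evalue(tb405,4.300000e-01).
tb_to_tb_evalue(tb3225,5.600000e-01).
end(model(tb4))
begin(model(tb56)).
end(model(tb56))
"""

def count_per_model(filenames, models):
    """Read every file once and return the tb_to_tb_evalue count per model."""
    counts = defaultdict(int)
    for filename in filenames:
        model = None  # no active model until a begin(model( line is seen
        with open(filename) as f:
            for line in f:  # streams the file, one line in memory at a time
                if line.startswith("begin(model("):
                    # "begin(model(tb4))." -> "tb4"
                    model = line[len("begin(model("):].split(")", 1)[0]
                elif line.startswith("tb_to_tb_evalue("):
                    counts[model] += 1
    # Models that never matched are absent from counts, so default to 0.
    return {m: counts.get(m, 0) for m in models}

with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "sample.txt"), "w") as f:
        f.write(SAMPLE)
    files = glob.glob(os.path.join(tmp, "*.txt"))
    result = count_per_model(files, ["tb4", "tb56"])

print(result)  # {'tb4': 3, 'tb56': 0}
```

Because the file is streamed rather than loaded whole, memory use stays roughly constant even for files with around 1M rows.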
