
Improve performance reading and writing txt files with Python

I am using Python to manage txt files, reading and writing them several times. But with 1000 txt files, my code takes a very long time to process them.

How can I improve the performance when handling these files?

I have files with this information:

position   temperature
0,0        30,10
1,0        45,12
2,0        20,45 (...)

First, I need to remove the lines that contain strings. To do that, I search for the strings, create a new txt file, and copy everything except the lines with strings into that new file. I get this:

0,0       30,10
1,0       45,12
2,0       20,45 (...)

Then I replace the , with . in all the files, again creating a new file and copying the information with points into these new files. I get this:

0.0      30.10
1.0      45.12
2.0      20.45 (...)

Then I need to restrict the information to a minimum position value (a) and a maximum value (b). So I only want the lines whose first column lies between a and b. Again I create new files and copy the information that I want into these files.

Finally I add a new column to each file with some information.

So, I think the time consumed in my code is related to the number of times that I create new files, copy the information, and replace the old files with the new ones.

Maybe all of this can be done in one step. I just need to know if it is possible.

Thanks

You don't need to create new files just to delete the first line or to replace commas with dots. You can do all the work in memory: read the data from the file, replace commas with dots, convert the values to floats, sort them, trim the minimum and maximum values, and write the result to a file, like this:

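data = []
with open('input_file', 'r') as input_file:
    input_file.readline()  # skip the first line (column titles)
    for line in input_file:  # remaining lines hold the data
        if not line.strip():
            continue  # ignore blank lines
        # replace decimal commas with dots and convert each field to float
        data.append([float(x.replace(',', '.')) for x in line.split()])

data.sort(key=lambda row: row[1])  # sort rows by the second column
data = data[1:-1]  # drop the first and last rows after sorting

with open('result_file', 'w') as result_file:
    # turn the floats back into text, one tab-separated row per line
    result_file.writelines('\t'.join(map(str, row)) + '\n' for row in data)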
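
The snippet above trims the extreme rows after sorting, whereas the question asks for the rows whose position lies between a and b. A minimal sketch of that variant follows, assuming two whitespace-separated columns per data line and that any line which does not parse as two numbers (such as the header) should simply be skipped. The name filter_by_position and the sample rows are illustrative only.

```python
def filter_by_position(lines, a, b):
    """Yield (position, temperature) floats for rows with a <= position <= b."""
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # blank or malformed line
        try:
            position = float(fields[0].replace(',', '.'))
            temperature = float(fields[1].replace(',', '.'))
        except ValueError:
            continue  # header or any other text line
        if a <= position <= b:
            yield position, temperature


# Illustrative input in the same shape as the files in the question.
sample = [
    "position   temperature",
    "0,0        30,10",
    "1,0        45,12",
    "2,0        20,45",
]
rows = list(filter_by_position(sample, 1.0, 2.0))
```

Because the function accepts any iterable of lines, an open file object can be passed directly, so the header removal, the comma replacement and the range filter all happen in one read of the file.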

You're definitely doing it the hard way... A lot of details are missing from your problem description, but assuming a few things (the headers are always on the first line, "position" and "temperature" are supposed to be floats, etc.), here's a code example that does most of what you describe in a single pass:

import sys

def parse(path):
    with open(path) as f:
        # skip the headers (does nothing if the file is empty)
        next(f, None)

        # parse data
        for lineno, line in enumerate(f, 1):
            try:
                position, temperature = line.split()
                position = float(position.replace(",", "."))
                temperature = float(temperature.replace(",", "."))
            except ValueError:
                raise ValueError("Invalid line at %s:#%s" % (path, lineno))

            yield position, temperature


def process(inpath, outpath, minpos, maxpos):
    # warning: exact comparisons on floating points numbers
    # are not safe
    filterpos = lambda r: minpos <= r[0] <= maxpos

    with open(outpath, "w") as outfile:
        # the built-in filter() is lazy in Python 3
        # (on Python 2, use itertools.ifilter instead)
        for pos, temp in filter(filterpos, parse(inpath)):
            extra_col = compute_something_from(pos, temp)
            outfile.write("{}\t{}\t{}\n".format(pos, temp, extra_col))


def compute_something_from(pos, temp):
    # anything
    return pos * (temp / 3.)


def main(*args):
    # TODO : 
    # - clean options / args handling
    # - outfile naming ? 
    minpos = float(args[0])
    maxpos = float(args[1])
    for inpath in args[2:]:
        outpath = inpath + ".tmp"
        process(inpath, outpath, minpos, maxpos)

if __name__ == "__main__":
    main(*sys.argv[1:])
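
The question also mentions replacing the old files with the new ones. One way to do that for a whole batch is sketched below, assuming Python 3 and that overwriting the originals is acceptable. The names rewrite_in_place, rewrite_all and commas_to_dots are hypothetical helpers, not part of the answers above; commas_to_dots only stands in for whatever per-line processing is needed.

```python
import glob
import os
import tempfile


def rewrite_in_place(path, transform):
    """Stream `path` through `transform`, then replace the original file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory so that os.replace
    # does not have to move data across filesystems.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as outfile, open(path) as infile:
            for out_line in transform(infile):
                outfile.write(out_line)
        os.replace(tmp_path, path)
    except BaseException:
        # Leave the original untouched and clean up the partial output.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def rewrite_all(pattern, transform):
    """Apply rewrite_in_place to every file matching a glob pattern."""
    for path in sorted(glob.glob(pattern)):
        rewrite_in_place(path, transform)


def commas_to_dots(lines):
    """Example transform: drop the header line, convert decimal commas."""
    next(lines, None)  # skip the header if there is one
    for line in lines:
        yield line.replace(",", ".")
```

A call such as rewrite_all("data/*.txt", commas_to_dots), where the pattern is only an example, reads each file once and writes it once, instead of creating one intermediate file per processing step.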
