
Fastest way to re-read a file in Python?

I've got a file containing a list of names and their positions (start - end).

My script iterates over that file, and for each name it reads a second file of data, checking whether each line falls between those positions and then calculating something from the matching lines.

At the moment it reads the whole second file (60 MB) line by line, checking whether each line is between the start/end, and it does this for every name in the first list (approx. 5000 names). What's the fastest way to collect the data between those bounds instead of re-reading the whole file 5000 times?

Sample code of the second loop:

for line in file:
    value = int(line.split()[2])  # split once instead of twice
    if start <= value <= end:
        do_something_with(line)

EDIT: Loading the second file into a list before the first loop, and iterating over that list, improved the speed.

with open("filename.txt", 'r') as f:
    file2 = f.readlines()

for line in file:
    [...]
    for line2 in file2:
        [...]
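Pushing the EDIT one step further: the integer column can be parsed once, up front, so the inner loop only compares pre-parsed integers instead of calling `split()` and `int()` on every line for every name. A minimal sketch, assuming the position is in the third whitespace-separated column (as in the question's sample code); the file name and function names are hypothetical:

```python
def load_positions(path):
    """Read the data file once; return a list of (position, raw_line) pairs."""
    records = []
    with open(path) as f:
        for line in f:
            # Parse the integer column a single time, at load time.
            records.append((int(line.split()[2]), line))
    return records

def lines_between(records, start, end):
    """Yield the raw lines whose position falls in [start, end]."""
    for pos, line in records:
        if start <= pos <= end:
            yield line
```

With this, the outer loop over the ~5000 names calls `lines_between(records, start, end)` instead of re-reading and re-parsing the 60 MB file each time.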

You can use the mmap module to load that file into memory, then iterate.

Example:

import mmap

# write a simple example file
with open("hello.txt", "wb") as f:
    f.write(b"Hello Python!\n")

with open("hello.txt", "r+b") as f:
    # memory-map the file, size 0 means whole file
    mm = mmap.mmap(f.fileno(), 0)
    # read content via standard file methods
    print(mm.readline())  # prints b"Hello Python!\n"
    # read content via slice notation
    print(mm[:5])  # prints b"Hello"
    # update content using slice notation;
    # note that new content must have same size
    mm[6:] = b" world!\n"
    # ... and read again using standard file methods
    mm.seek(0)
    print(mm.readline())  # prints b"Hello  world!\n"
    # close the map
    mm.close()
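Applied to the question, the memory-mapped file can be scanned line by line with `readline()`. A sketch under the same assumptions as the question's sample code (position in the third column); the function name and file path are hypothetical:

```python
import mmap

def scan_between(path, start, end):
    """Collect lines of a memory-mapped file whose third column is in [start, end]."""
    with open(path, "rb") as f:
        # Map the whole file read-only; the OS pages it in as needed.
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        matches = []
        line = mm.readline()
        while line:
            # Lines come back as bytes; int() accepts the bytes column directly.
            value = int(line.split()[2])
            if start <= value <= end:
                matches.append(line)
            line = mm.readline()
        mm.close()
    return matches
```

Note that mmap mostly helps by avoiding repeated read syscalls; for 5000 scans of the same 60 MB file, parsing the file once (as in the EDIT) is still the bigger win.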

Maybe switch your loops around? Make iterating over the file the outer loop, and iterating over the name list the inner loop.

name_and_positions = [
    ("name_a", 10, 45),
    ("name_b", 2, 500),
    ("name_c", 96, 243),
]

with open("somefile.txt") as f:
    for line in f:
        value = int(line.split()[2])
        for name, start, end in name_and_positions:
            if start <= value <= end:
                print("matched {} with {}".format(name, value))

It seems to me that your problem is not so much re-reading files as matching slices of a long list against a short list. As other answers have pointed out, you can use plain lists or memory-mapped files to speed up your program.

If you want to use specific data structures for a further speed-up, then I would advise you to look into blist, specifically because it performs better at slicing than the standard Python list: the authors claim O(log n) instead of O(n).

I have measured a speedup of almost 4x on lists of ~10 MB:

import random

from blist import blist

LINE_NUMBER = 1000000


def write_files(line_length=LINE_NUMBER):
    with open('haystack.txt', 'w') as outfile:
        for _ in range(line_length):
            outfile.write('an example\n')

    with open('needles.txt', 'w') as outfile:
        # integer division: range() needs an int in Python 3
        for _ in range(line_length // 100):
            first_rand = random.randint(0, line_length)
            second_rand = random.randint(first_rand, line_length)
            needle = random.choice(['an example', 'a sample'])
            outfile.write('%s\t%s\t%s\n' % (needle, first_rand, second_rand))


def read_files():
    with open('haystack.txt', 'r') as infile:
        normal_list = []
        for line in infile:
            normal_list.append(line.strip())

    enhanced_list = blist(normal_list)
    return normal_list, enhanced_list


def match_over(list_structure):
    matches = 0
    total = len(list_structure)
    with open('needles.txt', 'r') as infile:
        for line in infile:
            needle, start, end = line.split('\t')
            start, end = int(start), int(end)
            if needle in list_structure[start:end]:
                matches += 1
    return float(matches) / float(total)

As measured by IPython's %time command, the blist takes 12 s where the plain list takes 46 s:

In [1]: import main

In [3]: main.write_files()

In [4]: !ls -lh *.txt
10M haystack.txt
233K needles.txt

In [5]: normal_list, enhanced_list = main.read_files()

In [8]: %time main.match_over(normal_list)
CPU times: user 44.9 s, sys: 1.47 s, total: 46.4 s
Wall time: 46.4 s
Out[8]: 0.005032

In [9]: %time main.match_over(enhanced_list)
CPU times: user 12.6 s, sys: 33.7 ms, total: 12.6 s
Wall time: 12.6 s
Out[9]: 0.005032
