
Python speed up random disk read on HDD

I have a large set of files (each ~100 KB) sitting on my HDD. For each step of my algorithm, I need to randomly select and read in about 1000 files. I use Python and numpy.load to do this, and it is slow as heck.

How can I speed this up? My intuition is that (short of buying an SSD) I could schedule all the reads at once and let the OS find an order that minimizes seek time. However, I'm not sure how to implement this in Python.

  • maybe spawn 1000 threads, each of which performs a read (roughly what I sketch below)?
  • is there an asynchronous numpy.load or equivalent?
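
For concreteness, something like this is roughly what I have in mind (completely untested, and file_paths is just a placeholder for my list of randomly selected .npy paths):

from concurrent.futures import ThreadPoolExecutor
import numpy as np

def load_batch(file_paths, max_workers=32):
    # Issue all reads at once through a thread pool; the blocking
    # file I/O releases the GIL, so many reads are in flight at the
    # same time and the OS can reorder them to reduce seeking.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(np.load, file_paths))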

Any help is appreciated! Thanks :)

A very simple example of the approach from my comment:

from threading import Thread
import timeit

arr = []

def populate(content):
    # Append every item of `content` to the shared list `arr`
    # (list.append is thread-safe in CPython)
    for i in content:
        arr.append(i)

content1 = [i for i in range(1, 10000000)]
content2 = [i for i in range(10000000, 20000001)]

thread1 = Thread(target=populate, kwargs={'content': content1})
thread2 = Thread(target=populate, kwargs={'content': content2})

# run both populate() calls concurrently and time the whole thing
start = timeit.default_timer()
thread1.start()
thread2.start()
thread1.join()
thread2.join()
stop = timeit.default_timer()

print("time taken:", stop - start)

It took 3.546857 seconds with threads and 4.7215586 seconds without threads.

So there is a little speed-up. You can add more threads for your problem, and the times should improve further.

In your case, each thread would load a fraction of the 1000 selected files with np.load and then append that data to your main array.
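
Applied to your problem, a rough (untested) sketch could look like this, assuming `selected` is your list of 1000 randomly chosen .npy file paths:

from threading import Thread
import numpy as np

arrays = []  # the data collected from all threads

def load_files(paths):
    # Each thread loads its share of the files; list.append is
    # thread-safe in CPython, and the GIL is released during disk I/O.
    for p in paths:
        arrays.append(np.load(p))

def load_all(selected, n_threads=8):
    chunk = len(selected) // n_threads + 1
    threads = [Thread(target=load_files, args=(selected[i:i + chunk],))
               for i in range(0, len(selected), chunk)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return arrays

With more threads, more reads are outstanding at the same time, which gives the disk scheduler more chances to batch the seeks.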


 