
Python speed up random disk read on HDD

I have a large set of files (each ~100 KB) sitting on my HDD. For each step of my algorithm, I need to randomly select and read in about 1000 files. I use Python and numpy.load to do this, and it is slow as heck.

How can I speed this up? My intuition is that (short of buying an SSD) I could schedule all the reads at once and let the OS find an order that minimizes seek time. However, I'm not sure how to implement this in Python.

  • maybe spawn 1000 threads, each of which performs a read (roughly what I sketch below)?
  • is there an asynchronous numpy.load or equivalent?
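
For concreteness, something like this is roughly what I have in mind (completely untested, and file_paths is just a placeholder for my list of randomly selected .npy paths):

from concurrent.futures import ThreadPoolExecutor
import numpy as np

def load_batch(file_paths, max_workers=32):
    # Issue all reads at once through a thread pool; the blocking
    # file I/O releases the GIL, so many reads are in flight at the
    # same time and the OS can reorder them to reduce seeking.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(np.load, file_paths))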

Any help is appreciated! Thanks :)

A very simple example of the approach from my comment:

from threading import Thread
import timeit

arr = []

def populate(content):
    # Append every item of `content` to the shared list `arr`
    # (list.append is thread-safe in CPython)
    for i in content:
        arr.append(i)

content1 = [i for i in range(1, 10000000)]
content2 = [i for i in range(10000000, 20000001)]

thread1 = Thread(target=populate, kwargs={'content': content1})
thread2 = Thread(target=populate, kwargs={'content': content2})

# run both populate() calls concurrently and time the whole thing
start = timeit.default_timer()
thread1.start()
thread2.start()
thread1.join()
thread2.join()
stop = timeit.default_timer()

print("time taken:", stop - start)

It took 3.546857 seconds with threads and 4.7215586 seconds without threads.

So there is a little speed-up. You can add more threads for your problem, and the times should improve further.

In your case, each thread would load a fraction of the 1000 selected files with np.load and then append that data to your main array.
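
Applied to your problem, a rough (untested) sketch could look like this, assuming `selected` is your list of 1000 randomly chosen .npy file paths:

from threading import Thread
import numpy as np

arrays = []  # the data collected from all threads

def load_files(paths):
    # Each thread loads its share of the files; list.append is
    # thread-safe in CPython, and the GIL is released during disk I/O.
    for p in paths:
        arrays.append(np.load(p))

def load_all(selected, n_threads=8):
    chunk = len(selected) // n_threads + 1
    threads = [Thread(target=load_files, args=(selected[i:i + chunk],))
               for i in range(0, len(selected), chunk)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return arrays

With more threads, more reads are outstanding at the same time, which gives the disk scheduler more chances to batch the seeks.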


 