
Iterating through a huge loop efficiently using Python

I have 100000 images and I need to get a vector for each image:

import cPickle  # Python 2; on Python 3 use the pickle module instead

imageVectors = []
for i in range(100000):
    fileName = "Images/" + str(i) + '.jpg'
    imageVectors.append(getvector(fileName).reshape((1, 2048)))
cPickle.dump(imageVectors, open('imageVectors.pkl', "w+b"), cPickle.HIGHEST_PROTOCOL)

getvector is a function that takes one image at a time and needs about 1 second to process it. So, basically, my problem reduces to:

for i in range(100000):
    A = callFunction(i)  # a complex function that takes 1 sec per call

The things that I have already tried are (only pseudocode is given here):

1) Using numpy vectorize:

def callFunction1(i):
    return callFunction2(i)
vfunc = np.vectorize(callFunction1)
imageVectors = vfunc(list(range(100000)))

2) Using python map:

def callFunction1(i):
    return callFunction2(i)
imageVectors = map(callFunction1, list(range(100000)))

3) Using python multiprocessing:

import multiprocessing
try:
    cpus = multiprocessing.cpu_count()
except NotImplementedError:
    cpus = 4   # arbitrary default

pool = multiprocessing.Pool(processes=cpus)
result = pool.map(callFunction, range(100000))

4) Using multiprocessing in a different way:

from multiprocessing import Process, Queue
q = Queue()
N = 100000
p1 = Process(target=callFunction, args=(N // 4, q))
p1.start()
p2 = Process(target=callFunction, args=(N // 4, q))
p2.start()
p3 = Process(target=callFunction, args=(N // 4, q))
p3.start()
p4 = Process(target=callFunction, args=(N // 4, q))
p4.start()

results = []
for i in range(4):
    results.append(q.get(True))
p1.join()
p2.join()
p3.join()
p4.join()

All the above methods take an immensely long time. Is there any more efficient way, so that I can process many elements simultaneously instead of sequentially, or speed this up in some other way?


The time is mainly being taken by the getvector function itself. As a workaround, I have split my data into 8 batches, running the same program on different parts of the loop as eight separate instances of Python on an octa-core VM in Google Cloud. Could anyone suggest whether map-reduce, or using GPUs via PyCUDA, might be a good option?

The multiprocessing.Pool solution is a good one, in the sense that it uses all your cores. So it should be approximately N times faster than using plain old map, where N is the number of cores you have.

BTW, you can skip determining the number of cores. By default, multiprocessing.Pool uses as many worker processes as your CPU has cores.

Instead of a plain map (which blocks until everything has been processed), I would suggest using imap_unordered. This is an iterator that starts returning results as soon as they become available, so your parent process can begin further processing right away. If ordering is important, you might want to return a tuple (number, array) to identify each result.
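A minimal sketch of that suggestion, with `get_vector` standing in for the asker's real getvector (which would do ~1 second of work per image); each worker returns an `(index, vector)` tuple so the parent can slot results back into order as they arrive:

```python
import multiprocessing

def get_vector(i):
    # stand-in for the real getvector: returns the index alongside the
    # "vector" so results can be reordered after imap_unordered
    return i, [float(i)] * 4

def compute_all(n):
    results = [None] * n
    # Pool() defaults to one worker process per CPU core
    with multiprocessing.Pool() as pool:
        # imap_unordered yields each (index, vector) as soon as it is ready,
        # in whatever order the workers finish
        for idx, vec in pool.imap_unordered(get_vector, range(n)):
            results[idx] = vec
    return results

if __name__ == "__main__":
    vectors = compute_all(8)
```

Since the results are re-inserted by index, `results` ends up in the original order even though the workers finish out of order.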

Your function returns a numpy array of 2048 values, which I assume are numpy.float64. Using the standard mapping functions will transport the results back to the parent process via IPC. On a 4-core machine that results in 4 IPC transfers of 2048 * 8 = 16384 bytes each per second, so 65536 bytes/second. That doesn't sound too bad, but I don't know how much overhead the IPC (which involves pickling and queues) will incur.
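The arithmetic above can be checked directly:

```python
# Back-of-envelope IPC volume from the paragraph above
vec_bytes = 2048 * 8           # one vector of 2048 float64 values
per_second = 4 * vec_bytes     # 4 cores each finish one image per second
print(vec_bytes, per_second)   # 16384 65536
```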

In case the overhead is large, you might want to create a shared memory area to store the results in. You would need approximately 1.5 GiB to store 100000 results of 2048 8-byte floats. That is a sizeable amount of memory, but not impractical for current machines.
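One way to sketch that shared-memory idea with the standard library, assuming numpy is available and workers inherit the buffer (as with the fork start method on Linux); the sizes here are scaled down so the example runs quickly, but the real case would be 100000 x 2048:

```python
import multiprocessing
import numpy as np

N_IMAGES = 8   # scaled down from 100000
VEC_LEN = 4    # scaled down from 2048

def worker(i, shared):
    # view the shared buffer as a 2-D float64 array and write row i in
    # place, so no result has to travel back over IPC
    table = np.frombuffer(shared.get_obj(), dtype=np.float64).reshape(N_IMAGES, VEC_LEN)
    table[i, :] = float(i)  # placeholder for the real getvector(...) result

def run():
    # one flat block of N_IMAGES * VEC_LEN doubles, shared by all processes
    shared = multiprocessing.Array('d', N_IMAGES * VEC_LEN)
    procs = [multiprocessing.Process(target=worker, args=(i, shared))
             for i in range(N_IMAGES)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return np.frombuffer(shared.get_obj(), dtype=np.float64).reshape(N_IMAGES, VEC_LEN)
```

A real version would use a pool of workers rather than one process per image; the point is only that each worker writes its row directly into the shared block.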

For 100000 images, 4 cores, and each image taking around one second, your program's running time would be on the order of 7 hours (100000 / 4 = 25000 seconds).

Your most important optimization task would be to look into reducing the runtime of the getvector function itself. For example, would it work just as well if you halved the size of the images? Assuming that the runtime scales linearly with the number of pixels, that should cut the runtime to about 0.25 s per image.
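A toy illustration of where that estimate comes from: halving both image dimensions quarters the pixel count, so a linear-in-pixels runtime drops from ~1 s to ~0.25 s. The strided downsample below is just the crudest possible stand-in; a real pipeline would resize with proper filtering (e.g. PIL's `Image.resize`):

```python
import numpy as np

def halve(img):
    # crude 2x downsample by keeping every other pixel in each axis;
    # the result has one quarter as many pixels as the input
    return img[::2, ::2]

img = np.zeros((64, 64))
print(halve(img).shape)  # (32, 32)
```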
