
python, how to incrementally create Threads

I have a list of approximately 60,000 items, and I would like to send queries to the database to check whether they exist and, if they do, return some computed results. I ran an ordinary query while iterating through the list one by one; that query has now been running for the last four days. I thought I could use the threading module to improve on this, so I did something like this:

import threading

if __name__ == '__main__':
    for ra, dec in candidates:
        t = threading.Thread(target=search_sl, args=(ra, dec, q))
        t.start()
    t.join()

I tested with only 10 items and it worked fine, but when I submitted the whole list of 60k items I ran into errors, i.e. "maximum number of sessions exceeded". What I want to do is create maybe 10 threads at a time; when the first batch of threads has finished executing, I send another batch, and so on.

You could try using a process pool, which is available in the multiprocessing module. Here is the example from the Python docs:

from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    pool = Pool(processes=4)              # start 4 worker processes
    result = pool.apply_async(f, [10])    # evaluate "f(10)" asynchronously
    print result.get(timeout=1)           # prints "100" unless your computer is *very* slow
    print pool.map(f, range(10))          # prints "[0, 1, 4,..., 81]"

http://docs.python.org/library/multiprocessing.html#using-a-pool-of-workers

Try increasing the number of processes until you reach the maximum your system can support.
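Applied to the question, a minimal sketch might look like the following. It assumes search_sl opens its own database connection inside each worker process and returns its result directly, so the queue argument q from the question is dropped (pool.map already collects the return values):

from multiprocessing import Pool

def worker(args):
    ra, dec = args
    return search_sl(ra, dec)      # assumed: search_sl opens its own DB connection

if __name__ == '__main__':
    pool = Pool(processes=10)      # cap concurrent DB sessions at 10
    results = pool.map(worker, candidates)
    pool.close()
    pool.join()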

Improve your queries before threading (premature optimization is the root of all evil!)

Your problem is having 60,000 different queries on a single database. Having a single query for each item means a lot of overhead for opening the connection and invoking a DB cursor session.

Threading those queries can speed up your process, but it yields another set of problems, like DB overload and exceeding the maximum number of allowed sessions.

First approach: Load many item IDs into every query

Instead, try to improve your queries. Can you write a query that sends a long list of products and returns the matches? Perhaps something like:

SELECT *
FROM   items
WHERE  item_id IN (id1, id2, id3, id4, id5, ....)

Python gives you convenient interfaces for this kind of query, so that the IN clause can use a Pythonic list. This way you can break your long list of items into, say, 60 queries with 1,000 IDs each.
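A minimal sketch of that batching, using the generic DB-API (the "?" placeholder style, conn, and item_ids are assumptions; many drivers use "%s" instead):

def chunks(seq, size):
    # yield consecutive slices of `seq`, each at most `size` items long
    for i in xrange(0, len(seq), size):
        yield seq[i:i + size]

cursor = conn.cursor()
results = []
for batch in chunks(item_ids, 1000):
    placeholders = ', '.join('?' for _ in batch)
    cursor.execute('SELECT * FROM items WHERE item_id IN (%s)' % placeholders,
                   batch)
    results.extend(cursor.fetchall())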

Second approach: Use a temporary table

Another interesting approach is creating a temporary table on the database with your item IDs. Temporary tables last as long as the connection lives, so you won't have to worry about cleanup. Perhaps something like:

CREATE TEMPORARY TABLE item_ids_list (id INT PRIMARY KEY);  # Remember indexing!

Insert the ids using an appropriate Python library:

INSERT INTO item_ids_list   ...                # Insert your 60,000 items here

Get your results:

SELECT * FROM items WHERE items.id IN (SELECT id FROM item_ids_list);
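Put together in Python, the whole flow could look something like this sketch (cursor, item_ids, and the "?" placeholder style are assumptions about your driver):

cursor.execute('CREATE TEMPORARY TABLE item_ids_list (id INT PRIMARY KEY)')
cursor.executemany('INSERT INTO item_ids_list (id) VALUES (?)',
                   [(item_id,) for item_id in item_ids])
cursor.execute('SELECT * FROM items '
               'WHERE items.id IN (SELECT id FROM item_ids_list)')
matches = cursor.fetchall()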

First of all, you join only the last thread. There is no guarantee that it will finish last. You should do something like this:

import threading
from time import sleep

delay = 0.5
tlist = [threading.Thread(target=search_sl, args=(ra, dec, q))
         for ra, dec in candidates]
map(lambda t: t.start(), tlist)
while any(map(lambda t: t.isAlive(), tlist)):   # poll until all threads finish
    sleep(delay)

The second issue is that running 60K threads at once requires really huge hardware resources :-) It's better to queue your tasks and then have them processed by a limited number of worker threads. Something like this (I haven't tested the code, but I hope the idea is clear):

from Queue import Queue, Empty
from threading import Thread
from time import sleep

tasks = Queue()
map(tasks.put, candidates)
maxthreads = 50
delay = 0.1

def spawn():
    # take the next (ra, dec) pair and start a worker thread for it;
    # get_nowait() raises Empty once the queue is exhausted
    ra, dec = tasks.get_nowait()
    t = Thread(target=search_sl, args=(ra, dec, q))  # q is the result queue from the question
    t.start()
    return t

threads = []
while not tasks.empty() or threads:
    threads = filter(lambda t: t.isAlive(), threads)  # drop finished workers
    while len(threads) < maxthreads:
        try:
            threads.append(spawn())
        except Empty:
            break
    sleep(delay)

Since this is an IO-bound task, neither threads nor processes are a great fit; you use those when you need to parallelize computational tasks. So, be modern please ™, use something like gevent for parallel IO-intensive tasks.

http://www.gevent.org/intro.html#example
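A minimal sketch along those lines, assuming gevent is installed and that search_sl does blocking socket IO that monkey-patching can make cooperative:

from gevent import monkey
monkey.patch_all()        # make blocking socket IO cooperative

from gevent.pool import Pool

pool = Pool(10)           # at most 10 concurrent greenlets / DB sessions
for ra, dec in candidates:
    pool.spawn(search_sl, ra, dec, q)   # q is the result queue from the question
pool.join()               # wait for all greenlets to finish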
