How can I create threads for two different queues to run in parallel? - Python
I have two queues for different tasks. The first crawl starts crawling the links in the list, then generates more links to crawl on queue one, and also generates new links for a different task on queue two. My program works, but the problem is: when the workers for queue two start running, they stop the workers from queue one. They are basically not running in parallel; they are waiting for each other to finish their tasks. How can I make them run in parallel?
import threading
from queue import Queue

queue = Queue()
queue_two = Queue()

links = ['www.example.com', 'www.example.com', 'www.example.com',
         'www.example.com', 'www.example.com', 'www.example.com',
         'www.example.com', 'www.example.com', 'www.example.com']
new_links = []

def create_workers():
    for _ in range(4):
        t = threading.Thread(target=work)
        t.daemon = True
        t.start()
    for _ in range(2):
        t = threading.Thread(target=work_two)
        t.daemon = True
        t.start()

def work():
    while True:
        work = queue.get()
        # do something
        queue.task_done()

def work_two():
    while True:
        work = queue_two.get()
        # do something
        queue_two.task_done()

def create_jobs():
    for link in links:
        queue.put(link)
    queue.join()
    crawl_two()
    crawl()

def create_jobs_two():
    for link in new_links:
        queue_two.put(link)
    queue_two.join()
    crawl_two()

def crawl():
    queued_links = links
    if len(queued_links) > 0:
        create_jobs()

def crawl_two():
    queued_links = new_links
    if len(queued_links) > 0:
        create_jobs_two()

create_workers()
crawl()
That is because your processing is not parallel between work and work_two. This is what happens:

1. You put tasks into queue
2. You wait with queue.join() until all of them have completed
3. You put tasks into queue_two
4. You wait with queue_two.join() until all of them have completed

Basically you never enter a situation where work and work_two would run in parallel, because you use queue.join() to wait until all of the currently running tasks have finished. Only then do you assign tasks to the "other" queue. work and work_two each run in parallel within themselves, but the control structure ensures work and work_two are mutually exclusive. You need to redesign the loops and queues if you want both of them to run in parallel.
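A minimal sketch of one such redesign (hypothetical names, not the original crawler): start both worker pools first, feed both queues up front, and only then join. Since both pools are already consuming, neither join() stops the other pool from making progress:

```python
import threading
from queue import Queue

queue_one = Queue()
queue_two = Queue()
results = []
results_lock = threading.Lock()

def worker(q, label):
    # Each worker loops forever, pulling tasks from its own queue.
    while True:
        item = q.get()
        with results_lock:
            results.append((label, item))
        q.task_done()

# Start both worker pools *before* queueing any jobs.
for _ in range(4):
    threading.Thread(target=worker, args=(queue_one, "one"), daemon=True).start()
for _ in range(2):
    threading.Thread(target=worker, args=(queue_two, "two"), daemon=True).start()

# Feed both queues up front; the joins below only wait for completion,
# they no longer serialize one pool behind the other.
for link in ["www.example.com"] * 6:
    queue_one.put(link)
for link in ["www.example.com/page"] * 3:
    queue_two.put(link)

queue_one.join()
queue_two.join()
```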
You will probably also want to investigate the use of threading.Lock() to protect your global new_links variable, as I assume you will be appending things to it in your worker threads. That is absolutely fine, but you need a lock to ensure two threads are not trying to do it simultaneously. This is not related to your current problem, though; it only helps you avoid the next one.
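The locking pattern looks like this (a toy sketch; record_link is a hypothetical helper standing in for whatever your workers do):

```python
import threading

new_links = []
links_lock = threading.Lock()

def record_link(link):
    # Any thread that mutates the shared list must hold the lock;
    # "with" acquires it and guarantees release even on exceptions.
    with links_lock:
        new_links.append(link)

threads = [threading.Thread(target=record_link, args=("www.example.com/%d" % i,))
           for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Note that a single list.append is already atomic in CPython thanks to the GIL, but the lock becomes necessary the moment a worker does any compound operation on the list (check-then-append, iterate-and-remove, and so on).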
I of course do not know what you are trying to achieve here, but you might solve your current problem and avoid the next one by scrapping the global new_links completely. What if work and work_two simply fed the other worker's queue whenever they needed to submit tasks to it, instead of putting items into a global variable and then feeding them to the queue in the main thread?
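That idea could be sketched as follows (crawler/parser are illustrative names; the queues correspond to your queue and queue_two):

```python
import threading
from queue import Queue

crawl_queue = Queue()
parse_queue = Queue()
parsed = []
parsed_lock = threading.Lock()

def crawler():
    while True:
        link = crawl_queue.get()
        # Instead of appending to a global list for the main thread
        # to redistribute, hand work straight to the other pool.
        parse_queue.put(link + "/found")
        crawl_queue.task_done()

def parser():
    while True:
        link = parse_queue.get()
        with parsed_lock:
            parsed.append(link)
        parse_queue.task_done()

for _ in range(2):
    threading.Thread(target=crawler, daemon=True).start()
    threading.Thread(target=parser, daemon=True).start()

for i in range(5):
    crawl_queue.put("www.example.com/%d" % i)

# Join in dependency order: once crawl_queue is drained, every
# downstream item has already been put on parse_queue.
crawl_queue.join()
parse_queue.join()
```

Both pools now run concurrently, and no global intermediate list is needed.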
Or you could build an "orchestration" thread that queues tasks to workers, processes the responses coming back from them, and acts on each response accordingly: either queuing it back onto one of the queues or accepting the result if it is "ready".