
python threading Queue producer-consumer with thread safety

I am using threading and Queue to fetch urls and store them to a database.
I just want one thread to do the storing job,
so I wrote the code below:

import threading
import time

import Queue

site_count = 10

fetch_thread_count = 2

site_queue = Queue.Queue()
proxy_array = []


class FetchThread(threading.Thread):
    def __init__(self,site_queue,proxy_array):
        threading.Thread.__init__(self)
        self.site_queue = site_queue
        self.proxy_array = proxy_array
    def run(self):
        while True:
            index = self.site_queue.get()
            self.get_proxy_one_website(index)
            self.site_queue.task_done()
    def get_proxy_one_website(self,index):
        print '{0} fetched site :{1}\n'.format(self.name,index)
        self.proxy_array.append(index)


def save():
    while True:
        if site_queue.qsize() > 0:
            if len(proxy_array) > 10:
                print 'save :{0}  to database\n'.format(proxy_array.pop())

            else:
                time.sleep(1)
        elif len(proxy_array) > 0:
            print 'save :{0} to database\n'.format(proxy_array.pop())

        elif len(proxy_array) == 0:
            print 'break'
            break
        else:
            print 'continue'
            continue

def start_crawl():
    global site_count,fetch_thread_count,site_queue,proxy_array
    print 'init'
    for i in range(fetch_thread_count):
        ft = FetchThread(site_queue,proxy_array)
        ft.setDaemon(True)
        ft.start()

    print 'put site_queue'
    for i in range(site_count):
        site_queue.put(i)

    save()

    print 'start site_queue join'
    site_queue.join()
    print 'finish'

start_crawl()

Executed output:

init
put site_queue
Thread-1 fetched site :0

Thread-2 fetched site :1

Thread-1 fetched site :2

Thread-2 fetched site :3

Thread-1 fetched site :4

Thread-2 fetched site :5

Thread-1 fetched site :6

Thread-2 fetched site :7

Thread-1 fetched site :8

Thread-2 fetched site :9

save :9 to database

save :8 to database

save :7 to database

save :6 to database

save :5 to database

save :4 to database

save :3 to database

save :2 to database

save :1 to database

save :0 to database

break
start site_queue join
finish
[Finished in 1.2s]

Why does save() run only after all the fetching is finished, as if it were called after site_queue.join(), even though it is written before the join?
I also substituted save() with a thread function, but that doesn't work either.
Does this mean I must change proxy_array = [] to proxy_queue = Queue.Queue() before I can use threading to store the data?
I just want one thread to do this, and no other thread would get data from proxy_array, so why should I need a thread-safe queue for it? Using Queue seems very weird.
Is there any better solution?
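
If a queue really is required, something like the minimal sketch below is what I imagine (assuming proxy_queue replaces proxy_array and FetchThread calls proxy_queue.put() instead of append(); I have not verified that this is right):

    proxy_queue = Queue.Queue()

    def save():
        while True:
            try:
                # block for up to 1 second waiting for a fetched result
                data = proxy_queue.get(timeout=1)
            except Queue.Empty:
                # assume nothing more is coming once the queue stays empty
                break
            print 'save :{0} to database'.format(data)
            proxy_queue.task_done()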

UPDATE:
I don't want to wait until all the FetchThreads complete their work. I want to save data while fetching; it would be much faster. I want the result to be something like below (because I use proxy_array.pop(), 'save :0' may appear much later; this is just an example for easy understanding):

Thread-2 fetched site :1

Thread-1 fetched site :2

save :0 to database

Thread-2 fetched site :3

Thread-1 fetched site :4

save :2 to database

save :3 to database

Thread-2 fetched site :5

Thread-1 fetched site :6

save :4 to database
.......

UPDATE2, for anyone who has the same questions as below:

Question:
As I said in the context above, no other thread would get data from proxy_array.
I just cannot imagine why it would break thread-safety.

Answer:
The producer-consumer problem in misha's answer; I understood it after reading it carefully.


Question:
One more question: can the program's main thread play the consumer alongside the FetchThreads (in other words, without creating a StoreThread)?

This is what I cannot figure out; I will update after finding the answer.

I had to come up with something similar, producer-consumer. The producer generates an 'id' and the consumer consumes that id to fetch a url and process it. Here is my skeleton code which does that:


    import Queue
    import random
    import threading
    import time
    import sys

    data_queue = Queue.Queue()
    lock = threading.Lock()

    def gcd(a, b):
        while b != 0:
            a, b = b, a % b
        # when the loop ends b is 0, so the gcd is in a
        return a

    def consumer(idnum):
        while True:
            try:
                data = data_queue.get(block=False)
            except Queue.Empty:
                pass
            else:
                with lock:
                    print('\t consumer %d: computed gcd(%d, %d) = %d' % (idnum, data[0], data[1], gcd(data[0], data[1])))
                # mark the task done only for items actually taken off the queue
                data_queue.task_done()

            time.sleep(1)

    def producer(idnum, count):
        for i in range(count):
            a,b = random.randint(1, sys.maxint), random.randint(1, sys.maxint)
            with lock:
                print('\t producer %d: generated (%d, %d)'% (idnum, a, b))
            data_queue.put((a,b))
            time.sleep(0.5)

    if __name__ == '__main__':
        num_producers = 1
        num_consumers = 2
        num_integer_pairs = 10

        for i in range(num_consumers):
            t = threading.Thread(target=consumer, args=(i,))
            t.daemon = True
            t.start()

        threads = []
        for ii in range(num_producers):
            thread = threading.Thread(target=producer, args=(ii, num_integer_pairs))
            threads.append(thread)
            thread.start()

        # wait for the producers threads to finish
        for thread in threads:
            thread.join()
        print 'done with producer threads'

        # wait till all the jobs are done in the queue
        data_queue.join()

        with lock:
            print 'all consumer threads finished'

        with lock:
            print 'main thread exited'

I recommend you read about the producer-consumer problem. Your producers are the fetch threads. Your consumer is the save function. If I understand correctly, you want the consumer to save the fetched result as soon as it is available. For this to work, the producer and consumer must be able to communicate in some thread-safe way (e.g. a queue).

Basically, you need another queue. It would replace proxy_array. Your save function will look something like this:

while True:
    try:
        data = fetch_data_from_output_queue()
        save_to_database(data)
    except EmptyQueue:
        if stop_flag.is_set():
            # the fetch threads are done and the queue is drained
            break
        time.sleep(1)

This save function will need to run in its own thread. stop_flag is an Event that gets set after you join your fetch threads.
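
For example, the wiring might look like this (a minimal sketch, assuming save is the function above and site_queue is the question's input queue):

stop_flag = threading.Event()

save_thread = threading.Thread(target=save)
save_thread.start()

# ... start the fetch threads and fill site_queue as before ...

site_queue.join()   # block until every site has been fetched
stop_flag.set()     # tell the save thread that no more data will arrive
save_thread.join()  # wait for the remaining saves to complete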

From a high level, your application will look like this:

input_queue = initialize_input_queue()
output_queue = initialize_output_queue()

stop_flag = Event()
create_and_start_save_thread(output_queue) # read from output queue, save to DB
create_and_start_fetch_threads(input_queue, output_queue) # get sites to crawl from input queue, push crawled results to output_queue
join_fetch_threads() # this will block until the fetch threads have gone through everything in the input_queue
stop_flag.set() # this will inform the save thread that we are done
join_save_thread() # wait for all the saving to complete
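
Putting the pieces together, a minimal runnable version of this design could look like the sketch below (my own illustration under the assumptions above, not the original answer's code; crawl_site is a placeholder for the real fetching logic):

import Queue
import threading
import time

input_queue = Queue.Queue()    # sites waiting to be crawled
output_queue = Queue.Queue()   # crawled results waiting to be saved
stop_flag = threading.Event()

def crawl_site(site):
    # placeholder for the real url-fetching logic
    return site

def fetch_worker():
    while True:
        site = input_queue.get()
        output_queue.put(crawl_site(site))
        input_queue.task_done()

def save_worker():
    while True:
        try:
            data = output_queue.get(timeout=1)
        except Queue.Empty:
            if stop_flag.is_set():
                break  # fetching is finished and the queue is drained
            continue
        print 'save :{0} to database'.format(data)

save_thread = threading.Thread(target=save_worker)
save_thread.start()

for i in range(2):         # fetch_thread_count
    t = threading.Thread(target=fetch_worker)
    t.daemon = True
    t.start()

for site in range(10):     # site_count
    input_queue.put(site)

input_queue.join()   # block until every site has been fetched
stop_flag.set()      # inform the save thread that we are done
save_thread.join()   # wait for all the saving to complete
print 'finish'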
