
Appending an item to a list using multiprocessing in Python

I got this block of code:

def get_spain_accomodations():
    pool = Pool()
    links = soup.find_all('a', class_="hotel_name_link url")
    pool.map(get_page_links, links)

    #for a in soup.find_all('a', class_="hotel_name_link url"):
    #    hotel_url = "https://www.booking.com" + a['href'].strip()
    #    hotels_url_list.append(hotel_url)

def get_page_links(link):
    hotel_url = "https://www.booking.com" + link['href'].strip()
    hotels_url_list.append(hotel_url)

For some reason the hotel_url is not being appended to the list. If I try the commented-out loop it actually works, but not with the map() function. I also printed hotel_url for each get_page_links call and it worked. I have no idea what is going on. Below are the function calls.

init_BeautifulSoup()
get_spain_accomodations()
#get_hotels_wifi_rating()

for link in hotels_url_list:
    print link

The code executes without errors, but the link list is never printed.

It's important to understand that processes run in isolated areas of memory. Each process will have its own instance of hotels_url_list, and there is no (easy) way of "sticking" those values into the parent process's list: if you create an instance of list in the parent process, that instance is not the same one the subprocesses use. When you .fork() (i.e. create a subprocess), the memory of the parent process is cloned into the child process. So if the parent had a list instance in the hotels_url_list variable, the child process will also have a list instance (also called hotels_url_list), but they will not be the same object (they occupy different areas of memory).
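To see the isolation concretely, here is a minimal sketch of my own (not your code; the name just mirrors hotels_url_list): the child appends to its copy of the list, and the parent's copy stays empty.

from multiprocessing import Process

hotels_url_list = []  # module-level list, analogous to yours

def append_in_child(url):
    # Runs in the child process: it only modifies the child's own copy.
    hotels_url_list.append(url)
    print("child sees: %s" % hotels_url_list)

if __name__ == "__main__":
    p = Process(target=append_in_child, args=("https://www.booking.com/example",))
    p.start()
    p.join()
    # The parent's list is untouched, because the child worked on a clone.
    print("parent sees: %s" % hotels_url_list)  # prints: parent sees: []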

This doesn't happen with Threads. They do share memory.
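For contrast, a quick sketch (again mine, with placeholder hrefs) where a thread appends to the very same list object the main thread sees:

import threading

hotels_url_list = []  # shared between the main thread and the worker threads

def get_page_links(link):
    # Same list object as in the main thread, so this append is visible there.
    hotels_url_list.append("https://www.booking.com" + link)

threads = [threading.Thread(target=get_page_links, args=(href,))
           for href in ["/hotel/es/foo.html", "/hotel/es/bar.html"]]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("Collected: %s" % hotels_url_list)  # both URLs show up (order may vary)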

I would say (it's not like I'm much of an expert here) that the canonical way of communicating between processes in this case would be a Queue: the child processes put things in the queue and the parent process grabs them:

from multiprocessing import Process, Queue


def get_spain_accomodations():
    q = Queue()
    processes = []
    links = ['http://foo.com', 'http://bar.com', 'http://baz.com']
    hotels_url_list = []
    for link in links:
        # Start one worker process per link; each worker gets its own memory,
        # but they all share the Queue for sending results back.
        p = Process(target=get_page_links, args=(link, q,))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
        # Collect the result each worker put on the queue.
        hotels_url_list.append(q.get())
    print("Collected: %s" % hotels_url_list)


def get_page_links(link, q):
    print("link==%s" % link)
    hotel_url = "https://www.booking.com" + link
    q.put(hotel_url)


if __name__ == "__main__":
    get_spain_accomodations()

This outputs each link prefixed with https://www.booking.com, the prefixing happening on independent processes:

link==http://foo.com
link==http://bar.com
link==http://baz.com
Collected: ['https://www.booking.comhttp://foo.com', 'https://www.booking.comhttp://bar.com', 'https://www.booking.comhttp://baz.com']

I don't know if it will help you, but to me it helps to see the Queue as a "shared file" that both processes know about. Imagine you have two completely different programs, and one of them knows it has to write things into a file called /tmp/foobar.txt while the other one knows it has to read from a file called /tmp/foobar.txt. That way they can "communicate" with each other. This paragraph is just a metaphor (although that's pretty much how Unix pipes work)... It's not like queues work exactly like that, but maybe it helps with understanding the concept? Dunno, really, maybe I made it more confusing...

Another way would be using Threads and collecting their return values, as explained here.
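For reference, one way to collect return values is a thread pool whose map() hands them back directly. This is a sketch of my own (assuming Python 3's concurrent.futures; the hrefs are placeholders), not the code from the linked answer:

from concurrent.futures import ThreadPoolExecutor

def get_page_link(link):
    # Runs in a worker thread; the return value is handed back by map().
    return "https://www.booking.com" + link

links = ["/hotel/es/foo.html", "/hotel/es/bar.html"]
with ThreadPoolExecutor(max_workers=4) as executor:
    hotels_url_list = list(executor.map(get_page_link, links))
print("Collected: %s" % hotels_url_list)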
