How to use a multiprocessing Pool with pymongo in Python 2.7

Question

I am using pymongo together with a multiprocessing Pool to run 10 processes that fetch data from an API and insert the output into MongoDB.

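I think something is wrong with the way I wrote the code, because about twice as many connections are opened as I would expect. For example, if I run 10 processes, MongoDB reports 20 or more established connections, and I get the following warning at startup: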

UserWarning: MongoClient opened before fork. Create MongoClient with connect=False, or create client after forking. See PyMongo's documentation for details: http://api.mongodb.org/python/current/faq.html#using-pymongo-with-multiprocessing

This happens even though I pass connect=False to the MongoClient constructor. Here is sample code showing how I use pymongo and requests to call the API from the pool:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import json # to decode and encode json
import requests # web POST and GET requests. 
from pymongo import MongoClient # the mongo driver / connector
from bson import ObjectId # to generate bson object for MongoDB
from multiprocessing import Pool # pool of worker processes

# Create the mongoDB Database object, declare collections
client = MongoClient('mongodb://192.168.0.1:27017,192.168.0.2:27017/?replicaSet=rs0', maxPoolSize=20, connect=False)
index = client.database.index
users = client.database.users

def get_user(userid):

    params = {"userid":userid}
    r = requests.get("https://exampleapi.com/getUser",params=params)
    j = json.loads(r.content)
    return j

def process(index_line):

    user = get_user(index_line["userid"])
    if(user):
        users.insert(user)

def main():

    # limit to 100,000 lines of data each loop
    limited = 100
    # skip number of lines for the loop (getting updated)
    skipped = 0
    while True:
        # get cursor with data from index collection
        cursor = index.find({},no_cursor_timeout=True).skip(skipped).limit(limited)
        # prepare the pool with threads
        p = Pool(10)
        # start multiprocessing the pool with the dataset
        p.map(process, cursor)
        # after pool finished, kill it with fire
        p.close()
        p.terminate()
        p.join()
        # after finishing the 100k lines, go for another round, infinite.
        skipped = skipped + limited
        print "[-] Skipping %s " % skipped

if __name__ == '__main__':
    main()

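Is there anything wrong with the algorithm in my code? Is there a way to make it more efficient, work better, and give me better control over my pool?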

I have been researching this for quite a while but could not find a better way to do what I want, so I would appreciate some help.

Thanks.

Answer 1

It is recommended to create a MongoClient once per process, rather than sharing the same client across processes.

This is because MongoClient manages its own connection pool to handle multiple connections within a process, and it is not fork-safe.
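The "one client per process" idea can be sketched without a database at all. In the snippet below, FakeClient is a hypothetical stand-in for MongoClient (it only remembers the PID of the process that built it), and init_worker, handle and run_demo are illustrative names that do not come from the question. The only real API relied on is the initializer argument of multiprocessing.Pool, which runs a function once in every worker after it has been started.

```python
import os
from multiprocessing import Pool


class FakeClient(object):
    """Hypothetical stand-in for MongoClient: records which process built it."""

    def __init__(self):
        self.owner_pid = os.getpid()


_client = None  # one client per worker process, set by init_worker()


def init_worker():
    # Pool calls this once in each worker process, after the worker has started,
    # so the client is never shared across a fork.
    global _client
    _client = FakeClient()


def handle(task):
    # True when the client in use was created by the process running this task.
    return _client.owner_pid == os.getpid()


def run_demo(workers=4, tasks=20):
    pool = Pool(workers, initializer=init_worker)
    try:
        return pool.map(handle, range(tasks))
    finally:
        pool.close()
        pool.join()


if __name__ == "__main__":
    print(all(run_demo()))
```

With a real MongoClient in place of FakeClient, each worker would own its own connection pool, so the number of connections should scale with the number of workers rather than with the number of tasks.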

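First, you want to make sure the while loop breaks once every document in the collection being processed has been consumed. It is not a very refined condition, but you can break out of the loop as soon as skipped reaches the number of documents.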
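As a rough illustration of that stopping condition, the sketch below pages through a plain Python list the way find().skip(skipped).limit(limited) pages through a collection. The batches helper is hypothetical and exists only for this example. It stops when skipped is no longer smaller than the count; a strict "greater than" test would issue one extra, empty query whenever the count is an exact multiple of the batch size.

```python
def batches(documents, limited):
    """Yield successive slices, mimicking find().skip(skipped).limit(limited)."""
    count = len(documents)
    skipped = 0
    while skipped < count:  # stop once every document has been handed out
        yield documents[skipped:skipped + limited]
        skipped += limited


if __name__ == "__main__":
    for batch in batches(list(range(250)), 100):
        print(len(batch))
```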

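Second, initialize the process Pool outside the loop and call map inside the loop. multiprocessing.Pool.map blocks until the child processes have finished and returned, so there is no need to join the pool after every batch; note that calling join() on a pool that has not been closed or terminated raises an error. If you want to run the child processes asynchronously, consider multiprocessing.Pool.map_async.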
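A minimal sketch of both variants, assuming a trivial square function in place of the real API call and insert: run_batches creates the pool once and reuses it for every batch, and run_async shows map_async, which returns immediately and hands back a result object to collect later. The function names are illustrative only.

```python
from multiprocessing import Pool


def square(n):
    # Placeholder for the real per-document work.
    return n * n


def run_batches(batches, workers=4):
    pool = Pool(workers)  # created once, reused for every batch
    try:
        results = []
        for batch in batches:
            # map() blocks until the whole batch has been processed.
            results.append(pool.map(square, batch))
        return results
    finally:
        pool.close()  # no more tasks will be submitted
        pool.join()   # wait for the workers to exit


def run_async(batch, workers=4):
    pool = Pool(workers)
    try:
        pending = pool.map_async(square, batch)  # returns immediately
        return pending.get(timeout=30)           # collect the results when needed
    finally:
        pool.close()
        pool.join()


if __name__ == "__main__":
    print(run_batches([[1, 2, 3], [4, 5]]))
    print(run_async([1, 2, 3]))
```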

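You could also implement this more explicitly with multiprocessing.Queue and dedicated producer and consumer processes. The producer process adds tasks to the queue, and the consumer processes execute them. The benefit of implementing the solution this way is not very clear, though, because the multiprocessing library already uses queues internally.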
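For comparison, here is one way the explicit producer/consumer layout might look. It is a sketch under assumptions: the task is just doubling a number (a placeholder for fetching a user and inserting it), None is used as a stop marker so real tasks must not be None, and run_pipeline is a hypothetical helper name.

```python
from multiprocessing import Process, Queue

SENTINEL = None  # tells a consumer to stop; real tasks must not be None


def producer(tasks, queue, n_consumers):
    for task in tasks:
        queue.put(task)
    for _ in range(n_consumers):
        queue.put(SENTINEL)  # one stop marker per consumer


def consumer(queue, results):
    while True:
        task = queue.get()
        if task is SENTINEL:
            break
        results.put(task * 2)  # placeholder for the fetch + insert step


def run_pipeline(tasks, n_consumers=3):
    tasks = list(tasks)
    queue, results = Queue(), Queue()
    workers = [Process(target=consumer, args=(queue, results))
               for _ in range(n_consumers)]
    for worker in workers:
        worker.start()
    feeder = Process(target=producer, args=(tasks, queue, n_consumers))
    feeder.start()
    # Drain the results before joining: a process that still has queued
    # items to flush may not exit until they are consumed.
    out = [results.get(timeout=30) for _ in tasks]
    feeder.join()
    for worker in workers:
        worker.join()
    return sorted(out)


if __name__ == "__main__":
    print(run_pipeline(range(10)))
```

This is noticeably more code than Pool.map for the same result, which is consistent with the remark above that the benefit is unclear.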

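import requests # web POST and GET requests
from pymongo import MongoClient # the mongo driver / connector
from multiprocessing import Pool # pool of worker processes

# Per-process handle to the users collection, set by init_worker()
users = None


def get_user(userid):
    params = {"userid": userid}
    rv = requests.get("https://exampleapi.com/getUser", params=params)
    # Return the decoded JSON body, as the question's json.loads(r.content) did
    return rv.json()


def create_connect():
    return MongoClient(
       'mongodb://192.168.0.1:27017,192.168.0.2:27017/?replicaSet=rs0', maxPoolSize=20
    )

def init_worker():
    # Runs once in every worker process, so each process owns one client
    global users
    users = create_connect().database.users

def consumer(index_line):
    # The question's index documents carry the API key in "userid"
    user = get_user(index_line["userid"])
    if user:
        users.insert_one(user)

def main():

    # number of documents fetched per batch
    limited = 100
    # number of documents to skip (updated after every batch)
    skipped = 0
    # start the workers before the parent opens its own client
    pool = Pool(10, initializer=init_worker)
    client = create_connect()
    index = client.database.index

    # count() is deprecated in newer PyMongo; count_documents({}) replaces it
    count = index.count()

    # stop once every document has been handed to the pool
    while skipped < count:

        cursor = index.find({}).skip(skipped).limit(limited)

        # blocks until the whole batch has been processed
        pool.map(consumer, cursor)

        skipped = skipped + limited
        print("[-] Skipping {}".format(skipped))

    # shut the pool down cleanly once all batches are done
    pool.close()
    pool.join()

if __name__ == '__main__':
    main()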


Answer 1: 3, accepted, 2018-01-02 14:06:46
