
PyMongo: How to do bulk update of huge JSON data in MongoDB

I am pulling JSON data from an API, and the output looks like this:

[[{'employeeId': 1, 'lastName': 'Smith'}, {'employeeId': 2, 'lastName': 'Flores'}]]

There are roughly 250K objects in the list. I am able to iterate over the objects in the list and perform an update_one through PyMongo this way:

import json

# json_list is the raw API response; collection is an existing PyMongo collection
json_this = json.dumps(json_list[0])
json_that = json.loads(json_this)
for x in json_that:
    # One round trip to MongoDB per employee record
    collection.update_one({"employeeId": x['employeeId']}, {"$set": x}, upsert=True)

But with 250K records this takes a very long time. I am trying to use update_many but cannot figure out how to correctly convert/format this JSON list for the update_many function. Any guidance would be greatly appreciated.

Updating/inserting 250K documents can be a heavy task for the database, and you cannot use update_many here because the filter query and the update values change from one dictionary to the next. With the query below you can at least avoid hitting the database many times, though I am not sure how well it suits your scenario. Note that I am a beginner in Python and this is basic code meant to give you an idea:

The best you can do for bulk operations is PyMongo's bulk_write; because of the limits on .bulkWrite(), we split the 250K records into chunks:

from pprint import pprint
import json

from pymongo import UpdateOne

json_this = json.dumps(json_list[0])
json_that = json.loads(json_this)

primaryBulkArr = []
secondaryBulkArr = []
thirdBulkArr = []

## Here we're splicing the 250K records into 3 arrays, in case we want to finish one chunk at a time.
## No need to splice all at once - finish end-to-end for one chunk & restart the process for the next
## chunk from the index of the list where you left off previously.
for index, x in enumerate(json_that):
    if index < 90000:
        primaryBulkArr.append(
            UpdateOne({"employeeId": x['employeeId']}, {'$set': x}, upsert=True))
    elif index < 180000:
        secondaryBulkArr.append(
            UpdateOne({"employeeId": x['employeeId']}, {'$set': x}, upsert=True))
    else:
        thirdBulkArr.append(
            UpdateOne({"employeeId": x['employeeId']}, {'$set': x}, upsert=True))

## The reason I've spliced into 3 arrays is that you may be able to run the calls below in parallel
## if your DB & application servers can take it. Either way, irrespective of the time taken, only
## 3 DB calls are needed & this bulk op is much more efficient.
for bulkArr in (primaryBulkArr, secondaryBulkArr, thirdBulkArr):
    try:
        result = collection.bulk_write(bulkArr)
        ## result = collection.bulk_write(bulkArr, ordered=False)
        ## Opt for the above if you want all dictionaries to be processed even though
        ## an error occurred in between for one of them.
        pprint(result.bulk_api_result)
    except Exception as e:
        print("An exception occurred ::", e)  ## Get the ids that failed, if any, & retry
