PyMongo: How to do bulk update of huge JSON data in MongoDB
I am pulling JSON data from an API, and the output looks like this:
[[{'employeeId': 1, 'lastName': 'Smith'}, {'employeeId': 2, 'lastName': 'Flores'}]]
The list contains about 250,000 objects. I can iterate over the objects in the list and run update_one through PyMongo like this:
json_this = json.dumps(json_list[0])
json_that = json.loads(json_this)
for x in json_that:
    collection.update_one({"employeeId": x['employeeId']}, {"$set": x}, upsert=True)
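But with 250,000 records this takes a very long time. I am trying to use update_many, but I can't figure out how to correctly convert/format this JSON list to work with the update_many function. Any guidance would be much appreciated.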
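For context on why update_many is an awkward fit here: it takes a single filter and a single update document and applies that one update to every matching document, whereas each record in this list needs its own filter and its own values. A small illustration in plain Python, with no database involved (the variable names are only illustrative):

```python
records = [
    {"employeeId": 1, "lastName": "Smith"},
    {"employeeId": 2, "lastName": "Flores"},
]

# One (filter, update) pair per record, mirroring the update_one call above.
pairs = [({"employeeId": r["employeeId"]}, {"$set": r}) for r in records]

for query_filter, update in pairs:
    print(query_filter, update)

# Every pair is different, so no single filter/update covers all records.
all_same = all(pair == pairs[0] for pair in pairs)
print(all_same)
```

Because the pairs differ, the work has to be expressed as many individual operations, which is the case a bulk write is meant for.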
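Updating/upserting 250K documents into a database can be a heavy task. You cannot use update_many, because the filter query and the update values change from one dictionary to the next. With the code below you can at least avoid making a separate database call for every record, though I'm not sure how well it will work for your scenario. Please note that I'm a beginner in Python, and this is basic code to give you an idea:
The best thing you can do for bulk operations is PyMongo's bulk write API (bulk_write). Because of the limitations of .bulkWrite(), we split the 250K records into chunks: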
import json
from pprint import pprint

from pymongo import UpdateOne
from pymongo.errors import BulkWriteError

json_this = json.dumps(json_list[0])
json_that = json.loads(json_this)

primaryBulkArr = []
secondaryBulkArr = []
thirdBulkArr = []

## Here we're splitting the 250K records into 3 arrays, in case we want to finish one chunk at a time.
# No need to split everything at once: finish one chunk end to end, then restart the process
# for the next chunk from the index of the list where you left off.
for index, x in enumerate(json_that):
    op = UpdateOne({"employeeId": x['employeeId']}, {'$set': x}, upsert=True)
    if index < 90000:
        primaryBulkArr.append(op)
    elif index < 180000:
        secondaryBulkArr.append(op)
    else:
        thirdBulkArr.append(op)

## The reason I've split into 3 arrays is that you may be able to run the code below in parallel,
# if your DB & application servers can take it.
# At the end of the day, irrespective of time taken, only 3 bulk_write calls are needed,
# and this bulk op is much more efficient than one call per record.
for bulkArr in (primaryBulkArr, secondaryBulkArr, thirdBulkArr):
    if not bulkArr:
        continue  # bulk_write raises InvalidOperation on an empty list
    try:
        result = collection.bulk_write(bulkArr)
        ## result = collection.bulk_write(bulkArr, ordered=False)
        # Opt for the above if you want to proceed with all dictionaries,
        # even though an error occurred in between for one dict
        pprint(result.bulk_api_result)
    except BulkWriteError as e:
        print("An exception occurred ::", e.details)  ## Get the ids that failed, if any, & retry
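The same idea can be written without hard-coding three arrays. The following is only a sketch under assumptions, not the original answer's method: the helper names chunked and upsert_in_chunks, the chunk_size default of 1000, and the key parameter are all made up for illustration, and collection is assumed to be a PyMongo Collection object.

```python
from itertools import islice


def chunked(records, size):
    """Yield successive lists of at most `size` items from `records`."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def upsert_in_chunks(collection, records, chunk_size=1000, key="employeeId"):
    """Upsert `records` into `collection`, one bulk_write call per chunk.

    `collection` is assumed to be a pymongo Collection; `chunk_size` and
    `key` are illustrative defaults, not values from the original answer.
    """
    # Imported here so that chunked() stays usable without pymongo installed.
    from pymongo import UpdateOne

    processed = 0
    for chunk in chunked(records, chunk_size):
        ops = [UpdateOne({key: doc[key]}, {"$set": doc}, upsert=True) for doc in chunk]
        # ordered=False: every operation in the chunk is attempted even if one
        # fails; failures are still reported by raising BulkWriteError.
        result = collection.bulk_write(ops, ordered=False)
        processed += result.matched_count + result.upserted_count
    return processed
```

With the data from the question, a call might look like upsert_in_chunks(collection, json_list[0]). A smaller chunk size means more round trips but smaller requests, so the right value depends on document size and server capacity.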