Efficient way to iterate over a cursor object with Pymongo / MongoDB

Question

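I pre-filter 3 collections and build a new collection from them. To do this, I iterate over the cursor objects like this (monate, wochen, and tage are just lists containing the relevant datetime objects):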

monate_final = collection1.find({"NewDate": {"$in": list(monate)}})
wochen_final = collection2.find({"NewDate": {"$in": list(wochen)}})
tage_final = collection3.find({"NewDate": {"$in": list(tage)}})

master_list = [monate_final, wochen_final, tage_final]

for collection in master_list:
    for document in collection:
        self.target.insert_one(document)

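The code works, but because the final collections contain more than 100 million records, the process simply takes a very long time. I have not found a more efficient way to do this. Because of memory limits, building a pandas DataFrame and then using insert_many() does not work. Can anyone help me?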

Answer 1

Following this example, use bulk operations and split the bulk writes into chunks of 50,000. You can use a counter (possibly faster) or just check len(updates):

from pymongo import InsertOne
updates = []
counter = 0

for collection in master_list:
    for document in collection:
        updates.append(InsertOne(document))
        counter += 1

        if counter >= 50000:
            self.target.bulk_write(updates)
            counter = 0
            updates = []

# Write the remaining items after all cursors have been exhausted
if len(updates) != 0:
    self.target.bulk_write(updates)
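
The same chunking idea can also be expressed with insert_many(), which avoids the memory problem from the question because only one batch is held in memory at a time. The following is a minimal sketch, not a measured improvement: the helper names batched and copy_in_batches are made up for illustration, and target is assumed to be a pymongo collection (anything with an insert_many(documents, ordered=...) method will do).

```python
from itertools import chain, islice


def batched(iterable, size):
    """Yield lists of at most `size` items from any iterable."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def copy_in_batches(cursors, target, batch_size=50_000):
    """Insert every document from `cursors` into `target`, one batch at a time.

    `copy_in_batches` is a hypothetical helper, not part of pymongo.
    Returns the number of documents handed to insert_many().
    """
    total = 0
    # chain.from_iterable walks the cursors one after another, lazily.
    for batch in batched(chain.from_iterable(cursors), batch_size):
        # ordered=False lets the server keep going after a failed insert;
        # pymongo still raises an error for the failures afterwards.
        target.insert_many(batch, ordered=False)
        total += len(batch)
    return total
```

With the names from the question, the call would be copy_in_batches(master_list, self.target). Whether this is faster than bulk_write() with InsertOne operations would need to be measured on the real data.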

Answer 2

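I don't have enough reputation to comment.

I have not verified this, but you could create an aggregation pipeline to filter your collections. The last stage in the pipeline would be the $out operator, which saves all documents from the pipeline into a new collection.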

https://docs.mongodb.com/manual/reference/operator/aggregation/out/

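I don't have deep insight into MongoDB's internals, but I would expect the pipeline to run entirely on the database side, which should greatly improve the performance of the operation.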
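
A rough sketch of what such a pipeline could look like from Pymongo follows. One caveat: $out replaces the contents of the output collection on every run, so with three source collections writing into one target, the sketch uses $merge instead (available from MongoDB 4.2). That swap is an assumption of this example, not something the answer proposes. The helper name build_merge_pipeline and the target collection name "target" are hypothetical.

```python
def build_merge_pipeline(dates, target_name):
    """Build a pipeline that filters on NewDate and writes matches to `target_name`.

    Hypothetical helper; it only builds the list of stages and does not
    talk to the database.
    """
    return [
        # Same filter as the find() calls in the question.
        {"$match": {"NewDate": {"$in": list(dates)}}},
        # $merge appends to the target instead of replacing it like $out.
        # "fail" aborts on an existing _id, similar to a duplicate-key
        # error from insert_one().
        {
            "$merge": {
                "into": target_name,
                "whenMatched": "fail",
                "whenNotMatched": "insert",
            }
        },
    ]


# Usage with the names from the question ("target" is an assumed name):
# for source, dates in ((collection1, monate),
#                       (collection2, wochen),
#                       (collection3, tage)):
#     source.aggregate(build_merge_pipeline(dates, "target"))
```

Because the documents never travel to the Python process, the filtering and writing happen on the server. Note that a very large $in list still has to be sent with each aggregate command.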

Good luck solving this problem (and don't forget to post your solution for others :D)

Answer 1
Score: 1, accepted, 2020-06-03 10:30:32

Answer 2
Score: 0, 2020-06-03 06:56:35