
How to bulk upsert JSON files to Azure Cosmos DB (documentDB module)?

I am using Python to update a lot of data files with new observations, using the documentDB module. I have to upsert 100-200 JSON files per minute, and the upsert operation takes up far more time than the rest of the program. Right now I'm using the UpsertDocument function from the DocumentClient in the module. Is there a faster/better way?

You can use a stored procedure for the bulk upsert operation:

function bulkimport2(docObject) {
    var collection = getContext().getCollection();
    var collectionLink = collection.getSelfLink();

    // The count of upserted docs, also used as the index of the current doc.
    var count = 0;

    // Validate input.
    if (!docObject || !docObject.items || !docObject.items.length) {
        getContext().getResponse().setBody(0);
        return;
    }

    // The items may contain stray escaped CR/LF sequences; strip them and re-parse.
    var items = JSON.stringify(docObject.items);
    items = items.replace(/\\r/g, "").replace(/\\n/g, "");
    var docs = JSON.parse(items);
    var docsLength = docs.length;

    // Start the chain by upserting the first document.
    tryCreate(docs[count], callback);

    // Note that there are 2 exit conditions:
    // 1) The upsertDocument request was not accepted.
    //    In this case the callback will not be called; we just set the body to the
    //    current count and we are done.
    // 2) The callback was called docs.length times.
    //    In this case all documents were upserted and we don't need to call
    //    tryCreate anymore. Just set the body and we are done.
    function tryCreate(doc, callback) {
        if (typeof doc === "undefined") {
            getContext().getResponse().setBody(count);
            return;
        }

        var isAccepted = collection.upsertDocument(collectionLink, doc, callback);

        // If the request was accepted, the callback will be called.
        // Otherwise report the current count back to the client, which can call
        // the script again with the remaining set of docs. This happens when the
        // stored procedure has been running too long and is about to be cancelled
        // by the server; it allows the calling client to resume this batch from
        // the point reached before isAccepted became false.
        if (!isAccepted) {
            getContext().getResponse().setBody(count);
        }
    }

    // This is called when collection.upsertDocument is done and the document has been persisted.
    function callback(err, doc, options) {
        if (err) throw err;

        // One more document has been upserted; increment the count.
        count++;

        if (count >= docsLength) {
            // All documents have been upserted; we are done. Just set the response.
            getContext().getResponse().setBody(count);
        } else {
            // Upsert the next document.
            tryCreate(docs[count], callback);
        }
    }
}

You can then load this stored procedure in Python and execute it. Please note that executing a stored procedure requires a partition key.
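A minimal sketch of that call with the documentDB (pydocumentdb) module might look like the following; the endpoint, key, database, collection, and partition key value are placeholders, and it assumes the stored procedure above has already been created on the collection as bulkimport2:

import json
import pydocumentdb.document_client as document_client

client = document_client.DocumentClient(
    'https://<account>.documents.azure.com:443/',   # placeholder endpoint
    {'masterKey': '<key>'})                         # placeholder key

# Link to the stored procedure created above (placeholder database/collection names).
sproc_link = 'dbs/mydb/colls/mycoll/sprocs/bulkimport2'

# Documents to upsert; a stored procedure runs inside a single partition,
# so every document in one call must share the same partition key value.
json_paths = ['obs1.json', 'obs2.json']             # placeholder file list
docs = []
for path in json_paths:
    with open(path) as f:
        docs.append(json.load(f))

result = client.ExecuteStoredProcedure(
    sproc_link,
    [{'items': docs}],                              # maps to the docObject parameter
    {'partitionKey': '<partition key value>'})      # placeholder partition key

# The response body is the number of documents upserted; if it is smaller than
# len(docs), call the procedure again with the remaining documents.
print(result)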

Hope it helps.

One option would be to use the Cosmos DB Spark connector instead, and optionally (and conveniently) run it as a job in Azure Databricks. This gives you a significant amount of control over your throughput and makes it easy to find the optimal balance between parallelism (which I think is the issue) and RU capacity on the Cosmos DB side.
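If it helps as a starting point, here is a rough PySpark sketch of an upsert write through the azure-cosmosdb-spark connector (run in a Databricks notebook where spark is already defined; the endpoint, key, database, collection, and input path are placeholders, and option names may differ slightly between connector versions):

# Read the JSON files into a DataFrame (placeholder path).
df = spark.read.json('/mnt/data/observations/*.json')

# Write configuration for the Cosmos DB Spark connector (all values are placeholders).
write_config = {
    'Endpoint': 'https://<account>.documents.azure.com:443/',
    'Masterkey': '<key>',
    'Database': 'mydb',
    'Collection': 'mycoll',
    'Upsert': 'true'
}

(df.write
   .format('com.microsoft.azure.cosmosdb.spark')
   .mode('append')
   .options(**write_config)
   .save())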

Here's a simple example of measurements taken while loading 118K documents, using a minimum-spec Databricks cluster with just 1 worker.

Single Cosmos client in Python: 28 docs/sec @ 236 RUs (i.e. not pushing Cosmos at all)

Spark Cosmos DB adapter: 66 docs/sec @ >400 RUs (throttled due to the 400 RU limit)

...after bumping Cosmos DB up to 10K RUs: Spark Cosmos DB adapter, 1317 docs/sec @ >2.9K RUs (I don't think it ran long enough for accurate RU figures) - still the same minimum-spec cluster

You could also try Python multi-threading (I think it will help), and as CYMA said in the comments, you should be checking for throttling at Cosmos DB. My observation, though, is that a single Cosmos client isn't going to get you even to the minimum 400 RUs.
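As a rough sketch of the multi-threading idea (placeholder endpoint, key, collection link, and file list; it assumes the DocumentClient can be shared across threads - otherwise create one client per worker):

import json
from concurrent.futures import ThreadPoolExecutor
import pydocumentdb.document_client as document_client

client = document_client.DocumentClient(
    'https://<account>.documents.azure.com:443/',   # placeholder endpoint
    {'masterKey': '<key>'})                         # placeholder key
coll_link = 'dbs/mydb/colls/mycoll'                 # placeholder collection link

def upsert_file(path):
    # One upsert per JSON file; HTTP 429 (throttling) errors surface here,
    # which is where you would check for and handle rate limiting.
    with open(path) as f:
        return client.UpsertDocument(coll_link, json.load(f))

json_paths = ['obs1.json', 'obs2.json']             # placeholder file list
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(upsert_file, json_paths))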
