
How to bulk upsert JSON files to Azure Cosmos DB (documentDB module)?

I am using Python to update a lot of data files with new observations, using the documentDB module. I have to upsert 100-200 JSON files per minute, and the upsert operation takes up far more time than the rest of the program. Right now I'm using the UpsertDocument function from the DocumentClient in the module. Is there a faster/better way?

You can use a stored procedure for the bulk upsert operation:

function bulkimport2(docObject) {
    var collection = getContext().getCollection();
    var collectionLink = collection.getSelfLink();

    // The count of upserted docs, also used as the index of the current doc.
    var count = 0;

    // Validate input.
    if (!docObject || !docObject.items || !docObject.items.length) {
        getContext().getResponse().setBody(0);
        return;
    }

    // The items may contain stray escaped CR/LF sequences; strip them and re-parse.
    var items = JSON.stringify(docObject.items);
    items = items.replace(/\\r/g, "").replace(/\\n/g, "");
    var docs = JSON.parse(items);
    var docsLength = docs.length;

    // Start the chain by upserting the first document.
    tryCreate(docs[count], callback);

    // Note that there are 2 exit conditions:
    // 1) The upsertDocument request was not accepted.
    //    In this case the callback will not be called; we just set the body to the
    //    current count and we are done.
    // 2) The callback was called docs.length times.
    //    In this case all documents were upserted and we don't need to call
    //    tryCreate anymore. Just set the body and we are done.
    function tryCreate(doc, callback) {
        if (typeof doc === "undefined") {
            getContext().getResponse().setBody(count);
            return;
        }

        var isAccepted = collection.upsertDocument(collectionLink, doc, callback);

        // If the request was accepted, the callback will be called.
        // Otherwise report the current count back to the client, which can call
        // the script again with the remaining set of docs. This happens when the
        // stored procedure has been running too long and is about to be cancelled
        // by the server; it allows the calling client to resume this batch from
        // the point reached before isAccepted became false.
        if (!isAccepted) {
            getContext().getResponse().setBody(count);
        }
    }

    // This is called when collection.upsertDocument is done and the document has been persisted.
    function callback(err, doc, options) {
        if (err) throw err;

        // One more document has been upserted; increment the count.
        count++;

        if (count >= docsLength) {
            // All documents have been upserted; we are done. Just set the response.
            getContext().getResponse().setBody(count);
        } else {
            // Upsert the next document.
            tryCreate(docs[count], callback);
        }
    }
}

You can then load this stored procedure in Python and execute it. Please note that executing a stored procedure requires a partition key.
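A minimal sketch of that call with the documentDB (pydocumentdb) module might look like the following; the endpoint, key, database, collection, and partition key value are placeholders, and it assumes the stored procedure above has already been created on the collection as bulkimport2:

import json
import pydocumentdb.document_client as document_client

client = document_client.DocumentClient(
    'https://<account>.documents.azure.com:443/',   # placeholder endpoint
    {'masterKey': '<key>'})                         # placeholder key

# Link to the stored procedure created above (placeholder database/collection names).
sproc_link = 'dbs/mydb/colls/mycoll/sprocs/bulkimport2'

# Documents to upsert; a stored procedure runs inside a single partition,
# so every document in one call must share the same partition key value.
json_paths = ['obs1.json', 'obs2.json']             # placeholder file list
docs = []
for path in json_paths:
    with open(path) as f:
        docs.append(json.load(f))

result = client.ExecuteStoredProcedure(
    sproc_link,
    [{'items': docs}],                              # maps to the docObject parameter
    {'partitionKey': '<partition key value>'})      # placeholder partition key

# The response body is the number of documents upserted; if it is smaller than
# len(docs), call the procedure again with the remaining documents.
print(result)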

Hope it helps.

One option would be to use the Cosmos DB Spark connector instead, and optionally (and conveniently) run it as a job in Azure Databricks. This gives you a significant amount of control over your throughput and makes it easy to find the optimal balance between parallelism (which I think is the issue) and RU capacity on the Cosmos DB side.
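If it helps as a starting point, here is a rough PySpark sketch of an upsert write through the azure-cosmosdb-spark connector (run in a Databricks notebook where spark is already defined; the endpoint, key, database, collection, and input path are placeholders, and option names may differ slightly between connector versions):

# Read the JSON files into a DataFrame (placeholder path).
df = spark.read.json('/mnt/data/observations/*.json')

# Write configuration for the Cosmos DB Spark connector (all values are placeholders).
write_config = {
    'Endpoint': 'https://<account>.documents.azure.com:443/',
    'Masterkey': '<key>',
    'Database': 'mydb',
    'Collection': 'mycoll',
    'Upsert': 'true'
}

(df.write
   .format('com.microsoft.azure.cosmosdb.spark')
   .mode('append')
   .options(**write_config)
   .save())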

Here's a simple example of measurements taken while loading 118K documents, using a minimum-spec Databricks cluster with just 1 worker.

Single Cosmos client in Python: 28 docs/sec @ 236 RUs (i.e. not pushing Cosmos at all)

Spark Cosmos DB adapter: 66 docs/sec @ >400 RUs (throttled due to the 400 RU limit)

...after bumping Cosmos DB up to 10K RUs: Spark Cosmos DB adapter, 1317 docs/sec @ >2.9K RUs (I don't think it ran long enough for accurate RU figures) - still the same minimum-spec cluster

You could also try Python multi-threading (I think it will help), and as CYMA said in the comments, you should be checking for throttling at Cosmos DB. My observation, though, is that a single Cosmos client isn't going to get you even to the minimum 400 RUs.
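As a rough sketch of the multi-threading idea (placeholder endpoint, key, collection link, and file list; it assumes the DocumentClient can be shared across threads - otherwise create one client per worker):

import json
from concurrent.futures import ThreadPoolExecutor
import pydocumentdb.document_client as document_client

client = document_client.DocumentClient(
    'https://<account>.documents.azure.com:443/',   # placeholder endpoint
    {'masterKey': '<key>'})                         # placeholder key
coll_link = 'dbs/mydb/colls/mycoll'                 # placeholder collection link

def upsert_file(path):
    # One upsert per JSON file; HTTP 429 (throttling) errors surface here,
    # which is where you would check for and handle rate limiting.
    with open(path) as f:
        return client.UpsertDocument(coll_link, json.load(f))

json_paths = ['obs1.json', 'obs2.json']             # placeholder file list
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(upsert_file, json_paths))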
