CouchDB Views: How much processing is acceptable in map reduce?

Original post: 2012-04-06 16:27:57 1 2 database/ nosql/ couchdb/ mapreduce

I've been toying around with Map Reduce in CouchDB. Some of the examples show possibly heavy logic within the map reduce functions. In one particular case, they were running for loops inside map.

Is map reduce run on every single possible document before it emits your selected documents?

If so, I would think that running any kind of iterative processing within the map reduce functions would increase the processing burden by at least an order of magnitude.

Basically it boils down to the following question: how much logic can be performed within map reduce before it becomes an unreasonably expensive query?

2 Answers

Lots of expensive processing is acceptable in CouchDB map-reduce.

CouchDB views (map-reduce) are more like CREATE INDEX than they are like SELECT FROM.

Specifically, CouchDB guarantees that a map function runs only once per document, ever. (Well, actually once per document change.) That is what "incremental map-reduce" means.
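To make the "once per document change" idea concrete, here is a toy model of an incremental index. It is only a sketch, not how CouchDB is implemented: `makeView`, the revision bookkeeping, and passing `emit` as a second argument are all assumptions made for illustration (in a real view, `emit` is provided by CouchDB).

```javascript
// Toy model of an incremental view index. NOT CouchDB's real implementation;
// makeView and its bookkeeping are hypothetical names for illustration only.
function makeView(mapFn) {
  const rowsByDoc = new Map();  // doc id -> rows emitted for that doc
  const indexedRev = new Map(); // doc id -> revision that was last mapped
  let mapCalls = 0;

  function update(docs) {
    for (const doc of docs) {
      // Skip documents whose revision has not changed since the last run.
      if (indexedRev.get(doc._id) === doc._rev) continue;
      const rows = [];
      mapFn(doc, (key, value) => rows.push({ key, value, id: doc._id }));
      mapCalls += 1;
      rowsByDoc.set(doc._id, rows);     // replace any rows from the old revision
      indexedRev.set(doc._id, doc._rev);
    }
  }

  return {
    update,
    mapCalls: () => mapCalls,
    rows: () => [...rowsByDoc.values()].flat(),
  };
}

const docs = [
  { _id: "a", _rev: "1", type: "post" },
  { _id: "b", _rev: "1", type: "comment" },
];
const view = makeView((doc, emit) => emit(doc.type, 1));

view.update(docs); // first build: map runs for both documents
view.update(docs); // nothing changed: map runs zero times
docs[0] = { _id: "a", _rev: "2", type: "page" };
view.update(docs); // only document "a" is mapped again

console.log(view.mapCalls()); // 3
```

However expensive the map function is, that cost is paid once per document revision rather than once per query.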

Therefore, suppose you had 10,000 documents and each takes 1 second to process (which is way higher than I have ever seen). That is 10,000 seconds, or about 2.8 hours, to completely build the view. However, once the view is complete, querying any row (?key=...) or row slice (?startkey=...&endkey=...) takes the same time as querying for documents directly. Lookup time is O(log n) in the document count.
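The arithmetic above, and the gap between build cost and lookup cost, can be checked with a rough sketch. A sorted array with binary search stands in for the view's B-tree here; that substitution and the `lookup` helper are assumptions for illustration, not CouchDB internals.

```javascript
// Build cost from the example above: 10,000 documents at 1 second each.
const docCount = 10000;
const secondsPerDoc = 1;
const buildHours = (docCount * secondsPerDoc) / 3600;
console.log(buildHours.toFixed(1)); // "2.8"

// A sorted array of keys stands in for the view index (assumption:
// the real index is a B-tree, but both give O(log n) lookups).
const keys = Array.from({ length: docCount }, (_, i) => i);

// Binary search that also counts how many comparisons it needed.
function lookup(sortedKeys, target) {
  let lo = 0;
  let hi = sortedKeys.length - 1;
  let steps = 0;
  while (lo <= hi) {
    steps += 1;
    const mid = (lo + hi) >> 1;
    if (sortedKeys[mid] === target) return { found: true, steps };
    if (sortedKeys[mid] < target) lo = mid + 1;
    else hi = mid - 1;
  }
  return { found: false, steps };
}

// Worst case over every key: at most 14 steps for 10,000 keys.
const worst = Math.max(...keys.map((k) => lookup(keys, k).steps));
console.log(worst);
```

The point of the comparison: the hours go into building the index once, while each lookup touches only a handful of entries.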

In other words, even if it takes 1 second per document to execute the map, it will take only a few milliseconds to fetch the result. (Of course, the view must be built first, since it is actually an index.)

Querying the database is an activity unrelated to the map/reduce of a document. Therefore the query cost is not affected by the complexity of the map/reduce.

In CouchDB you are querying an index. This means it is a copy of your data in a format optimized for query speed. A query is not like a table scan in SQL. It does not loop through records.

So how do you make this index? It is done through the map function. The map function emits a key and a value. The key is put in the index. Some of the complicated map functions you mention may loop and emit many keys and values. CouchDB is smart and only runs a document through the map function when it needs to, usually on creates, updates, and deletes. This is why it is incremental map/reduce.
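A looping map function of the kind described might look like the sketch below. The `tags` and `title` fields are hypothetical, and the `emit` defined here is only a local stand-in so the snippet runs outside CouchDB. Inside a design document you would keep just the `function (doc) { ... }` part, since CouchDB supplies `emit` itself.

```javascript
// Local stand-in for the emit() that CouchDB provides to map functions.
const rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Map function: one index row per tag, so a single document
// can contribute many keys. Field names are assumptions.
function map(doc) {
  if (!Array.isArray(doc.tags)) return; // documents without tags emit nothing
  for (var i = 0; i < doc.tags.length; i++) {
    emit(doc.tags[i], doc.title);
  }
}

map({ _id: "post-1", title: "Views in practice", tags: ["couchdb", "mapreduce"] });
map({ _id: "note-7", title: "No tags here" });

console.log(rows);
```

The loop runs when the document is indexed, not when the view is queried, so its cost is part of building the index.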

So as you might see, a complicated map function might impact create, update, and delete speed. But again CouchDB is smart in that you can specify how stale the data may be when you query the index.
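As a rough illustration of the staleness option, the sketch below builds a view query URL that uses the `stale=ok` parameter from the CouchDB releases of that era (newer releases document an `update` parameter for the same purpose, so check the docs for your version). The host, database name `blog`, design document `posts`, view name `by_tag`, and the `viewUrl` helper are all hypothetical.

```javascript
// Build a CouchDB view query URL. viewUrl is a hypothetical helper;
// the database, design document and view names are placeholders.
function viewUrl(base, db, ddoc, view, params) {
  const url = new URL(`${base}/${db}/_design/${ddoc}/_view/${view}`);
  for (const [name, value] of Object.entries(params)) {
    url.searchParams.set(name, value);
  }
  return url.toString();
}

const url = viewUrl("http://localhost:5984", "blog", "posts", "by_tag", {
  key: JSON.stringify("couchdb"), // view keys are passed as JSON
  stale: "ok",                    // accept the index as it currently stands
});

console.log(url);
```

With `stale=ok` the request is answered from the existing index without waiting for pending document changes to be mapped, trading freshness for response time.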