
Referencing an external doc in a CouchDB view

I am scraping a 90K-record database using JSON-RPC and I am trying to put in some basic error checking. I want to start by scraping the database twice using two different settings, adding a prefix to the document IDs from the second scrape. This way I can check that the two settings are not producing different records (due to dropped updates, etc.). I wanted to implement the comparison using a view which compares each document from the first scrape with its twin from the second scrape and then emits the names of the records that differ.

However, I cannot quite figure out how to pull another doc into the view. Everything I have read only discusses referencing external docs through the emit() function, which is too late for me to compare them. In the example below, the lookup() function would grab the referenced document.

Is this just not possible?

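function(doc) {
  // Skip the second-scrape twins ("$" prefix) and "_"-prefixed IDs.
  if (doc._id.slice(0, 1) !== '$' && doc._id.slice(0, 1) !== '_') {
    // lookup() is hypothetical: it stands for "fetch another doc by _id".
    var otherDoc = lookup('$test' + doc._id);
    if (otherDoc) {
      var same = true;
      Object.keys(doc).forEach(function(key) {
        if ((key.slice(0, 1) !== '_') && (key.slice(0, 1) !== '$') && (key !== 'expires')) {
          // Compare serialized values; JavaScript has no built-in Object.equal.
          if (JSON.stringify(otherDoc[key]) !== JSON.stringify(doc[key])) {
            same = false;
          }
        }
      });
      if (!same) {
        emit(doc._id, 1);
      }
    }
  }
}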

Context

You are correct that this is not possible in CouchDB. The whole point of the map function is that it must be idempotent; otherwise you lose all the benefits of a pre-calculated index.

This is why you cannot access external resources in the map function, whether they are other records or the clock. Any time you run a map, you must get the same result if you put the same record into it. Since there are no relationships between records in CouchDB, that guarantee cannot hold once the output depends on another record.

Solution

However, you can still achieve your end goal, just by different means. Some possibilities:

  • Assuming there is some meaningful numeric value in each doc, you could use a view to sum all those values, grouped by the import that produced them ( {key: <batch id>, value: <meaningful number>} ). Then compare the two numbers in your client or the browser to see if they match.

  • A brute-force approach would be to use a view to pair the docs that should match. Each doc is on a different row, but they are grouped by a common field. Then iterate through the entire index, comparing the pairs. This would certainly be the quickest to code, and it doesn't depend on your application or data.

  • Implement a validation function to enforce a schema on your data. Just be warned that this will reduce your write throughput, since each written record will be piped out of Erlang and into the JS engine. Also, this is only applicable if you're worried about properly formed records rather than their precise content, which might not be the case.

  • Instead of having your different batch jobs create different docs, have them write into the same doc. The structure might look like this: { "_id": "something meaningful", "batch_one": { ..data.. }, "batch_two": { ..data.. } } Then your validation function could compare them, or you could create a view that indexes all the docs that don't match. It all depends on where in your pipeline you want to do the error checking and correction.
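As a rough illustration of the brute-force pairing idea from the list above, here is a minimal sketch. It assumes second-scrape docs use the `$test` prefix from the question, and it ignores the same fields the question's filter does. The names `mapPair`, `comparable` and `findMismatches` are made up for this example, and the local `emit()` plus the sort are only a stand-in harness so the code runs outside CouchDB. In a real view you would likely emit `null` and query with `include_docs=true` rather than emitting the whole doc.

```javascript
// Local stand-in for CouchDB's emit(), only so this sketch runs outside CouchDB.
var rows = [];
function emit(key, value) { rows.push({ key: key, value: value }); }

// Map function as it might appear in a design document.
// Assumption: a second-scrape doc has _id = '$test' + <first-scrape _id>.
function mapPair(doc) {
  var isTest = doc._id.indexOf('$test') === 0;
  var baseId = isTest ? doc._id.slice(5) : doc._id;
  // Twins share baseId, so they end up on adjacent rows of the index.
  emit([baseId, isTest ? 1 : 0], doc);
}

// Serialize only the fields worth comparing (same filter as the question).
// Limitation: key order inside nested objects is not normalized here.
function comparable(doc) {
  var out = {};
  Object.keys(doc).sort().forEach(function (k) {
    if (k.charAt(0) !== '_' && k.charAt(0) !== '$' && k !== 'expires') {
      out[k] = doc[k];
    }
  });
  return JSON.stringify(out);
}

// Client-side pass over rows that are already sorted by key.
function findMismatches(sortedRows) {
  var mismatches = [];
  for (var i = 0; i < sortedRows.length; i++) {
    var a = sortedRows[i];
    var b = sortedRows[i + 1];
    if (b && a.key[0] === b.key[0]) {
      if (comparable(a.value) !== comparable(b.value)) mismatches.push(a.key[0]);
      i++; // skip the twin we just compared
    } else {
      mismatches.push(a.key[0]); // no twin in the other scrape
    }
  }
  return mismatches;
}

// Tiny made-up data set: 'a' matches, 'b' differs, 'c' has no twin.
var docs = [
  { _id: 'a', name: 'x', expires: 1 },
  { _id: '$testa', name: 'x', expires: 2 },
  { _id: 'b', name: 'y' },
  { _id: '$testb', name: 'z' },
  { _id: 'c', name: 'w' }
];
docs.forEach(mapPair);
// CouchDB returns view rows sorted by key; emulate that for the demo.
rows.sort(function (p, q) {
  if (p.key[0] < q.key[0]) return -1;
  if (p.key[0] > q.key[0]) return 1;
  return p.key[1] - q.key[1];
});
var mismatches = findMismatches(rows);
console.log(mismatches); // [ 'b', 'c' ]
```

The comparison itself happens in the client, so the map function stays a pure function of a single doc.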

Personally I like the last option better, but only if you don't plan to use the database as-is in production. I.e., you wouldn't want to carry around all that extra data in each record.
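For what it's worth, a view for that last option might look roughly like the sketch below. It assumes the `batch_one` / `batch_two` field names from the structure above. `mapMismatch` and `canonical` are names invented for this example, and the local `emit()` is only a harness so the code runs outside CouchDB. Because the function reads nothing but the doc it is given, it stays valid as a map function.

```javascript
// Local stand-in for CouchDB's emit(), only so this sketch runs outside CouchDB.
var emitted = [];
function emit(key, value) { emitted.push({ key: key, value: value }); }

// Map function: emits the _id of every doc whose two batches disagree.
function mapMismatch(doc) {
  // Serialize with sorted object keys so key order does not cause false alarms.
  function canonical(value) {
    if (Array.isArray(value)) {
      return '[' + value.map(canonical).join(',') + ']';
    }
    if (value && typeof value === 'object') {
      return '{' + Object.keys(value).sort().map(function (k) {
        return JSON.stringify(k) + ':' + canonical(value[k]);
      }).join(',') + '}';
    }
    return JSON.stringify(value);
  }

  var hasOne = doc.batch_one !== undefined;
  var hasTwo = doc.batch_two !== undefined;
  if (!hasOne && !hasTwo) return;          // not a scraped record
  if (hasOne !== hasTwo) {                 // only one scrape wrote this record
    emit(doc._id, 'missing batch');
    return;
  }
  if (canonical(doc.batch_one) !== canonical(doc.batch_two)) {
    emit(doc._id, 'differs');
  }
}

// Made-up sample docs: r1 matches (key order differs), r2 differs, r3 lacks a batch.
[
  { _id: 'r1', batch_one: { a: 1, b: [1, 2] }, batch_two: { b: [1, 2], a: 1 } },
  { _id: 'r2', batch_one: { a: 1 }, batch_two: { a: 2 } },
  { _id: 'r3', batch_one: { a: 1 } }
].forEach(mapMismatch);

console.log(emitted);
```

Querying such a view would then list only the records that need attention, and an empty result would mean the two scrapes agree.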

Hope that helps.

Cheers.


 