
Diff() between two collections in MongoDB

I have done research. I apologize if this is a duplicate question, but the solutions in other questions did not really fit my case, so I am asking a new one.

What is the best way to compare two collections with JavaScript?

I have thousands of scraped header sets stored as Mongo documents in this format:

{
    "url": "google.com",
    "headers": {
        "location": "http://www.google.com/",
        "content-type": "text/html; charset=UTF-8",
        "date": "Mon, 25 Mar 2013 18:12:08 GMT",
        "expires": "Wed, 24 Apr 2013 18:12:08 GMT",
        "cache-control": "public, max-age=2592000",
        "server": "gws",
        "content-length": "219",
        "x-xss-protection": "1; mode=block",
        "x-frame-options": "SAMEORIGIN"
    }
}

I ran my scraper today. In the future I will run it again and store the results in a second collection. I would also like to compare three specific headers, namely server, x-aspnet-version, and x-powered-by, and detect whether any of the integers in them have incremented.

What is the best way to iterate through two collections and do a diff()?

Am I doing it right? Any suggestions would be really appreciated.

A couple of suggestions:

You could use a combination of the url and the date accessed (at least part of the datetime object) as the _id for these objects, since from what I can tell you plan to scrape each url once a month.

Example:

{
    "_id": {
        "url": "www.google.com",
        "date": ISODate("2013-03-01"),
    },
    // Other attributes
}
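As a minimal sketch of how such an _id could be built in JavaScript: the helper name makeSnapshotId is hypothetical, and truncating the date to the first day of the month (in UTC) is an assumption based on the once-a-month scraping schedule.

```javascript
// Hypothetical helper: build a compound _id from a url and the scrape time.
// The date is truncated to the first day of its month (UTC) so that each
// url gets at most one document per month.
function makeSnapshotId(url, accessedAt) {
  const monthStart = new Date(
    Date.UTC(accessedAt.getUTCFullYear(), accessedAt.getUTCMonth(), 1)
  );
  // Key order matters for later range queries: url first, then date.
  return { url: url, date: monthStart };
}

const id = makeSnapshotId("www.google.com", new Date("2013-03-25T18:12:08Z"));
console.log(id.date.toISOString()); // 2013-03-01T00:00:00.000Z
```

Because _id is unique, inserting a second document for the same url and month would fail, which is one way to catch accidental duplicate scrapes.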

This yields performance, uniqueness, and query dividends (see this 4sq blog post). You could query with something like:

db.collection.find({
    "_id": {
        "$gte": {
            "url": yourUrl,
            "date": rangeStart
        },
        "$lt": {
            "url": yourUrl,
            "date": rangeEnd
        }
    }
})

This returns results sorted by url and then by date, which seems to be just what you want. You could also use this index to perform covered queries (over the _id field) if you just want the set of all the urls and months you have scraped, which would set you up nicely to go through each url one at a time.
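A rough sketch of building the range bounds for one url and one month follows. The helper name monthRangeFilter and the 1-based month argument are assumptions for illustration. Note that MongoDB compares embedded documents field by field in key order, so the bounds should list url before date, matching the stored _id.

```javascript
// Hypothetical helper: build the _id range filter for one url and one month.
// month is 1-based (1 = January); dates are UTC.
function monthRangeFilter(url, year, month) {
  const rangeStart = new Date(Date.UTC(year, month - 1, 1));
  // Date.UTC rolls month index 12 over into January of the next year.
  const rangeEnd = new Date(Date.UTC(year, month, 1));
  return {
    _id: {
      $gte: { url: url, date: rangeStart },
      $lt: { url: url, date: rangeEnd },
    },
  };
}

const filter = monthRangeFilter("www.google.com", 2013, 3);
// filter can then be passed to db.collection.find(filter)
```

To cover several months for the same url, widen rangeEnd rather than issuing one query per month.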

If there are specific attributes of the document that you want to compare (headers.server, for example) and a specific comparison you want to make (looking for any increment in version numbers, for example), I would use some kind of regex to grab the elements relevant to the version number (a quick and dirty one might simply retrieve all numeric elements) and graph them for each url. I assume this would let you visualize changes to server software over time. You could just as easily report whenever any of these attributes changed by scanning the documents in order and firing some event whenever the strings are not identical, then reporting either the whole change or just its numerical piece.
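The quick and dirty approach above might look roughly like this. All function names are hypothetical, and the sketch assumes each scrape has already been loaded into an array of documents shaped like the one in the question (for instance with a cursor's toArray()). It treats the digits in a header value as version components, which will misfire on values that contain unrelated numbers.

```javascript
// Header names taken from the question; everything else here is hypothetical.
const TRACKED_HEADERS = ["server", "x-aspnet-version", "x-powered-by"];

// Quick and dirty: pull every run of digits out of a header value.
function extractNumbers(value) {
  return (String(value || "").match(/\d+/g) || []).map(Number);
}

// Compare two number lists component by component, like version strings.
// Returns a negative, zero, or positive number.
function compareVersions(a, b) {
  const len = Math.max(a.length, b.length);
  for (let i = 0; i < len; i++) {
    const diff = (a[i] || 0) - (b[i] || 0);
    if (diff !== 0) return diff;
  }
  return 0;
}

// Report every tracked header whose string value differs between two scrapes.
function diffHeaders(oldHeaders, newHeaders, names = TRACKED_HEADERS) {
  const changes = [];
  for (const name of names) {
    const before = oldHeaders[name];
    const after = newHeaders[name];
    if (before === after) continue;
    changes.push({
      header: name,
      before: before,
      after: after,
      incremented:
        compareVersions(extractNumbers(after), extractNumbers(before)) > 0,
    });
  }
  return changes;
}

// Pair up documents from two scrapes by url and diff their headers.
function diffSnapshots(oldDocs, newDocs) {
  const oldByUrl = new Map(oldDocs.map((doc) => [doc.url, doc]));
  const report = [];
  for (const doc of newDocs) {
    const previous = oldByUrl.get(doc.url);
    if (!previous) continue; // url not seen in the earlier scrape
    const changes = diffHeaders(previous.headers || {}, doc.headers || {});
    if (changes.length > 0) report.push({ url: doc.url, changes: changes });
  }
  return report;
}
```

Calling diffSnapshots(marchDocs, aprilDocs) returns one entry per url whose tracked headers changed, with incremented set to true when the new value's numbers compare higher than the old ones. Indexing the older scrape in a Map keeps the comparison to a single pass over each array.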
