
Suggestions needed for using Node.js and MongoDB to detect platform changes of a site

I need some advice on a project I am working on.

The project requests HTTP headers from sites. An example of a scraped header, in Mongo document style, is below:

{
    "url": "google.com",
    "statusCode": 301,
    "headers": {
        "location": "http://www.google.com/",
        "content-type": "text/html; charset=UTF-8",
        "date": "Mon, 25 Mar 2013 13:50:31 GMT",
        "expires": "Wed, 24 Apr 2013 13:50:31 GMT",
        "cache-control": "public, max-age=2592000",
        "server": "gws",
        "content-length": "219",
        "x-xss-protection": "1; mode=block",
        "x-frame-options": "SAMEORIGIN"
    }
}

This project uses Node.js, JavaScript, and MongoDB. I currently have a few thousand of these responses stored in MongoDB, and I am interested in using some of the items in headers to detect platform changes. Headers like server, x-powered-by, and x-aspnet-version are all headers that, in my opinion, can be cross-referenced in the future. For example, if a website reports Microsoft-IIS/7.0 today and Microsoft-IIS/7.5 when I run this scraper again in two months, there is reason to believe the website was upgraded.

My question is: what is the best way to do this?

Should I make two collections, collectionToday and collectionInTwoMonths?

Then do a regex search for integer changes/increments in each of server, x-powered-by, and x-aspnet-version?

How would an implementation of this work?

Any suggestions will be appreciated.

There are a few ways you could do this. One would be, as you suggested, to create a separate collection for each time period and store the entire set of headers in each one. You could then find differences by running find for the url in each time period's collection, comparing the results on the application side, and reporting them.
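A minimal sketch of the application-side comparison, assuming each snapshot has already been loaded into an array of documents shaped like the one in the question (with the MongoDB Node.js driver that could be something like `find({}).toArray()` on each collection). The function name `compareSnapshots`, the `TRACKED_HEADERS` list, and the sample data are illustrative, not part of the original setup:

```javascript
// Headers assumed to carry platform information (taken from the question).
const TRACKED_HEADERS = ["server", "x-powered-by", "x-aspnet-version"];

// Compare two snapshots (arrays of scraped documents) and report, per URL,
// which tracked headers differ between the older and the newer run.
function compareSnapshots(oldDocs, newDocs, tracked = TRACKED_HEADERS) {
  // Index the older snapshot by URL so each lookup is cheap.
  const oldByUrl = new Map(oldDocs.map((doc) => [doc.url, doc]));
  const report = [];

  for (const newDoc of newDocs) {
    const oldDoc = oldByUrl.get(newDoc.url);
    if (!oldDoc) continue; // URL was not scraped in the older run

    const changes = {};
    for (const name of tracked) {
      const before = (oldDoc.headers || {})[name];
      const after = (newDoc.headers || {})[name];
      if (before !== after) {
        changes[name] = { before, after };
      }
    }
    if (Object.keys(changes).length > 0) {
      report.push({ url: newDoc.url, changes });
    }
  }
  return report;
}

// Hypothetical sample data standing in for the two collections.
const collectionToday = [
  { url: "example.com", headers: { server: "Microsoft-IIS/7.0" } },
  { url: "google.com", headers: { server: "gws" } },
];
const collectionInTwoMonths = [
  { url: "example.com", headers: { server: "Microsoft-IIS/7.5" } },
  { url: "google.com", headers: { server: "gws" } },
];

console.log(JSON.stringify(compareSnapshots(collectionToday, collectionInTwoMonths), null, 2));
```

Only example.com appears in the report here, because google.com's tracked headers are identical in both runs. A plain string comparison is used rather than a regex; it flags any change, including a downgrade or a switch to a different server product.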

Another way would be to store a "differences" collection that holds, for each point in time, the differences between the headers at that time and the headers from the last time you queried. This requires more application logic each time you query for the headers, but less work when you actually query the differences. This is what I would do.
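As a rough illustration of what one entry in such a "differences" collection could look like, here is a sketch that builds a diff document from the previously known headers and the freshly scraped ones. The document shape (`url`, `date`, `headers`), the function name `buildDiff`, and the choice to return `null` when nothing changed are assumptions for the example:

```javascript
// Headers assumed to carry platform information (taken from the question).
const TRACKED_HEADERS = ["server", "x-powered-by", "x-aspnet-version"];

// Build one "differences" document for a URL. Only tracked headers that are
// present in the current scrape and differ from the previous value are kept.
// Returns null when nothing changed, so the caller can skip the insert.
function buildDiff(url, previousHeaders, currentHeaders, date, tracked = TRACKED_HEADERS) {
  const changed = {};
  for (const name of tracked) {
    const current = currentHeaders[name];
    if (current !== undefined && current !== previousHeaders[name]) {
      changed[name] = current;
    }
  }
  if (Object.keys(changed).length === 0) return null;
  return { url, date, headers: changed };
}

// Hypothetical values for a single site.
const previous = { server: "Microsoft-IIS/7.0", "x-powered-by": "ASP.NET" };
const current = { server: "Microsoft-IIS/7.5", "x-powered-by": "ASP.NET" };

const diffDoc = buildDiff("example.com", previous, current, new Date("2013-05-25"));
console.log(diffDoc);
```

The resulting document holds only the `server` header, since `x-powered-by` did not change. Inserting it into the collection would be a single `insertOne(diffDoc)` call with the MongoDB Node.js driver, skipped when `diffDoc` is `null`.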

Edit

If those are the three headers you need, then I think that sounds good. Remember that when you query to find the differences, you need to compare against the last time each header changed. That means finding the last entry (timewise) in the collection that both corresponds to the correct url and has an entry for the header in question.

Pseudo-code for diffing:

for every url you want:
    query collection by url, sorting by date 
    for each header:
        find the last document with that field
        if the header value in that document and the current header are different:
            add the field to the new document
    add the new document, holding the url, date, and all different fields, to the collection
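The pseudo-code above could be sketched in JavaScript roughly as follows. This assumes the diff documents for one url have already been fetched into an array (for example with `find({ url }).sort({ date: -1 }).toArray()` in the MongoDB Node.js driver) and that each has the hypothetical shape `{ url, date, headers }`. Unlike the pseudo-code, this version returns `null` instead of an empty document when nothing changed:

```javascript
// Headers assumed to carry platform information (taken from the question).
const TRACKED_HEADERS = ["server", "x-powered-by", "x-aspnet-version"];

// history: earlier diff documents for this url, in any order.
// currentHeaders: headers from the scrape that just ran.
function diffAgainstHistory(url, history, currentHeaders, date, tracked = TRACKED_HEADERS) {
  // Sort a copy newest first so the first match is the most recent change.
  const newestFirst = [...history].sort((a, b) => b.date - a.date);
  const changed = {};

  for (const name of tracked) {
    if (!(name in currentHeaders)) continue;
    // Find the last document that recorded a value for this header.
    const last = newestFirst.find((doc) => doc.headers && name in doc.headers);
    if (!last || last.headers[name] !== currentHeaders[name]) {
      changed[name] = currentHeaders[name];
    }
  }
  if (Object.keys(changed).length === 0) return null;
  return { url, date, headers: changed };
}

// Hypothetical history: the first scrape stored everything, the second only
// stored the header that changed.
const history = [
  {
    url: "example.com",
    date: new Date("2013-03-25"),
    headers: { server: "Microsoft-IIS/7.0", "x-powered-by": "ASP.NET" },
  },
  {
    url: "example.com",
    date: new Date("2013-05-25"),
    headers: { server: "Microsoft-IIS/7.5" },
  },
];
const scraped = { server: "Microsoft-IIS/7.5", "x-powered-by": "PHP/5.4" };

const newDiff = diffAgainstHistory("example.com", history, scraped, new Date("2013-07-25"));
console.log(newDiff);
```

In this example `server` is compared against the May document and `x-powered-by` against the March document, because the May document has no entry for it. That is why each header needs its own lookup rather than a single comparison with the most recent document.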
