
Crawling Version Control System

I want to crawl certain projects on GitHub, for example source code created by a particular author, subject to some additional constraints. Is there a Nutch plugin for crawling this information, or what is the best way to get whole repositories crawled?

I would also like to use Nutch to crawl the version history of publicly hosted version control systems. Is there a plugin available for that?

GitHub comes with a JSON API. Use the repository API to get the list of repositories for a specific user and then clone them. It should be a matter of a few lines of shell.

See the GitHub API documentation for details.
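The approach above (list a user's repositories through the API, then clone each one) can be sketched roughly as follows. This is a minimal Python sketch rather than the shell one-liner the answer has in mind. It assumes the public REST endpoint `GET /users/{user}/repos` with its `per_page` and `page` parameters and the `clone_url` and `fork` fields of each result. The helper names (`repos_url`, `clone_urls`, `fetch_repos`, `clone_all`) and the choice to skip forks are illustrative assumptions, not part of the original answer.

```python
import json
import subprocess
import urllib.parse
import urllib.request

API_ROOT = "https://api.github.com"


def repos_url(user, page=1, per_page=100):
    """Build the URL for one page of a user's public repositories."""
    return (f"{API_ROOT}/users/{urllib.parse.quote(user)}/repos"
            f"?per_page={per_page}&page={page}")


def clone_urls(repos, include_forks=False):
    """Pick the HTTPS clone URLs out of a list of repository objects.

    Skipping forks is an assumption made here so that only code the
    author created is kept; pass include_forks=True to keep everything.
    """
    return [r["clone_url"] for r in repos
            if include_forks or not r.get("fork", False)]


def fetch_repos(user):
    """Fetch all pages of the repository list (performs network requests)."""
    repos, page = [], 1
    while True:
        with urllib.request.urlopen(repos_url(user, page)) as resp:
            batch = json.load(resp)
        if not batch:          # an empty page means there is nothing left
            return repos
        repos.extend(batch)
        page += 1


def clone_all(user, dest="."):
    """Clone every selected repository of `user` into the directory `dest`."""
    for url in clone_urls(fetch_repos(user)):
        subprocess.run(["git", "clone", url], cwd=dest, check=True)
```

Calling `clone_all("some-user")` would then clone that user's non-fork repositories into the current directory. Unauthenticated API requests are rate limited, so a larger job would probably need an access token.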

Nutch is a search engine, made by Apache, based on a Lucene backend.

Take a look at GitHub's robots.txt file: https://github.com/robots.txt

Apart from rules for specific engines (e.g. Google), it says:

User-agent: *
Disallow: /

Therefore you cannot crawl GitHub with Nutch, which obeys robots.txt.
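To see how a well-behaved crawler reads such a rule set, here is a small sketch using Python's standard `urllib.robotparser`. The `SAMPLE_ROBOTS_TXT` text is a simplified stand-in, not GitHub's actual file: the `Googlebot` group and its `Allow: /` line are assumptions that only illustrate "specific engines are treated differently". The helper name `may_crawl` is likewise hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Simplified, assumed rule set: one specific engine is allowed,
# every other user agent is disallowed from the whole site.
SAMPLE_ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def may_crawl(robots_txt, user_agent, url):
    """Return True if `user_agent` is permitted to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())   # parse rules from text, no network
    return parser.can_fetch(user_agent, url)
```

With these sample rules, `may_crawl(SAMPLE_ROBOTS_TXT, "Nutch", "https://github.com/apache/nutch")` is false because only the catch-all `*` group applies to that agent, while the same call with `"Googlebot"` is true.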

Crawling GitHub with a search engine seems like a bad idea. You would be downloading a lot of similar pages for no reason. What's wrong with GitHub's own search?

Please try to generalise your question. What do you hope to achieve by crawling GitHub with Nutch? What kind of searches do you want to perform?

