
Confusion in Apache Nutch, HBase, Hadoop, Solr, Gora

Original post: 2015-03-26 19:26:43. Tags: hadoop, solr, hbase, nutch, apache-zookeeper

I am new to all of these terms and have been given some time to understand them, but I am confused about a few things. Please correct me if I am wrong.

Nutch: It is for web crawling; using it we can crawl web pages and store them somewhere in a database.

Solr: Solr can be used for indexing web pages crawled by Apache Nutch. It helps in searching the indexed web pages.

HBase: It is used as an interface to interact with Hadoop. It helps in getting data from HDFS in real time. It provides a simple SQL-like interface for interacting.

Hadoop: It provides two functionalities: one is HDFS (Hadoop Distributed File System) and the other is MapReduce, taken from Google's algorithms. It is basically used for offline data backup etc.

Gora and ZooKeeper: I am not sure about these.

Confusions:

1). Is HBase a key-value pair DB or just an interface to Hadoop? Or should I ask, can HBase exist without Hadoop? If yes, can you explain a bit more about its usage?

2). Is there any use in crawling data with Apache Nutch without indexing it into Solr?

3). For running Apache Nutch, do we need HBase and Hadoop? If not, how can we make it work without them?

4). Is Hadoop part of HBase?

1 Answer

Here is a good short discussion of HBase vs. Hadoop: "Difference between HBase and Hadoop/HDFS".

Because HBase is built on top of Hadoop and normally stores its data in HDFS, you can't really have HBase without Hadoop.
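On the "key-value DB or interface" part of the question, it may help to picture an HBase table as a sorted map from row key to columns to timestamped values. The sketch below is only a toy in-memory model of that idea in plain Python. `ToyTable`, its methods, and the sample rows are hypothetical names made up for illustration; this is not the HBase client API and nothing here touches HDFS.

```python
from collections import defaultdict


class ToyTable:
    """Toy in-memory stand-in for an HBase-style table (NOT the HBase API)."""

    def __init__(self):
        # row key -> "family:qualifier" -> list of (timestamp, value) versions
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value, ts):
        """Store one more version of a cell."""
        self._rows[row][column].append((ts, value))

    def get(self, row, column):
        """Return the value with the newest timestamp, or None if absent."""
        versions = self._rows.get(row, {}).get(column, [])
        return max(versions)[1] if versions else None

    def scan(self, start, stop):
        """Yield (row, {column: latest value}) for start <= row < stop, in key order."""
        for row in sorted(self._rows):
            if start <= row < stop:
                yield row, {c: max(v)[1] for c, v in self._rows[row].items()}


# Hypothetical sample data: two versions of one cell, plus a second row.
table = ToyTable()
table.put("com.example/", "content:html", "<html>v1</html>", ts=1)
table.put("com.example/", "content:html", "<html>v2</html>", ts=2)
table.put("org.apache/", "meta:status", "fetched", ts=1)

latest = table.get("com.example/", "content:html")
# Range scan over every row key that starts with "com." ("/" sorts right after ".").
com_rows = [row for row, _ in table.scan("com.", "com/")]
```

Lookups by row key and range scans over sorted keys are the access pattern this model is meant to convey; in a real cluster the persistence underneath would be HDFS rather than a Python dict.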

Yes, you can run Nutch without Solr; there do not seem to be many use cases for it, however, and even fewer live examples in the wild.
Yes, you can run Nutch without Hadoop, but again there don't seem to be many real-world examples of people doing this.
Yes, Hadoop is part of HBase in the sense that there is no HBase without Hadoop, but of course Hadoop is used for other things as well.
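To make the crawl/index split concrete, here is a rough sketch of the two stages as separate steps: crawled pages are just stored data, and keyword search only becomes possible once an index is built over them. Everything in it is a made-up illustration in plain Python (the `crawled` dict, `build_index`, and `search` are hypothetical); it does not use Nutch or Solr.

```python
import re
from collections import defaultdict

# Hypothetical crawl output: URL -> fetched page text (stand-in for a crawl store).
crawled = {
    "http://example.com/a": "Apache Nutch crawls web pages",
    "http://example.com/b": "Apache Solr indexes crawled pages",
}


def build_index(pages):
    """Build an inverted index: lowercase term -> set of URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in re.findall(r"\w+", text.lower()):
            index[term].add(url)
    return index


def search(index, term):
    """Return the sorted URLs whose text contains the term."""
    return sorted(index.get(term.lower(), set()))


# The crawl data is usable on its own, e.g. for counting or archiving pages...
page_count = len(crawled)
# ...but keyword lookup needs the separate indexing step.
index = build_index(crawled)
hits = search(index, "pages")
```

So crawling without indexing can still be useful when you only need the raw pages (archiving, offline analysis); the index matters when you want search over them.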

ZooKeeper is used for configuration, naming, synchronization, etc. in Hadoop stack workflows. Gora is an in-memory data model and persistence framework, and is built to work with Hadoop.