
Solr HBase search engine

Original post: 2015-08-07 00:12:27. Tags: hadoop, search, solr, hbase, hdfs

I need to use SolrCloud as the search engine on top of HBase and HDFS to search a very large number of documents.

Currently these documents live in different data sources. I am unsure whether Solr should search, index, and store the documents itself, or whether Solr should be used only for indexing while the documents and their metadata reside in the HBase/HDFS layer.

I have tried to find out how Solr and HBase integrate best (that is, what should be done at the Solr level and what at the Hadoop level), but in vain. Has anyone done this kind of big-data search before and can give some pointers? Thanks.

2 answers

Solr provides fast search through its indexes, which are inverted indexes. When you index documents into Solr, it builds those indexes, and the way you have defined schema.xml determines how they are built. The indexes and the stored field values are kept in HDFS (depending on your configuration in solrconfig.xml).
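To make "inverted index" concrete, here is a minimal sketch in plain Python. It is not how Solr or Lucene implement it internally (there is no analysis chain, scoring, or on-disk format here); the function names and sample documents are made up for illustration. It only shows the core idea: each term maps to the ids of the documents that contain it, so a lookup does not have to scan every document.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Copy the first posting set, then intersect with the rest.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample documents keyed by an id.
docs = {
    "row-1": "Solr builds inverted indexes",
    "row-2": "HBase stores the full records",
    "row-3": "Solr and HBase stay in sync",
}
index = build_inverted_index(docs)
print(sorted(search(index, "solr hbase")))  # ['row-3']
```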

As for HBase, you can run your queries directly on HBase without having to use Solr. SolrBase is an available Solr and HBase integration. Also have a look at Lily.

A good design is to search in Solr to get the IDs of the matching records quickly and then, if needed, fetch the entire record from HBase. Make sure that the entire data set is in HBase and that only the data needed for search is indexed. Needless to say, Solr and HBase must be kept in sync. One ready-made framework for this is NGDATA/hbase-indexer.
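The design above can be sketched roughly as follows. This is only an illustration with in-memory stand-ins, not real client code: `hbase_table` and `solr_index` are plain dicts that play the roles of the HBase table and the Solr index, and `INDEXED_FIELDS`, the function names, and the sample records are all assumptions. In a real setup the two write steps would go through your HBase and Solr clients (or a tool such as hbase-indexer would do the indexing side for you).

```python
# Hypothetical in-memory stand-ins: `hbase_table` plays the HBase table
# (row key -> full record) and `solr_index` plays the Solr index
# (row key -> indexed fields only).
INDEXED_FIELDS = ("title", "author")  # assumed: the only fields worth searching

hbase_table = {}
solr_index = {}


def put_record(row_key, record):
    """Write the full record to the store, then index only the selected fields."""
    hbase_table[row_key] = dict(record)
    solr_index[row_key] = {f: record[f] for f in INDEXED_FIELDS if f in record}


def delete_record(row_key):
    """Remove the record from both sides so they stay in sync."""
    hbase_table.pop(row_key, None)
    solr_index.pop(row_key, None)


def search_ids(field, term):
    """Step 1: ask the index for the matching row keys only."""
    term = term.lower()
    return sorted(
        key for key, fields in solr_index.items()
        if term in str(fields.get(field, "")).lower().split()
    )


def fetch_records(row_keys):
    """Step 2: fetch full records by row key, skipping keys no longer stored."""
    return [hbase_table[k] for k in row_keys if k in hbase_table]


put_record("doc-001", {"title": "Scaling search", "author": "ann", "body": "long text ..."})
put_record("doc-002", {"title": "Search on HDFS", "author": "bob", "body": "more text ..."})

ids = search_ids("title", "search")   # row keys come from the index
full = fetch_records(ids)             # full records come from the store
```

Routing every write and delete through one function is just one simple way to keep the two sides from drifting apart; the large `body` field never reaches the index, which keeps it small.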

Solr works very well for counts, grouped counts, and stats. Once you have those numbers and the record IDs, HBase can take over: with the row key (the ID) in hand, HBase lookups are low latency, which also suits web applications.
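As a rough sketch of what such a request could look like, the helper below builds a Solr select URL that asks for matching ids plus per-value counts (`facet.field`) and numeric stats (`stats.field`). The request parameters are standard Solr ones, but the host, the collection name `documents`, the field names `author` and `size_bytes`, and the assumption that the unique key field is called `id` are all placeholders for illustration. The code only builds the URL; it does not contact a server.

```python
from urllib.parse import urlencode


def build_count_query(base_url, collection, query, facet_field, stats_field, rows=10):
    """Build a select URL that returns ids, facet counts, and field stats."""
    params = [
        ("q", query),
        ("fl", "id"),                  # return only the id (assumed unique key field)
        ("rows", rows),
        ("facet", "true"),
        ("facet.field", facet_field),  # per-value counts for this field
        ("stats", "true"),
        ("stats.field", stats_field),  # min, max, sum, etc. for this field
        ("wt", "json"),
    ]
    return f"{base_url}/solr/{collection}/select?{urlencode(params)}"


# Hypothetical host, collection, and field names.
url = build_count_query(
    "http://localhost:8983", "documents", "title:search", "author", "size_bytes"
)
print(url)
```

The ids in the response are then the row keys to pass to HBase for the full records.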