
Recommendations for a data processing (MapReduce / DHT?) framework

I have a need to perform distributed searching across a largish set of small files (~10 million), with each file being a set of key:value pairs. I have a set of servers with a total of 56 CPU cores available for this: mostly dual-core and quad-core machines, but also a large DL785 with 16 cores.

The system needs to be designed for online queries; ideally I'm looking to implement a web service which returns JSON output on demand to a front-end.
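To make that concrete, here is a minimal sketch of the kind of endpoint I have in mind, using the JDK's built-in com.sun.net.httpserver purely as a stand-in for whatever actually gets built; the runQuery() hook into the search backend is hypothetical:

    import com.sun.net.httpserver.HttpServer;
    import java.io.OutputStream;
    import java.net.InetSocketAddress;
    import java.nio.charset.StandardCharsets;

    public class SearchService {
        public static void main(String[] args) throws Exception {
            // Listen on port 8080 and expose a single /search endpoint.
            HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
            server.createContext("/search", exchange -> {
                // runQuery() is a hypothetical hook into whatever backend
                // (Solr, HBase, ...) ends up answering the query.
                String json = runQuery(exchange.getRequestURI().getQuery());
                byte[] body = json.getBytes(StandardCharsets.UTF_8);
                exchange.getResponseHeaders().set("Content-Type", "application/json");
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream os = exchange.getResponseBody()) {
                    os.write(body);
                }
            });
            server.start();
        }

        private static String runQuery(String queryString) {
            return "{\"results\": []}"; // placeholder
        }
    }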

To further complicate matters, for any particular search I'll sometimes only want to look at the latest version of each file, but other searches may apply only to the versions of files that existed on a particular date.

I've looked at Hadoop, but its administration is pretty horrible, and the default job-submission methods are slow. It appears to be designed for offline, very large-scale processing rather than for online data processing.

CouchDB looks nice as a document store and understands key:value style documents, versioning, and MapReduce, but I can't find anything about how it can be used as a distributed MapReduce system. All of the clustering documentation talks about using clustering and replication of the entire database for load-balancing, whereas what I need is load-distribution.

I've also investigated various DHTs, and whilst they're fine for actually storing and retrieving individual records, they're generally poor at doing the 'map' part of MapReduce. Iterating over the complete document set is crucial.

Hence my ideal system would comprise a distributed file system like Hadoop's HDFS, with the web-service capabilities of CouchDB.

Can anyone point me in the direction of anything that might help? Implementation language isn't too much of a concern, except that it must run on Linux.

It seems like the problem domain would be better suited to a solution like Solr. Solr offers HTTP interfaces to other applications, even JSON. You could partition the search across multiple machines, or distribute a single copy across machines for load balancing (master/slave). It would depend on what works best for your data. But in my experience, for real-time search results Lucene/Solr is going to beat any system based on map/reduce.
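As a rough sketch of that HTTP interface (host names and field names here are invented; this assumes Solr's classic distributed search, where the shards parameter fans a single query out across partitioned indexes):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class SolrQueryExample {
        public static void main(String[] args) throws Exception {
            // wt=json asks Solr to return JSON; the shards parameter
            // spreads the query across partitioned indexes.
            String url = "http://solr1:8983/solr/select"
                    + "?q=somekey:somevalue"
                    + "&wt=json"
                    + "&shards=solr1:8983/solr,solr2:8983/solr";
            HttpResponse<String> resp = HttpClient.newHttpClient().send(
                    HttpRequest.newBuilder(URI.create(url)).build(),
                    HttpResponse.BodyHandlers.ofString());
            System.out.println(resp.body()); // raw JSON results
        }
    }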

It's very simple to integrate Solr into an application and to do incremental updates. It doesn't really have any idea of versioning, though. If that's really necessary you might have to find another way to tack it on.
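For instance, an incremental update with the SolrJ client (assuming SolrJ is on the classpath; the URL and field names are placeholders) is only a few lines, and re-adding a document with the same unique key simply replaces the previous version:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class SolrUpdateExample {
        public static void main(String[] args) throws Exception {
            SolrClient client =
                    new HttpSolrClient.Builder("http://solr1:8983/solr/files").build();
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "file-12345");      // unique key: re-adding overwrites
            doc.addField("somekey", "somevalue");  // the key:value pairs from the file
            client.add(doc);
            client.commit();   // make the update visible to searches
            client.close();
        }
    }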

I may be a bit confused about what your application's needs are: you mention needing to be able to search through key/value pairs, for which Solr would be a great application. But you also mention needing to use the map part of map/reduce, and that you need to scan 10M documents. I'm not sure you're going to find a solution that will scan 10M documents and return results in an online fashion (in the millisecond range). But another solution is to look at HBase. This builds on top of HDFS and allows you to run map/reduce jobs of the type that you want over millions of smaller items. But a job isn't going to be submittable and finish anywhere near the time you're looking for.
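A hedged sketch of that kind of HBase scan job (the table, column family and qualifier names are invented; it assumes HBase's TableMapper/TableMapReduceUtil MapReduce integration):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    public class ScanAllItems {
        // The mapper is invoked once per row; Result holds that row's cells.
        static class ItemMapper extends TableMapper<Text, IntWritable> {
            @Override
            protected void map(ImmutableBytesWritable row, Result value, Context context)
                    throws java.io.IOException, InterruptedException {
                byte[] v = value.getValue(Bytes.toBytes("content"), Bytes.toBytes("body"));
                if (v != null) {
                    context.write(new Text(row.get()), new IntWritable(v.length));
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = Job.getInstance(conf, "scan-all-items");
            job.setJarByClass(ScanAllItems.class);
            Scan scan = new Scan();
            scan.setCaching(500);        // batch rows per RPC for scan throughput
            scan.setCacheBlocks(false);  // don't pollute the block cache on full scans
            TableMapReduceUtil.initTableMapperJob(
                    "items", scan, ItemMapper.class, Text.class, IntWritable.class, job);
            job.setOutputFormatClass(NullOutputFormat.class); // map-only sketch, no output
            job.setNumReduceTasks(0);
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }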

I currently have a test HBase set up with RSS items (2M items, several KB per item). Total DB size is ~5 GB. There are several jobs that run against this DB, scanning all of the items and then outputting results. The cluster will scan items at ~5,000/second, but it still takes around 10 minutes to complete a job (2M items at that rate is already close to 7 minutes of pure scan time, before job startup and output overhead).
