
Custom map-reduce input formatter for Cassandra using native protocol

Originally posted 2014-04-21 12:45:22. Tags: java, hadoop, mapreduce, cassandra, datastax-java-driver

I am using Apache Cassandra (1.2) and Apache Map-Reduce to crunch some data. At the moment I use CqlPagingInputFormat from org.apache.cassandra.hadoop.cql3. This provider uses Thrift to pull data. Thrift seems to be fairly slow (reading 300M records from a 3-node cluster takes 8+ hours), and since a native binary protocol exists, I wonder if anyone has used it.

I am not interested in any other optimizations or configuration tweaks; that's a separate issue.

My questions are:

Is there an implementation of a map-reduce input formatter that directly uses the Cassandra native protocol?
If not, what would be the first steps to write my own, for example using the DataStax driver?

1 Answer

Cassandra 2.0.7 includes native protocol analogs for the CQL Hadoop classes:

org.apache.cassandra.hadoop.cql3.CqlInputFormat
org.apache.cassandra.hadoop.cql3.CqlRecordReader
org.apache.cassandra.hadoop.cql3.CqlConfigHelper

The WordCount code in examples/hadoop_cql3_word_count has been updated to use these classes.

The JIRA that introduced this is https://issues.apache.org/jira/browse/CASSANDRA-6311
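
On the second question (writing your own), the usual starting point for a Cassandra input format is to divide the token ring into ranges and have each record reader page through one range with a token() query. The following is only a rough sketch of that idea, not the code used by the classes above. It assumes the cluster uses Murmur3Partitioner (tokens are signed 64-bit integers), and the names TokenRangeSplits, Range, split and rangeQuery are made up for illustration. A real input format would align its splits with the token ranges and replicas the cluster actually reports, and would execute the query through the DataStax driver.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: divides the Murmur3 token ring into contiguous ranges. */
final class TokenRangeSplits {

    /** Token range (start, end], matching "token(k) > start AND token(k) <= end". */
    static final class Range {
        final long start;
        final long end;

        Range(long start, long end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "(" + start + ", " + end + "]";
        }
    }

    /** Splits [Long.MIN_VALUE, Long.MAX_VALUE] into n ranges of near-equal width. */
    static List<Range> split(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        BigInteger min = BigInteger.valueOf(Long.MIN_VALUE);
        // Width of the whole ring; BigInteger avoids overflowing a long.
        BigInteger span = BigInteger.valueOf(Long.MAX_VALUE).subtract(min);
        BigInteger count = BigInteger.valueOf(n);

        List<Range> ranges = new ArrayList<>(n);
        long start = Long.MIN_VALUE;
        for (int i = 1; i <= n; i++) {
            // The last range always ends at the top of the ring.
            long end = (i == n)
                    ? Long.MAX_VALUE
                    : min.add(span.multiply(BigInteger.valueOf(i)).divide(count)).longValueExact();
            ranges.add(new Range(start, end));
            start = end; // next range begins where this one ended
        }
        return ranges;
    }

    /** Builds the per-split CQL; the two '?' markers are bound to a range's start and end. */
    static String rangeQuery(String keyspace, String table, String partitionKey) {
        return "SELECT * FROM " + keyspace + "." + table
                + " WHERE token(" + partitionKey + ") > ?"
                + " AND token(" + partitionKey + ") <= ?";
    }

    public static void main(String[] args) {
        // Hypothetical keyspace, table and key names.
        System.out.println(rangeQuery("my_keyspace", "my_table", "id"));
        for (Range r : split(4)) {
            System.out.println(r);
        }
    }
}
```

Each Range would correspond to one Hadoop input split, and the record reader for that split would bind the range's start and end to the two placeholders in the query.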