
Custom map-reduce input formatter for Cassandra using native protocol

I am using Apache Cassandra (1.2) and Hadoop Map-Reduce to crunch some data. At the moment I use CqlPagingInputFormat from org.apache.cassandra.hadoop.cql3. This input format uses Thrift to pull data. Thrift seems fairly slow (reading 300M records from a 3-node cluster takes 8+ hours), and since a native binary protocol exists, I wonder if anyone has used it instead.

I am not interested in any other optimizations or configuration tweaks; that's a separate issue.

My questions are:

  1. Is there an implementation of a map-reduce input formatter that directly uses the Cassandra native protocol?

  2. If not, what would be the first steps to write my own, for example using a DataStax driver?

Cassandra 2.0.7 includes native protocol analogs for the CQL Hadoop classes:

org.apache.cassandra.hadoop.cql3.CqlInputFormat

org.apache.cassandra.hadoop.cql3.CqlRecordReader

org.apache.cassandra.hadoop.cql3.CqlConfigHelper

The WordCount code in examples/hadoop_cql3_word_count has been updated to use these classes.

The JIRA that introduced this is https://issues.apache.org/jira/browse/CASSANDRA-6311
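
On the second question: if you still want to roll your own input format on top of a driver, one first step is deciding how to carve the token ring into input splits, each of which becomes one token-bounded CQL query. Below is a minimal sketch in plain Java, assuming Murmur3Partitioner (tokens in the signed 64-bit range) and evenly sized ranges. The class and method names (TokenRangeSplitter, split, queryFor) and the keyspace, table and key names are hypothetical, and a real input format would normally size its splits from cluster metadata rather than cutting the ring evenly.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Cassandra or any driver.
public class TokenRangeSplitter {
    // Assumption: Murmur3Partitioner, whose tokens span the signed 64-bit range.
    static final BigInteger MIN_TOKEN = BigInteger.valueOf(Long.MIN_VALUE);
    static final BigInteger MAX_TOKEN = BigInteger.valueOf(Long.MAX_VALUE);

    /** One split: rows whose token lies in (start, end]. */
    static final class TokenRange {
        final long start;
        final long end;

        TokenRange(long start, long end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Cuts the whole ring into numSplits contiguous ranges of near-equal width. */
    static List<TokenRange> split(int numSplits) {
        if (numSplits < 1) {
            throw new IllegalArgumentException("numSplits must be >= 1");
        }
        BigInteger width = MAX_TOKEN.subtract(MIN_TOKEN); // BigInteger avoids long overflow
        BigInteger n = BigInteger.valueOf(numSplits);
        List<TokenRange> ranges = new ArrayList<>();
        // Assumption: no row is ever assigned Long.MIN_VALUE as its token,
        // so an exclusive lower bound on the first range loses nothing.
        long start = Long.MIN_VALUE;
        for (int i = 1; i <= numSplits; i++) {
            long end = (i == numSplits)
                    ? Long.MAX_VALUE
                    : MIN_TOKEN.add(width.multiply(BigInteger.valueOf(i)).divide(n))
                               .longValueExact();
            ranges.add(new TokenRange(start, end));
            start = end; // the next range begins where this one ends
        }
        return ranges;
    }

    /** Builds the CQL a record reader could run for one split. */
    static String queryFor(String keyspace, String table, String partitionKey, TokenRange r) {
        return "SELECT * FROM " + keyspace + "." + table
                + " WHERE token(" + partitionKey + ") > " + r.start
                + " AND token(" + partitionKey + ") <= " + r.end;
    }

    public static void main(String[] args) {
        // "my_keyspace", "my_table" and "id" are placeholder names.
        for (TokenRange r : split(4)) {
            System.out.println(queryFor("my_keyspace", "my_table", "id", r));
        }
    }
}
```

Each generated query could then be handed to one record reader, which would run it through a driver session and page through the rows. Even splits are the simplest choice here; they ignore data skew and replica locality, both of which the built-in input formats are designed to take into account.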
