Cassandra vs HBase for Hadoop jobs

Original 2015-11-05 09:56:26 5 1 hadoop/ cassandra/ hbase

What are the advantages of Cassandra over HBase when it comes to MapReduce jobs?

I have a lot of small files that I would like to move from HDFS to a database, and those files would be the input for MapReduce jobs. I would not read all files, only those for a certain user, so possibly a whole row or at least a column family. I might also read only the files from a certain period.

I know that HBase is the Hadoop database, so I expect that it integrates well for what I need, but I have also read that Cassandra has much better performance. I would like to know what the situation is when it is used as input for MapReduce jobs. Is Cassandra's performance still a lot better than HBase's in that case?

I must emphasize that I'm not looking for a general comparison of HBase and Cassandra, but for the concrete case of MapReduce jobs. Questions like this do not talk concretely about performance for MapReduce jobs. Also, I'm looking for fresh information (the question I mentioned is from 2011, and I believe there may have been some changes since then).

1 answer

Both databases have great read and write performance. For bulk reads, HBase possibly performs slightly better than Cassandra. But I have two use cases where HBase will work significantly faster than Cassandra, due to its design.

First, when the MapReduce job needs only a portion of the data, selected by column name, e.g. HTML pages and some information parsed from them. You put the HTML in one column family and the parsed information in another. Different column families are stored in different files in HDFS, so to read one you don't need to read the other. This gives a significant performance benefit when you need to read only the parsed data, which occupies several times less disk space than the HTML. With Cassandra you would need to read the whole table.
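
To make the first case concrete, here is a toy Python model of the idea, not HBase code. It assumes each column family lives in its own store, as described above, and counts how many bytes a scan touches. The `ToyTable` class and its methods are made up for illustration only.

```python
# Toy model only: each column family is kept in its own store, loosely
# mirroring the description of families living in separate files.
# None of these names are HBase APIs.

class ToyTable:
    def __init__(self, families):
        # One independent dict per column family: {row_key: value_bytes}
        self.stores = {family: {} for family in families}

    def put(self, row_key, family, value):
        self.stores[family][row_key] = value

    def scan_family(self, family):
        """Return (rows, bytes_read), touching only the requested family."""
        store = self.stores[family]
        rows = sorted(store.items())
        bytes_read = sum(len(value) for _, value in rows)
        return rows, bytes_read


table = ToyTable(["html", "parsed"])
table.put("example.com/a", "html", b"<html>" + b"x" * 1000 + b"</html>")
table.put("example.com/a", "parsed", b"title=A")
table.put("example.com/b", "html", b"<html>" + b"y" * 2000 + b"</html>")
table.put("example.com/b", "parsed", b"title=B")

# Scanning only the small family never touches the large HTML values.
_, parsed_bytes = table.scan_family("parsed")
_, html_bytes = table.scan_family("html")
print(parsed_bytes, html_bytes)
```

In this sketch a job that only needs the parsed data reads a small fraction of what a full-table read would cost, which is the effect the first use case relies on.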

Second, when you need to access data ordered by row key, or some part of the table based on that order, e.g. reading the HTML pages of some domain. With HBase you can use a row key that is the concatenation of domain and URL. HBase has a good balancer for unhashed row keys. Cassandra does not, so you would either have to use some trick for balancing in this case or scan the whole table.
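
The second case can be sketched in plain Python as well. This is only an illustration of why ordered row keys help, not HBase code: the `|` separator, the key layout, and all function names are my own assumptions. Because the keys are sorted, all pages of one domain form a contiguous range that can be located without scanning everything.

```python
import bisect

def make_row_key(domain, url_path):
    # Hypothetical key layout: domain first, so pages of one domain sort together.
    return f"{domain}|{url_path}".encode("utf-8")

def prefix_stop_key(prefix):
    """Smallest key that sorts after every key starting with `prefix`."""
    trimmed = prefix.rstrip(b"\xff")
    if not trimmed:
        return None  # no upper bound: scan to the end
    return trimmed[:-1] + bytes([trimmed[-1] + 1])

def range_scan(sorted_keys, start, stop):
    # Binary search for both ends of the range instead of a full scan.
    lo = bisect.bisect_left(sorted_keys, start)
    hi = len(sorted_keys) if stop is None else bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]

keys = sorted(make_row_key(d, p) for d, p in [
    ("example.com", "/index.html"),
    ("example.com", "/about.html"),
    ("example.org", "/index.html"),
    ("another.net", "/"),
])

prefix = b"example.com|"
hits = range_scan(keys, prefix, prefix_stop_key(prefix))
print(hits)
```

With hashed keys the same domain's pages would be scattered across the key space, so there would be no single range to seek to.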

I hope these use cases give you some picture of when it is better to use HBase and when Cassandra.

Related questions

How to use hbase as a source for hadoop streaming jobs

Why HBase is a better choice than Cassandra with Hadoop?

1 big Hadoop and Hbase cluster vs 1 Hadoop cluster + 1 Hbase cluster

cassandra and hadoop - realtime vs batch

elasticsearch vs hbase/hadoop for realtime statistics

realtime querying/aggregating millions of records - hadoop? hbase? cassandra?

Map only jobs in spark (vs hadoop streaming)

Hadoop Streaming job vs regular jobs?

Hadoop vs Cassandra: Which is better for the following scenario?

Hadoop and HBase

solution1 0 ACCEPTED 2015-11-05 13:40:52