
Cassandra vs HBase for Hadoop jobs

What are the advantages of Cassandra over HBase when it comes to MapReduce jobs?

I have a lot of small files that I would like to move from HDFS into a database, and those files would then be the input for MapReduce jobs. A job would not read all the files, only those of a certain user, so possibly a whole row or at least a column family. I might also select files from a certain time period.

I know that HBase is "the Hadoop database", so I expect it to integrate well with what I need, but I have also read that Cassandra performs much better. What I would like to know is how the two compare when used as input for MapReduce jobs. Is Cassandra's performance still much better than HBase's in that case?

I must emphasize that I'm not looking for a general comparison of HBase and Cassandra, only for the concrete case of MapReduce jobs. Questions like this do not address performance for MapReduce jobs specifically. I'm also looking for up-to-date information (the question I mentioned is from 2011, and I believe things may have changed since then).

Both databases have very good read and write performance. For bulk reads, HBase possibly performs slightly better than Cassandra. But there are two use cases where HBase will be significantly faster than Cassandra because of its design.

First, when the MapReduce job needs only a portion of the data selected by column name, e.g. HTML pages and some information parsed from them. You put the HTML in one column family and the parsed information in another. Different column families are stored in different files in HDFS, so reading one does not require reading the other. This gives a significant performance benefit when you need only the parsed data, which occupies several times less disk space than the HTML. With Cassandra you would need to read the whole table.
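To make the column family point concrete, here is a minimal toy model in Python. It is not HBase client code: `ToyTable`, `scan_family` and `scan_all` are hypothetical names made up for illustration, and each inner dict only stands in for the separate files a real store would keep per column family. It simply counts how many bytes a scan has to touch.

```python
from collections import defaultdict


class ToyTable:
    """Toy store that keeps every column family in its own container."""

    def __init__(self):
        # family name -> {row_key: value}; one inner dict stands in for
        # the separate set of files a column family would have on disk.
        self.families = defaultdict(dict)

    def put(self, row_key, family, value):
        self.families[family][row_key] = value

    def scan_family(self, family):
        """Return (rows, bytes_read), touching only the requested family."""
        rows = dict(self.families[family])
        bytes_read = sum(len(v) for v in rows.values())
        return rows, bytes_read

    def scan_all(self):
        """Return the bytes read when every family has to be scanned."""
        return sum(len(v) for fam in self.families.values() for v in fam.values())


table = ToyTable()
table.put("example.com/a", "html", b"<html>" + b"x" * 1000 + b"</html>")
table.put("example.com/a", "parsed", b"title=A")
table.put("example.com/b", "html", b"<html>" + b"y" * 2000 + b"</html>")
table.put("example.com/b", "parsed", b"title=B")

parsed_rows, parsed_bytes = table.scan_family("parsed")
total_bytes = table.scan_all()
print(f"parsed only: {parsed_bytes} bytes, everything: {total_bytes} bytes")
```

In this sketch the job that needs only the `parsed` family reads a small fraction of what a full scan reads, which is the effect described above. Real numbers depend on your data and cluster.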

Second, when you need to access data ordered by row key, or only the part of a table defined by that order, e.g. reading the HTML pages of a particular domain. With HBase you can use the concatenation of domain and URL as the row key. HBase has a good balancer for unhashed row keys. Cassandra does not, so you either have to use some trick for balancing in this case or scan the whole table.
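The row key idea can be sketched the same way. This is again a toy model and not HBase client code: `make_row_key`, `prefix_range` and `range_scan` are hypothetical helpers, and a sorted Python list stands in for a table whose rows are kept in key order. The point is that once keys are sorted, all pages of one domain form one contiguous range, so only that range has to be read.

```python
import bisect


def make_row_key(domain, url_path):
    """Concatenate domain and path so pages of one domain sort together."""
    return f"{domain}{url_path}".encode("utf-8")


def prefix_range(prefix):
    """Return (start, stop) covering all keys that begin with prefix.

    stop is exclusive; None means there is no upper bound.
    """
    stop = bytearray(prefix)
    # Drop trailing 0xff bytes, then increment the last remaining byte.
    while stop and stop[-1] == 0xFF:
        stop.pop()
    if not stop:
        return prefix, None
    stop[-1] += 1
    return prefix, bytes(stop)


def range_scan(sorted_keys, start, stop):
    """Return the keys in [start, stop) using binary search on sorted keys."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = len(sorted_keys) if stop is None else bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


keys = sorted(
    make_row_key(domain, path)
    for domain, path in [
        ("example.com", "/index.html"),
        ("example.com", "/about.html"),
        ("example.org", "/index.html"),
        ("another.net", "/home"),
    ]
)

start, stop = prefix_range(b"example.com/")
hits = range_scan(keys, start, stop)
print(hits)
```

With hashed keys the pages of one domain would be scattered across the whole key space, so the same query would have to look at every row.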

I hope these use cases give you some picture of when it is better to use HBase and when Cassandra.

