
What is the best approach: creating multiple HBase tables or multiple column families in a single HBase table?

My HBase row keys are different, and I also need to aggregate the data and store it separately. In this use case, which is the best approach?


I am refining my question.

Below is my use case.

I am processing weblogs which contain Retailer, Category, and Product clicks.

  1. I am storing the above weblogs in one HBase table (Log), with a separate row key format for each click type and the same column family. For example:

    • A. for Retailer -- IP | DateTime | Sid | Retailer

    • B. for Category -- IP | DateTime | Sid | Retailer | Category

    • C. for Product -- IP | DateTime | Sid | Retailer | Category | Product

  2. From the above table I am calculating daily clicks and storing them in other HBase tables (Retailer_Day_cnt, Category_Day_Cnt, Product_Day_Cnt).
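The three row key formats for the Log table in case 1 could be built along the following lines. This is only a minimal sketch: the `log_row_key` helper, the `yyyyMMddHHmmss` timestamp format, and the sample values are assumptions for illustration and are not taken from the actual tables.

```python
from datetime import datetime

SEP = "|"  # field separator, as in the key layouts listed above


def log_row_key(ip, ts, sid, retailer, category=None, product=None):
    """Build a Log-table row key at retailer, category, or product level.

    Hypothetical helper: the timestamp format is an assumption.
    """
    if product is not None and category is None:
        raise ValueError("a product click also needs a category")
    parts = [ip, ts.strftime("%Y%m%d%H%M%S"), sid, retailer]
    if category is not None:
        parts.append(category)
    if product is not None:
        parts.append(product)
    return SEP.join(parts)


# Made-up sample click, used at all three levels.
ts = datetime(2015, 3, 1, 10, 15, 30)
retailer_key = log_row_key("10.0.0.1", ts, "s42", "acme")
category_key = log_row_key("10.0.0.1", ts, "s42", "acme", "shoes")
product_key = log_row_key("10.0.0.1", ts, "s42", "acme", "shoes", "sku9")
```

All three keys share the same `IP | DateTime | Sid | Retailer` prefix, so rows for one click sort next to each other in the table.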

My question is: what is the best way to store the data in HBase for cases 1 and 2 above, separate HBase tables or separate column families?

Note: In case 1 I am doing only writes, but in case 2 I will do multiple reads and writes.

Thanks in advance, Surendra

From a performance perspective, the fewer column families, the better. All the column families in a table are flushed at the same time, even if some of them hold very little data, which makes flushes less efficient. If your table is write-heavy, this results in a lot of HFiles, which leads to more compactions and more GC pauses, and that can make the whole HBase cluster very slow. So it is better not to use multiple column families unless you really need them, or unless all the column families will hold about the same amount of data.

Find more details here: HBase Book
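The effect of flushing all column families together can be seen in a toy model. The sketch below is not HBase code: `simulate_flushes`, the family names, and the sizes (in arbitrary units) are made up, and it assumes a flush is triggered when the combined in-memory size reaches a threshold, at which point every family with data writes one file.

```python
def simulate_flushes(writes, families, flush_threshold):
    """Toy model: when the combined memstore size reaches the threshold,
    every family that holds data writes one file, however small.

    Returns a dict mapping each family to the sizes of the files it wrote.
    """
    memstore = dict.fromkeys(families, 0)
    files = {family: [] for family in families}
    for family, size in writes:
        memstore[family] += size
        if sum(memstore.values()) >= flush_threshold:
            for name in families:
                if memstore[name] > 0:
                    files[name].append(memstore[name])
                memstore[name] = 0
    return files


# One family receives large writes, the other only tiny ones.
writes = [("big", 30), ("small", 1)] * 10
files = simulate_flushes(writes, ["big", "small"], flush_threshold=100)
# files == {"big": [120, 120], "small": [3, 4]}
```

In this model the `small` family writes as many files as the `big` one, but each file is tiny, and those are the files that later have to be compacted.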

Similar question

This depends on your use case.

If you have the same row key but different data, you can split the data into different column families. But if the row keys are different, put the data into different tables.

It also depends on whether you have a write-once, read-many workload (i.e. low write throughput is OK) or you need high write throughput, and on how your data is distributed. If one column family holds a lot more data (in size) than the rest, it is better to put the column families into different tables.

If you give more details on your use case, I can be more specific.

Row key design is the main challenge in these scenarios. If you can design your row key so that it serves all of your purposes, you can go with different column families; otherwise, multiple tables are the only option. In your case, it seems you are storing the aggregated results in a second table, which must have a different logical row key. So you should go with the two-table approach: the first table stores all the inputs (write once, read many times) and the second table stores the processed/aggregated data.
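To see why the aggregated data ends up with a different logical row key, here is a minimal sketch of deriving a daily counter key from a raw log key. The `day_count_key` helper, the `Retailer | ... | Day` counter key layout, the `yyyyMMddHHmmss` timestamp format, and the sample keys are all assumptions for illustration.

```python
from collections import Counter


def day_count_key(log_key):
    """Map a log row key (IP|DateTime|Sid|Retailer[|Category[|Product]])
    to a hypothetical daily counter key (Retailer[|Category[|Product]]|Day).
    """
    fields = log_key.split("|")
    if not 4 <= len(fields) <= 6:
        raise ValueError("unexpected log key: " + log_key)
    day = fields[1][:8]  # keep yyyyMMdd, drop the time of day
    return "|".join(fields[3:] + [day])


# Made-up retailer-level log keys covering two days.
log_keys = [
    "10.0.0.1|20150301101530|s42|acme",
    "10.0.0.2|20150301111500|s43|acme",
    "10.0.0.1|20150302090000|s44|acme",
]
retailer_day_cnt = Counter(day_count_key(key) for key in log_keys)
# retailer_day_cnt == {"acme|20150301": 2, "acme|20150302": 1}
```

The counter key drops the IP and session fields and truncates the timestamp to a day, so it cannot share rows with the raw log keys. That is what points to a separate table rather than another column family.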

