How to bucket a Hive table with ORC for a complex query?

Question

Maybe this question is too generic but I think it is worth a try.

I am working with a table that has 270 fields. It is partitioned by date (like dt=20180101). However, when we query this table we are essentially doing a full table scan, because the fields we use in the where clause are not dt. I was wondering what the right approach is to enabling bucketing for this table. I could pick one of the where clause fields and bucket on that. For example:

PARTITIONED BY (
  dt INT
)
CLUSTERED BY (
  class
)
INTO 16 BUCKETS

Another approach is to use more than 1 field for bucketing:

PARTITIONED BY (
  dt INT
)
CLUSTERED BY (
  class, other_field, other_field_2
)
INTO 128 BUCKETS

Is it worth bucketing by multiple fields? I guess it will only speed up queries when exactly the same fields are present in the select.

Another question: is it at least worth sorting by multiple fields, so that the file is read sequentially? Like this:

PARTITIONED BY (
  dt INT
)
CLUSTERED BY (
  class
)
SORTED BY (
  other_field, other_field_2
)
INTO 16 BUCKETS

Answer 1

First, if you don't usually query on date and your queries span many dates, then you might want to change your partitioning strategy. Not every query has to target just one or a few dates, but if your queries usually have nothing to do with filtering on 'date', then you should change that!
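As a rough sketch of what a different partitioning strategy could look like, assuming class has only a small number of distinct values (the question does not say), it could be added as a second partition column. The table name events_by_class, the STRING types and the value 'a' are made up for illustration, and the other columns are cut down to two:

```sql
-- Hypothetical table; only dt, class, other_field and other_field_2
-- are names taken from the question. Types are assumed.
CREATE TABLE events_by_class (
  other_field   STRING,
  other_field_2 STRING
)
PARTITIONED BY (
  dt    INT,
  class STRING
)
STORED AS ORC;

-- A filter on a partition column lets Hive skip whole directories
-- instead of scanning every dt partition in full.
SELECT count(*)
FROM events_by_class
WHERE class = 'a';
```

This only pays off when the column has few distinct values. A high-cardinality partition column creates a huge number of small partitions, which is worse than the full scan.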

Second, bucketing splits your data based on a hash of the bucketing columns. It divides each partition into a fixed number of roughly equal-sized files, which helps the MapReduce job running over it handle the data efficiently. But a large number of buckets can also have negative effects, because the bucketing metadata is stored in the Hive metastore. That metadata is read first when you execute a query, and based on the result the actual data (or part of it) is read from the file system. So there is no specific rule for bucketing, neither for how many buckets to use nor for which columns to bucket on.

So you should look into your queries and plan accordingly!
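To make the bucketing part concrete, here is a minimal sketch of a complete table definition built around the first fragment from the question. The table name events_bucketed, the column types and the sample rows are assumptions for illustration only:

```sql
-- Hypothetical table; only the column names come from the question.
-- On Hive 1.x, run "SET hive.enforce.bucketing = true;" before inserting.
-- Hive 2.0 and later always enforce bucketing on insert.
CREATE TABLE events_bucketed (
  class         STRING,
  other_field   STRING,
  other_field_2 STRING
)
PARTITIONED BY (
  dt INT
)
CLUSTERED BY (
  class
)
INTO 16 BUCKETS
STORED AS ORC;

-- Each row goes to the bucket file chosen by hash(class) mod 16
-- inside the dt=20180101 partition directory.
INSERT INTO TABLE events_bucketed PARTITION (dt = 20180101)
VALUES ('a', 'x', 'y'), ('b', 'x', 'z');

-- A query that filters on the bucketing column.
SELECT count(*)
FROM events_bucketed
WHERE dt = 20180101
  AND class = 'a';
```

Whether the last query actually reads fewer files depends on your Hive version and execution engine, so check the plan with EXPLAIN before settling on a bucket count.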

Third, sorting does help at query time, as it is easier for the engine to push down filtering and sorting criteria. But when you enable sorting on a table, ingestion becomes a little slower than without it. In a query-heavy system, though, it is bound to pay off.
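A sketch of the sorted variant from the question, again with an assumed table name (events_sorted), assumed STRING types and a made-up filter value 'x':

```sql
-- Hypothetical table; only the column names come from the question.
CREATE TABLE events_sorted (
  class         STRING,
  other_field   STRING,
  other_field_2 STRING
)
PARTITIONED BY (
  dt INT
)
CLUSTERED BY (
  class
)
SORTED BY (
  other_field, other_field_2
)
INTO 16 BUCKETS
STORED AS ORC;

-- Let the ORC reader use its min/max statistics to skip data
-- that cannot match the predicate.
SET hive.optimize.index.filter = true;

SELECT count(*)
FROM events_sorted
WHERE dt = 20180101
  AND other_field = 'x';
```

ORC stores min/max statistics per stripe and per row group. When each bucket file is sorted by other_field, equal values sit next to each other, so those statistics rule out much more of the file for a filter on that column.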

So all in all, these three are optimization techniques with no hard rules for applying them. It depends purely on your use case!

Hope this helps!!
