
Sort Order in HBase with Pig/Piglatin in Java

I created an HBase table in the shell and added some data. According to http://hbase.apache.org/book/dm.sort.html, the data is sorted first by row key and then by column. So I tried something in the HBase shell:

hbase(main):013:0> put 'mytable', 'key1', 'cf:c', 'val'
0 row(s) in 0.0110 seconds

hbase(main):011:0> put 'mytable', 'key1', 'cf:d', 'val'
0 row(s) in 0.0060 seconds

hbase(main):012:0> put 'mytable', 'key1', 'cf:a', 'val'
0 row(s) in 0.0060 seconds


hbase(main):014:0> get 'mytable', 'key1'
COLUMN                CELL                                                      
 cf:a                 timestamp=1376468325426, value=val                        
 cf:c                 timestamp=1376468328318, value=val                        
 cf:d                 timestamp=1376468321642, value=val                        
3 row(s) in 0.0570 seconds

Everything looks fine. I got the expected order, a -> c -> d.

Then I tried the same with Apache Pig in Java:

pigServer.registerQuery("mytable_data = load 'hbase://mytable' using org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf', '-loadKey true') as (rowkey:chararray, columncontent:map[]);");
printAlias("mytable_data"); // my own helper, which iterates over the keys

I got this result:

(key1,[c#val,d#val,a#val])

So now the order is c -> d -> a. That seems a little odd to me; shouldn't it be the same as in HBase? The order matters to me because I transform the map into a bag afterwards and then join it with other tables. If both inputs were sorted, I could use a merge join without sorting the two datasets first. Does anyone know how to get a sorted map (or bag) of the columns?

You're fundamentally misunderstanding something: the HBaseStorage backend loads each row as a single Tuple. You've told Pig to load the column family cf as a map[], which is exactly what Pig is doing. A Pig map under the hood is just a java.util.HashMap, which makes no guarantee about iteration order.
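To illustrate the point about map ordering, here is a minimal plain-Java sketch with no Pig involved. The class name ColumnOrderDemo and the helper sortedKeys are made up for this example. It shows that a HashMap promises nothing about iteration order, while copying the entries into a TreeMap yields the keys in ascending String order. For ASCII qualifiers such as a, c, d this matches HBase's byte-wise ordering.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical demo class, not part of Pig or HBase.
class ColumnOrderDemo {
    // Returns the map's keys in ascending String order by copying into a TreeMap.
    static List<String> sortedKeys(Map<String, Object> columns) {
        return new ArrayList<>(new TreeMap<>(columns).keySet());
    }

    public static void main(String[] args) {
        // Stand-in for the map Pig builds from column family 'cf'.
        Map<String, Object> columns = new HashMap<>();
        columns.put("c", "val");
        columns.put("d", "val");
        columns.put("a", "val");

        // HashMap makes no promise about this order.
        System.out.println("HashMap order: " + columns.keySet());
        // TreeMap iterates in key order.
        System.out.println("Sorted order:  " + sortedKeys(columns));
    }
}
```

Whatever order you see from the HashMap is an implementation detail, so don't build a merge join on it.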

There is currently no built-in way in Pig to convert a map to a bag, but that should be a trivial UDF to write. Barring the null checks and other boilerplate, the body is something like

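// Needs java.io.IOException, java.util.Map and org.apache.pig.data.{BagFactory, DataBag, Tuple, TupleFactory}.
// bagFactory and tupleFactory are fields, e.g. BagFactory.getInstance() and TupleFactory.getInstance().
public DataBag exec(Tuple input) throws IOException {
    DataBag resultBag = bagFactory.newDefaultBag();
    // Pig hands a map[] field to the UDF as a java.util.Map keyed by String.
    Map<String, Object> map = (Map<String, Object>) input.get(0);
    for (Map.Entry<String, Object> entry : map.entrySet()) {
        // One (key, value) tuple per column.
        Tuple t = tupleFactory.newTuple();
        t.append(entry.getKey());
        t.append(entry.getValue().toString());
        resultBag.add(t);
    }
    return resultBag;
}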

With that you can generate a bag{(k:chararray, v:chararray)}, use FLATTEN to get one (k:chararray, v:chararray) tuple per column, and ORDER those by k.
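As a rough sketch of those steps, the snippet below only builds the Pig Latin statements as strings, so it runs without Pig on the classpath. The names are assumptions: com.example.MapToBag stands for wherever you put the UDF above, and pairs and sorted_pairs are arbitrary aliases. In your program you would register the UDF jar and pass each statement to pigServer.registerQuery, the same way you registered the load statement.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; it only assembles the Pig Latin text.
class SortedColumnsScript {
    // Statements that turn the loaded map into one (rowkey, k, v) tuple per column
    // and sort the result. 'com.example.MapToBag' is an assumed class name for the UDF.
    static List<String> statements() {
        return Arrays.asList(
            "DEFINE MapToBag com.example.MapToBag();",
            "pairs = FOREACH mytable_data GENERATE rowkey, "
                + "FLATTEN(MapToBag(columncontent)) AS (k:chararray, v:chararray);",
            "sorted_pairs = ORDER pairs BY rowkey, k;");
    }

    public static void main(String[] args) {
        // With a PigServer available, each line would go to pigServer.registerQuery(...).
        for (String statement : statements()) {
            System.out.println(statement);
        }
    }
}
```

Ordering by rowkey first and then by k mirrors the row-then-column order you saw in the HBase shell.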

As for whether there is a way to get the data already sorted: generally no. If the number of fields in the column family is not constant, or the fields are not always the same or defined, your only options are

  • transforming the map to a bag of tuples and then sorting them
  • or writing a custom LoadFunc which takes a table and a column family and emits a tuple per KeyValue pair scanned. HBase will ensure the ordering and give you the data in the sorted order you see in the shell, but note that the order is only guaranteed upon loading. Any further transformation you apply can destroy it.
