
Querying HBase efficiently

I'm using Java as a client for querying HBase.

My HBase table is set up like this:

ROWKEY     |     HOST     |     EVENT
-----------|--------------|----------
21_1465435 | host.hst.com |  clicked
22_1463456 | hlo.wrld.com |  dragged
    .             .             .
    .             .             .
    .             .             .

The first thing I need to do is get a list of all ROWKEYs that have host.hst.com associated with them.

I can create a scanner on the HOST column, and for each row whose column value equals host.hst.com I add the corresponding ROWKEY to the list. That seems reasonably efficient: O(n) to go through all rows.

Now comes the hard part. For each ROWKEY in the list, I need to get the corresponding EVENT.

If I use a normal GET command to get the cell at (ROWKEY, EVENT), I believe a scanner is created on EVENT, which takes O(n) time to find the correct cell and return the value. That is pretty bad time complexity for each individual ROWKEY. Combining the two gives us O(n^2).

Is there a more efficient way of going about this?

Thanks a lot for any help in advance!

What is your n here? With the ROWKEY in hand (I presume you mean the HBase rowkey, not some handcrafted one), the lookup is fast and easy for HBase. Consider it to be effectively O(1).

If instead ROWKEY is an actual column you created, then that is your issue. Use the HBase-provided rowkey instead.

So let's move on, assuming you either already use the HBase-provided rowkey properly or have fixed your structure to do so.

In that case you can simply create a separate get for each (rowkey, EVENT) value as follows:

Perform a `get` with the given `rowkey`. 
In your result then filter out EVENT in <yourEventValues for that rowkey>

So you will end up fetching all recent (latest timestamp) entries for the given rowkey. This is presumably small compared to n. The filtering is then a fast operation on one column.

You can also speed this up by doing a batched multiget. The savings come from fewer round trips to the region servers (reads go to the region servers, not to the HBase master) and from less per-request overhead on the servers.

Update: Thanks to the OP, I understand the situation more clearly. I suggest simply using the "host | " as the rowkey. Then you can do a range scan and obtain the entries from a single Get / Scan.
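To make the idea concrete, here is a minimal sketch in plain Java that does not touch HBase at all: a TreeMap stands in for the table, because HBase also keeps rows sorted by rowkey. The separator '|', the choice of the original ROWKEY as the suffix, the class and method names, and the third sample row are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class HostKeySketch {
    // Hypothetical separator between the host and the rest of the key.
    static final char SEP = '|';

    // Builds a composite rowkey: host first, then the original id.
    static String rowKey(String host, String id) {
        return host + SEP + id;
    }

    // Returns the events of every row whose key starts with host + SEP.
    // The TreeMap is only a stand-in for HBase's sorted rowkeys.
    static List<String> eventsForHost(TreeMap<String, String> table, String host) {
        String start = host + SEP;              // inclusive start of the range
        String stop = host + (char) (SEP + 1);  // exclusive stop: first key past the prefix
        return new ArrayList<>(table.subMap(start, true, stop, false).values());
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = new TreeMap<>();
        table.put(rowKey("host.hst.com", "21_1465435"), "clicked");
        table.put(rowKey("hlo.wrld.com", "22_1463456"), "dragged");
        table.put(rowKey("host.hst.com", "23_1470000"), "scrolled"); // made-up row
        System.out.println(eventsForHost(table, "host.hst.com")); // [clicked, scrolled]
    }
}
```

With this key layout, all rows for one host sit next to each other, so one bounded range read replaces the full scan plus the per-row gets.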

Another update

HBase supports range scans based on prefixes of the rowkey. So if you have foobarRow1, foobarRow2, etc., you can do a range scan on (foobarRow, foobarRowz) and it will find all of the rows whose rowkeys start with foobarRow followed by any alphanumeric characters.
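Appending "z" to the prefix works for alphanumeric suffixes but would miss keys whose next byte sorts at or after 'z'. A tighter stop row can be computed by incrementing the last byte of the prefix. The following is a small stdlib-only sketch of that computation; the class and method names are hypothetical, and the result would be passed as the exclusive stop row of a scan.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class PrefixStopRow {
    // Returns the smallest key that sorts after every key starting with prefix.
    // An empty array means there is no such key, i.e. "scan to the end".
    static byte[] stopRowForPrefix(byte[] prefix) {
        int end = prefix.length;
        // Trailing 0xFF bytes cannot be incremented, so drop them first.
        while (end > 0 && prefix[end - 1] == (byte) 0xFF) {
            end--;
        }
        if (end == 0) {
            return new byte[0];
        }
        byte[] stop = Arrays.copyOf(prefix, end);
        stop[end - 1]++; // bump the last remaining byte
        return stop;
    }

    public static void main(String[] args) {
        byte[] start = "foobarRow".getBytes(StandardCharsets.UTF_8);
        byte[] stop = stopRowForPrefix(start);
        System.out.println(new String(stop, StandardCharsets.UTF_8)); // foobarRox
    }
}
```

Scanning from foobarRow (inclusive) to foobarRox (exclusive) covers every rowkey that begins with foobarRow, whatever bytes follow.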

Take a look at this HBase (Easy): How to Perform Range Prefix Scan in hbase shell

Here is some illustrative code:

SingleColumnValueFilter filter = new SingleColumnValueFilter(
   Bytes.toBytes("columnfamily"),
   Bytes.toBytes("storenumber"),
   CompareFilter.CompareOp.NOT_EQUAL,
   Bytes.toBytes(15)
);
filter.setFilterIfMissing(true);
Scan scan = new Scan(
   Bytes.toBytes("20110103-1"),
   Bytes.toBytes("20110105-1")
);
scan.setFilter(filter);

Notice that 20110103-1 and 20110105-1 provide the range of rowkeys to search: the start row is inclusive and the stop row is exclusive.

First of all, your rowkey design should fit the access pattern you intend to query with.

1) Get is good if you know upfront which rowkeys you need to access

In that case you can use a method like the one below; it returns an array of Result.

/**
     * Method getDetailRecords.
     * 
     * @param listOfRowKeys List<String>
     * @return Result[]
     * @throws IOException
     */
    private Result[] getDetailRecords(final List<String> listOfRowKeys) throws IOException {
        final HTableInterface table = HBaseConnection.getHTable(TBL_DETAIL);
        final List<Get> listOFGets = new ArrayList<Get>();
        Result[] results = null;
        try {
            // Prepare a batch of Gets, one per row key.
            for (final String rowkey : listOfRowKeys) {
                // System.err.println("get 'yourtablename', '" + saltIndexPrefix + rowkey + "'");
                final Get get = new Get(Bytes.toBytes(saltedRowKey(rowkey)));
                get.addColumn(COLUMN_FAMILY, Bytes.toBytes(yourcolumnname));
                listOFGets.add(get);
            }
            results = table.get(listOFGets);

        } finally {
            table.close();
        }
        return results;
    }

2) In my experience, HBase Scan performance is rather poor if the rowkey design is not a good fit. If you opt for a scan in the scenario you describe, I recommend the following.

FuzzyRowFilter (see hbase-the-definitive). This has been really useful in our case. We have used it from bulk clients such as MapReduce jobs as well as from standalone HBase clients.

This filter acts on row keys, but in a fuzzy manner. It needs a list of row keys that should be returned, plus an accompanying byte[] array that signifies the importance of each byte in the row key. The constructor is as such:

FuzzyRowFilter(List<Pair<byte[], byte[]>> fuzzyKeysData)

The fuzzyKeysData specifies the mentioned significance of a row key byte, by taking one of two values:

0 Indicates that the byte at the same position in the row key must match as-is.
1 Means that the corresponding row key byte does not matter and is always accepted.

Example: Partial Row Key Matching

A possible example is matching partial keys, but not from left to right, rather somewhere inside a compound key. Assume a row key format of <userId>_<actionId>_<year>_<month>, with fixed length parts, where <userId> is 4, <actionId> is 2, <year> is 4, and <month> is 2 bytes long. The application now requests all users that performed a certain action (encoded as 99) in January of any year. Then the pair for row key and fuzzy data would be the following:

row key = "????_99_????_01", where the "?" is an arbitrary character, since it is ignored.
fuzzy data = "\x01\x01\x01\x01\x00\x00\x00\x00\x01\x01\x01\x01\x00\x00\x00"

In other words, the fuzzy data array instructs the filter to find all row keys matching "????_99_????_01", where the "?" will accept any character.
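The mask semantics can be checked without a cluster. Below is a simplified, stdlib-only sketch of the per-byte comparison, assuming underscore separators in the key so that pattern and mask are both 15 bytes long. It is not the FuzzyRowFilter implementation (which also computes hints for skipping ahead); the class name, method names, and sample keys are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class FuzzyMatchSketch {
    // mask[i] == 0: row[i] must equal pattern[i]; mask[i] == 1: any byte is accepted.
    // Only the first pattern.length bytes of the row are inspected.
    static boolean matches(byte[] row, byte[] pattern, byte[] mask) {
        if (row.length < pattern.length || pattern.length != mask.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if (mask[i] == 0 && row[i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }

    static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] pattern = ascii("????_99_????_01");
        // 4 wildcard bytes, "_99_" fixed, 4 wildcard bytes, "_01" fixed.
        byte[] mask = {1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0};
        System.out.println(matches(ascii("0042_99_2015_01"), pattern, mask)); // true
        System.out.println(matches(ascii("0042_98_2015_01"), pattern, mask)); // false
        System.out.println(matches(ascii("0042_99_2015_02"), pattern, mask)); // false
    }
}
```

The same pattern and mask arrays are what would go into the Pair handed to the FuzzyRowFilter constructor.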

An advantage of this filter is that it can likely compute the next matching row key when it comes to an end of a matching one. It implements the getNextCellHint() method to help the servers in fast-forwarding to the next range of rows that might match. This speeds up scanning, especially when the skipped ranges are quite large. Example 4-12 uses the filter to grab specific rows from a test data set.

Example: filtering by fuzzy row key

List<Pair<byte[], byte[]>> keys = new ArrayList<Pair<byte[], byte[]>>();
keys.add(new Pair<byte[], byte[]>(
  Bytes.toBytes("row-?5"), new byte[] { 0, 0, 0, 0, 1, 0 }));
Filter filter = new FuzzyRowFilter(keys);

Scan scan = new Scan()
  .addColumn(Bytes.toBytes("colfam1"), Bytes.toBytes("col-5"))
  .setFilter(filter);
ResultScanner scanner = table.getScanner(scan);
for (Result result : scanner) {
  System.out.println(result);
}
scanner.close();

The example code also adds a filtering column to the scan, just to keep the output short:

Adding rows to table... Results of scan:

keyvalues={row-05/colfam1:col-01/1/Put/vlen=9/seqid=0,
           row-05/colfam1:col-02/2/Put/vlen=9/seqid=0,
           ...
           row-05/colfam1:col-09/9/Put/vlen=9/seqid=0,
           row-05/colfam1:col-10/10/Put/vlen=9/seqid=0}
keyvalues={row-15/colfam1:col-01/1/Put/vlen=9/seqid=0,
           row-15/colfam1:col-02/2/Put/vlen=9/seqid=0,
           ...
           row-15/colfam1:col-09/9/Put/vlen=9/seqid=0,
           row-15/colfam1:col-10/10/Put/vlen=9/seqid=0}

The test code wiring adds 20 rows to the table, named row-01 to row-20. We want to retrieve all the rows that match the pattern row-?5, in other words all rows that end in the number 5. The output above confirms the correct result.
