Complex regular expression with Async HBase Scanner

I have started using the Async HBase library, and I'm trying to use TableInputFormat. I do not need all the rows for my MapReduce job, so I modified the code to set a regular expression on the scanner object in TableRecordReader.

String regEx = getRegEx(conf); // my function that builds the regular expression from the input given in the conf object
System.out.println("RegEx = " + regEx);
scanner.setKeyRegexp(regEx);

Basically, I just append each required row key to the regular expression, separated by an OR ( | ). This works when I want to fetch a few hundred rows. In some scenarios, when I want to fetch more rows, the regular expression becomes very long (around 600,000), and in that case the scanner stops working.

I'm aware that filtering of row keys by regular expression is done on the server side, and that a complex regular expression may not work.

  • So what can be done to make the scanner fetch only the required rows?
  • Is it efficient to use more than one scanner, so that each can be given a part of the regular expression?
  • Or is it more efficient to use a single scanner to get all the rows and then iterate through them to pick out the required ones?

FYI : The total number of rows in my table will be in the range of tens of millions.

That depends on the length of your row keys. You are most likely hitting the maximum string length, which in theory is 2,147,483,647 characters, but in practice is limited by the memory available to the JVM.

Just for illustration purposes: 2,147,483,647 characters at two bytes per UTF-16 char would require about 4 GB of dedicated memory just to hold the string. You might want to think about splitting the string up: build one regular expression out of every 1,000 or so row keys, run a scan for each, and combine those results to get the rows you're looking for.
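A minimal sketch of that splitting idea, in plain Java with no HBase dependency. The class name `RowKeyRegexChunker`, the method `buildRegexBatches`, and the batch size of 1,000 are all made up for illustration and are not part of Async HBase. It assumes the row keys can be handled as Java strings, and it escapes each key with `Pattern.quote` so that any regex metacharacters in a key are matched literally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class RowKeyRegexChunker {

    /**
     * Splits rowKeys into groups of at most batchSize keys and builds one
     * anchored alternation regex per group, e.g. ^(?:\Qkey1\E|\Qkey2\E)$.
     */
    public static List<String> buildRegexBatches(List<String> rowKeys, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<String> batches = new ArrayList<>();
        for (int start = 0; start < rowKeys.size(); start += batchSize) {
            int end = Math.min(start + batchSize, rowKeys.size());
            StringBuilder regex = new StringBuilder("^(?:");
            for (int i = start; i < end; i++) {
                if (i > start) {
                    regex.append('|');
                }
                // Quote the key so characters like '.' or '|' are taken literally.
                regex.append(Pattern.quote(rowKeys.get(i)));
            }
            regex.append(")$");
            batches.add(regex.toString());
        }
        return batches;
    }

    public static void main(String[] args) {
        // Hypothetical keys, only to show the batching.
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 2500; i++) {
            keys.add("row-" + i);
        }
        List<String> batches = buildRegexBatches(keys, 1000);
        System.out.println("batches = " + batches.size());
        // Each regex would then go to its own scan, e.g. scanner.setKeyRegexp(regex),
        // and the rows returned by all scans would be merged on the client.
    }
}
```

Each scan still walks its whole key range on the server, so this only keeps the individual expressions small; whether it beats a single unfiltered scan with a client-side lookup would need to be measured on the actual table.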
