
HBase scan is returning deleted rows

I am using a SingleColumnValueFilter to return a list of rows that I want deleted:

SingleColumnValueFilter fileTimestampFilter = new SingleColumnValueFilter(
         Bytes.toBytes("a"),
         Bytes.toBytes("date"),
         CompareFilter.CompareOp.GREATER,
         Bytes.toBytes("20140101000000")
         );

I then create a Delete object for each returned row and delete the column:

Delete delete = new Delete(Bytes.toBytes(rowKey));
delete.deleteColumn(Bytes.toBytes("a"), Bytes.toBytes("date"));
htable.delete(delete);

The retrieval code is

private List<String> getRecordsToDelete(long maxResultSize)
{
  ResultScanner rs = null;
  HTableInterface table = null;
  List<String> keyList = new ArrayList<String>();
  try
  {
    log.debug("Retrieving records");      
    HbaseConnection hbaseConnectionConfig = myConfig.getHbaseConnection();
    Configuration configuration = getHbaseConfiguration(hbaseConnectionConfig);
    table = new HTable(configuration, "mytable");
    FilterList list = new FilterList(FilterList.Operator.MUST_PASS_ALL);
    Filter filter = HbaseDao.getFilter();
    list.addFilter(filter);
    list.addFilter(new PageFilter(maxResultSize));
    Scan scan = new Scan();
    scan.setFilter(list);
    //scan.setMaxResultSize(maxResultSize);
    //scan.setCaching(1);
    //scan.setCacheBlocks(false);
    //log.debug("Scan raw? = " + scan.isRaw());
    //scan.setRaw(false);
    rs = table.getScanner(scan);      
    Iterator<Result> iterator = rs.iterator();      
    while (iterator.hasNext())
    {        
      Result result = iterator.next();        
      String key = Bytes.toString(result.getRow());
      log.debug("**************** f key = " + key); //the same keys are always added here
      keyList.add(key);        
    }
    log.debug("Done processing retrieval of records to delete Size = " + keyList.size());
  }
  catch (Exception ex)
  {
    log.error("Unable to process retrieval of records.", ex);
  }
  finally
  {
    try
    {
      if (rs != null)
      {
        rs.close();
      }
      if (table != null)
      {
        table.close();
      }
    }
    catch (IOException ioEx)
    {
      //do nothing
      log.error(ioEx);
    }
  }
  return keyList;
}

This task is scheduled, and when it runs again, it retrieves the same rows. I understand that HBase marks rows for deletion and only physically deletes them after a major compaction. If I query the row via the HBase shell in between runs of my task, the column has definitely been deleted. Why is my Scan returning the same rows on subsequent runs of this task?

Thanks in advance!

It has nothing to do with major compactions (they run roughly every 24 hours by default). When you delete a row, the deleted data is ignored by HBase until it is physically removed (on major compaction). Just note that if you don't have autoflush enabled, you'll have to flush your client buffer manually by calling htable.flushCommits() (autoflush is on by default).

Your problem is probably caused by the fact that you're only deleting a:date while your rows have other columns that are still being read; those rows pass the filter because letting rows through when the column is missing is the SingleColumnValueFilter's default behaviour.


If you want to delete the whole row, just remove the delete.deleteColumn(Bytes.toBytes("a"), Bytes.toBytes("date")); call so the Delete applies to the entire row, not just that one column.
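As a sketch, the whole-row delete looks like this (the surrounding class and method are hypothetical wrappers; the HBase calls are the ones from the question):

```java
import java.io.IOException;

import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.util.Bytes;

public class WholeRowDelete {
    // Hypothetical helper: with no deleteColumn call, the Delete
    // tombstones every column in the row, not just a:date.
    static void deleteRow(HTableInterface htable, String rowKey) throws IOException {
        Delete delete = new Delete(Bytes.toBytes(rowKey));
        htable.delete(delete);
    }
}
```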


If you just want to delete the a:date column while keeping the rest of the row untouched, set the filterIfMissing flag so that rows where a:date is missing (because it has been deleted) no longer go through: filter.setFilterIfMissing(true);

Or, for best performance, add just that column to the scan, which prevents the other columns from being read at all: scan.addColumn(Bytes.toBytes("a"), Bytes.toBytes("date"));
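Putting the last two suggestions together, the scan setup might look like this (a sketch reusing the family, qualifier, and comparison value from the question):

```java
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanSetup {
    static Scan buildScan() {
        SingleColumnValueFilter filter = new SingleColumnValueFilter(
                Bytes.toBytes("a"),
                Bytes.toBytes("date"),
                CompareFilter.CompareOp.GREATER,
                Bytes.toBytes("20140101000000"));
        // Skip rows that have no a:date value (e.g. because it was deleted).
        filter.setFilterIfMissing(true);

        Scan scan = new Scan();
        // Only read a:date; other columns in the row are never fetched.
        scan.addColumn(Bytes.toBytes("a"), Bytes.toBytes("date"));
        scan.setFilter(filter);
        return scan;
    }
}
```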


On a side note, please notice that list.addFilter(new PageFilter(maxResultSize)); will retrieve up to maxResultSize results from each region of your table. You have to enforce the limit manually by breaking out of the iterator loop once keyList reaches maxResultSize.
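The client-side cap described above can be sketched as a small plain-Java helper (the class and method names are hypothetical, not from the question's code):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class PageLimit {
    // PageFilter(n) is applied per region, so the scanner can return more
    // than n rows in total; cap the overall count on the client side.
    public static List<String> takeUpTo(Iterator<String> keys, long maxResultSize) {
        List<String> out = new ArrayList<>();
        while (keys.hasNext() && out.size() < maxResultSize) {
            out.add(keys.next());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("r1", "r2", "r3", "r4", "r5");
        System.out.println(takeUpTo(rows.iterator(), 3)); // prints [r1, r2, r3]
    }
}
```

In the question's loop this amounts to adding `&& keyList.size() < maxResultSize` to the `while` condition.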

One more tip: when logging for debugging purposes, log the full Result object so you can see exactly what's inside it.
