
How do I diff two tables in HBase?

I am trying to compare two different tables in HBase so that I can automate the validation of some ETL processes that we use to move data in HBase. What's the best way to compare two tables in HBase?

My use case is below:

What I am trying to do is create one table that will be my expected output. This table will contain all of the data that I expect to be created by executing the team's code against an input file. I will then take the diff between the actual output table and the expected output table to verify the integrity of the component under test.

I don't know of anything out of the box, but you can write a multi-table map/reduce job.

The mappers just emit the row key from each table, with the value being all of that row's HBase key-values plus the table name. The reducer then checks that it has two records for each key and compares their key-values. When there is only one record for a key, the reducer can tell from the table name which table is out of sync.
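The mapper/reducer logic described above can be sketched in plain Python, without Hadoop or HBase, to show what each stage is responsible for. This is only an illustration under stated assumptions: each table is modelled as a dict of row key to a dict of cells, and the names `map_table`, `reduce_key`, and `diff_tables` are hypothetical, not part of any HBase or MapReduce API. A real job would read the two tables with scans and let the framework do the grouping by key.

```python
from collections import defaultdict


def map_table(table_name, rows):
    """Mapper: emit (row_key, (table_name, cells)) for every row of one table."""
    for row_key, cells in rows.items():
        yield row_key, (table_name, cells)


def reduce_key(row_key, records, expected_name, actual_name):
    """Reducer: compare the records collected for a single row key."""
    by_table = dict(records)  # table name -> cells
    if expected_name not in by_table:
        return (row_key, "only in " + actual_name)
    if actual_name not in by_table:
        return (row_key, "only in " + expected_name)
    if by_table[expected_name] != by_table[actual_name]:
        return (row_key, "cells differ")
    return None  # both tables agree on this row


def diff_tables(expected, actual, expected_name="expected", actual_name="actual"):
    """Run the map, shuffle, and reduce steps in memory (assumed table model)."""
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for name, rows in ((expected_name, expected), (actual_name, actual)):
        for key, record in map_table(name, rows):
            grouped[key].append(record)
    results = []
    for key in sorted(grouped):
        outcome = reduce_key(key, grouped[key], expected_name, actual_name)
        if outcome is not None:
            results.append(outcome)
    return results


# Made-up sample data: row2 has a changed cell, row3 and row4 exist on one side only.
expected = {"row1": {"cf:a": "1"}, "row2": {"cf:a": "2"}, "row3": {"cf:a": "3"}}
actual = {"row1": {"cf:a": "1"}, "row2": {"cf:a": "9"}, "row4": {"cf:a": "4"}}
mismatches = diff_tables(expected, actual)
print(mismatches)
```

An empty result means the actual table matches the expected one, which is the condition an automated ETL validation would assert on.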

I know this question is a little old, but how large are the tables? If they both fit into memory, you could load them into Pig using HBaseStorage and then use Pig's built-in DIFF function to compare the resulting bags.

According to the docs, this will still work with large tables that don't fit into memory, but it will be extremely slow.

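-- Load both datasets with the same schema. This example reads delimited files
-- with PigStorage; to read HBase tables directly, use HBaseStorage as the loader.
dataset1 = LOAD '/path/to/dataset1' USING PigStorage('<your delimiter>') AS (a:chararray, b:chararray, c:chararray, d:chararray);
dataset2 = LOAD '/path/to/dataset2' USING PigStorage('<your delimiter>') AS (a:chararray, b:chararray, c:chararray, d:chararray);

-- Group the two datasets together on all four fields.
dataset3 = COGROUP dataset1 BY (a, b, c, d), dataset2 BY (a, b, c, d);

-- For each group, return the tuples that appear in one bag but not the other.
dataset4 = FOREACH dataset3 GENERATE DIFF(dataset1, dataset2);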
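To make the result of the COGROUP plus DIFF steps easier to picture, here is a small Python sketch of the same idea. It assumes DIFF behaves as the Pig docs describe it, returning the tuples that are in one bag but not the other, and it flattens the per-group bags into a single list, whereas Pig would emit one (possibly empty) bag per group. The function name `cogroup_diff` and the sample tuples are made up for illustration.

```python
from collections import defaultdict


def cogroup_diff(dataset1, dataset2):
    """Group both datasets on the full tuple, then diff the two bags per group."""
    groups = defaultdict(lambda: ([], []))  # key -> (bag from dataset1, bag from dataset2)
    for row in dataset1:
        groups[row][0].append(row)
    for row in dataset2:
        groups[row][1].append(row)

    differences = []
    for key in sorted(groups):
        bag1, bag2 = groups[key]
        # Tuples present in exactly one of the two bags (set-style difference).
        differences.extend(sorted(set(bag1) ^ set(bag2)))
    return differences


# Made-up sample rows with fields (a, b, c, d); the second row differs in field d.
dataset1 = [("1", "x", "y", "z"), ("2", "x", "y", "z")]
dataset2 = [("1", "x", "y", "z"), ("2", "x", "y", "CHANGED")]
difference = cogroup_diff(dataset1, dataset2)
print(difference)
```

Because the grouping key is the whole tuple, a row that changed in any field shows up twice in the output: once as it appears in the first dataset and once as it appears in the second.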


 