
What is the most effective way to find the distinct values of columns in Hadoop?

I have a file of size 1 TB, and we need to find the distinct values for 4 of its columns. For example, if the file has columns A, B, C, D, E, F, and so on, we need to find all the distinct values in column A and write them to one file in HDFS, and then do the same for B, C, and D.

Note: We only have to do this for these 4 columns, not the remaining ones. The file has 300 columns in total.

We need to write a MapReduce job for this. What would be an effective way to address this problem? I appreciate your help. Thanks.

Let the mapper output one record for each column whose unique values you need. So in your example, the map will (for a single input record) output 4 records, with the keys being the values of A, B, C, and D.

In the reducer you can then handle all the values.

Depending on the details of what you need, you may want to use a composite key that looks something like this: "A:value of column A".
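The approach above can be sketched as a Hadoop Streaming job in Python. This is a minimal illustration, not the asker's actual setup: it assumes tab-separated input and that A, B, C, and D are the first four columns (the `WANTED` positions are placeholder assumptions).

```python
import sys

# Columns whose distinct values we want, by position.
# Assumption: A, B, C, D are columns 0-3 of a tab-separated file.
WANTED = {"A": 0, "B": 1, "C": 2, "D": 3}

def map_line(line):
    """Emit one composite key per wanted column: 'colname:value'."""
    fields = line.rstrip("\n").split("\t")
    return ["%s:%s" % (name, fields[idx]) for name, idx in WANTED.items()]

def reduce_keys(sorted_keys):
    """After the shuffle, equal keys are adjacent; emit each key once."""
    previous = None
    for key in sorted_keys:
        if key != previous:
            yield key
            previous = key

if __name__ == "__main__":
    # Used as the streaming mapper: the framework sorts the emitted keys
    # before they reach the reducer, which applies reduce_keys.
    for line in sys.stdin:
        for key in map_line(line):
            print(key)
```

To get one HDFS file per column, you could partition on the part of the key before the colon, or (in a Java job) use something like `MultipleOutputs`; the sketch above just shows the key design.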

Basically you need to filter out duplicate records. This can be done in several steps, beginning in the mapper, then the combiner, and finally the reducer. You can also use a Java Set: the mapper can output key = 'Column A', value = 'complete record'. Store the keys in a Set, and if the Set already contains a key, don't emit that record. The same can be done in the combiner, and you may not need a reducer at all. You also need to make sure the Set does not cause an out-of-memory error, by clearing it once it reaches some specific size.
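The in-mapper deduplication with a size-capped set described above can be sketched like this. `DedupEmitter` and the cap of 100000 are hypothetical names and values for illustration, not part of any Hadoop API:

```python
class DedupEmitter:
    """Suppress keys already seen, clearing the set when it grows too large.

    Clearing keeps memory bounded but lets some duplicates through again,
    so a reducer (or later pass) is still needed for full deduplication.
    """

    def __init__(self, emit, max_size=100000):
        self.emit = emit        # callback that writes a key downstream
        self.seen = set()
        self.max_size = max_size

    def offer(self, key):
        if key in self.seen:
            return False        # duplicate within this window: drop it
        if len(self.seen) >= self.max_size:
            self.seen.clear()   # avoid running out of memory
        self.seen.add(key)
        self.emit(key)
        return True
```

Note the trade-off this makes explicit: because clearing the set reintroduces duplicates, skipping the reducer entirely is only safe when the set never fills; otherwise the reducer still has to do a final deduplication.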
