
What is the most effective way to find the distinct values of columns in Hadoop?

I have a file of size 1 TB, and we need to find the distinct values for 4 of its columns. For example, if the file has columns A, B, C, D, E, F, and so on, we need to find all the distinct values in column A and write them to one file in HDFS, and then do the same for B, C, and D.

Note: We only have to do this for these 4 columns, not the remaining ones. The file has 300 columns in total.

We need to write a MapReduce job for this. What would be an effective way to address this problem? I appreciate your help. Thanks.

Let the mapper output one record for each column whose unique values you need. So in your example, the map will (for a single input record) output 4 records, with the keys being the values of A, B, C, and D.

In the reducer you can then handle all the values.

Depending on the details of what you need, you may want to use a composite key that looks something like this: "A:value of column A".
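The approach above can be sketched as a Hadoop Streaming job in Python. This is a minimal illustration, not the asker's actual setup: it assumes tab-separated input and that A, B, C, and D are the first four columns (the `WANTED` positions are placeholder assumptions).

```python
import sys

# Columns whose distinct values we want, by position.
# Assumption: A, B, C, D are columns 0-3 of a tab-separated file.
WANTED = {"A": 0, "B": 1, "C": 2, "D": 3}

def map_line(line):
    """Emit one composite key per wanted column: 'colname:value'."""
    fields = line.rstrip("\n").split("\t")
    return ["%s:%s" % (name, fields[idx]) for name, idx in WANTED.items()]

def reduce_keys(sorted_keys):
    """After the shuffle, equal keys are adjacent; emit each key once."""
    previous = None
    for key in sorted_keys:
        if key != previous:
            yield key
            previous = key

if __name__ == "__main__":
    # Used as the streaming mapper: the framework sorts the emitted keys
    # before they reach the reducer, which applies reduce_keys.
    for line in sys.stdin:
        for key in map_line(line):
            print(key)
```

To get one HDFS file per column, you could partition on the part of the key before the colon, or (in a Java job) use something like `MultipleOutputs`; the sketch above just shows the key design.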

Basically you need to filter out duplicate records. This can be done in several steps, beginning in the mapper, then the combiner, and finally the reducer. You can also use a Java Set: the mapper can output key = 'Column A', value = 'complete record'. Store the keys in a Set, and if the Set already contains a key, don't emit that record. The same can be done in the combiner, and you may not need a reducer at all. You also need to make sure the Set does not cause an out-of-memory error, by clearing it once it reaches some specific size.
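The in-mapper deduplication with a size-capped set described above can be sketched like this. `DedupEmitter` and the cap of 100000 are hypothetical names and values for illustration, not part of any Hadoop API:

```python
class DedupEmitter:
    """Suppress keys already seen, clearing the set when it grows too large.

    Clearing keeps memory bounded but lets some duplicates through again,
    so a reducer (or later pass) is still needed for full deduplication.
    """

    def __init__(self, emit, max_size=100000):
        self.emit = emit        # callback that writes a key downstream
        self.seen = set()
        self.max_size = max_size

    def offer(self, key):
        if key in self.seen:
            return False        # duplicate within this window: drop it
        if len(self.seen) >= self.max_size:
            self.seen.clear()   # avoid running out of memory
        self.seen.add(key)
        self.emit(key)
        return True
```

Note the trade-off this makes explicit: because clearing the set reintroduces duplicates, skipping the reducer entirely is only safe when the set never fills; otherwise the reducer still has to do a final deduplication.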
