
Sorting algorithm on the Hadoop framework

I have read a number of links on the internet. Here are a few: link1 , link2 . But I am not able to understand what exactly they are doing. Can you please explain this algorithm in a simpler way?

My next question: I have one approach in mind. Tell me whether it is correct or not.

Algorithm -

Divide all the numbers among the mappers.

Mapper - Each mapper sorts its own share with a basic approach (any standard sorting algorithm; no special concept is needed here).

Reducer - Once all mappers are done with their task, create a min-heap with as many nodes as there are mappers, and use it to merge the mappers' outputs into the fully sorted data. (Merging a number of sorted lists is easy with a min-heap.)

Is the above algorithm correct?

Yes, you are right.

The mappers sort with a hybrid of Quicksort and Heapsort.

The reducers only do an n-way merge of the sorted output of the mappers.
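As a rough illustration of that map-side sort followed by an n-way merge, here is a minimal Python sketch. It is not Hadoop code: `map_sort` and `reduce_merge` are names made up for this example, a single in-memory process stands in for the mappers and the reducer, and Python's built-in `sorted` stands in for whatever sort the mappers really use.

```python
import heapq


def map_sort(chunk):
    """Stand-in for one mapper: sort only its own share of the input."""
    return sorted(chunk)


def reduce_merge(sorted_runs):
    """Stand-in for the reducer: n-way merge of already sorted runs.

    The min-heap holds at most one entry per run, as
    (current value, run index, position inside that run).
    """
    heap = [(run[0], run_id, 0) for run_id, run in enumerate(sorted_runs) if run]
    heapq.heapify(heap)

    merged = []
    while heap:
        value, run_id, pos = heapq.heappop(heap)
        merged.append(value)
        next_pos = pos + 1
        if next_pos < len(sorted_runs[run_id]):
            # Refill the heap from the run we just consumed from.
            heapq.heappush(heap, (sorted_runs[run_id][next_pos], run_id, next_pos))
    return merged


# Hypothetical input split across three "mappers".
chunks = [[7245242, 2346246], [1324623, 8212345], [9323244, 4356234]]
runs = [map_sort(chunk) for chunk in chunks]
result = reduce_merge(runs)
print(result)
```

With k runs and N numbers in total, each pop or push costs O(log k), so the merge step is O(N log k). The heap never needs more than one entry per mapper.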

The links you provided are for TeraSort, so I'll try to explain it very briefly. The other answer is kinda correct in that some Hadoop sorting algorithms use a combination of quicksort and mergesort (I'm pretty sure you mean mergesort, not heapsort).

I think, very briefly, TeraSort goes like this:

  • Suppose we wish to sort:

2346246 7245242 8212345 1324623 4356234 9323244

Then you don't need to read the entire records to see which numbers are biggest, just the first digit of each. Keep this in mind when reading on.

  1. Sample the data to understand the distribution of the records - get a sense of the range.

  2. Select 2 bytes from each record, possibly the first 2 (as in the example above).

  3. Bucket the records according to these 2 bytes, ensuring the buckets are 'in order', so that bucket n only has records that are less than every record in bucket n + 1.

  4. Sort each bucket

  5. TADA! You have your data sorted.
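The steps above can be sketched in a few lines of Python. This is only a toy, single-process illustration of the sample, bucket, and sort idea, not TeraSort's actual implementation. The function names, the bucket count, the sample size, and the fixed random seed are all assumptions made for the example. Records are treated as equal-length digit strings so that comparing their first 2 characters agrees with their numeric order.

```python
import bisect
import random

PREFIX_LEN = 2  # step 2: only the first 2 characters decide the bucket


def pick_split_points(records, num_buckets, sample_size, seed=0):
    """Step 1: sample the records and derive num_buckets - 1 ordered cut points."""
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    prefixes = sorted(r[:PREFIX_LEN] for r in sample)
    return [prefixes[(i * len(prefixes)) // num_buckets] for i in range(1, num_buckets)]


def bucket_of(record, split_points):
    """Step 3: buckets are ordered, so bucket n holds smaller keys than bucket n + 1."""
    return bisect.bisect_right(split_points, record[:PREFIX_LEN])


def sample_sort(records, num_buckets=3, sample_size=4):
    if not records:
        return []
    split_points = pick_split_points(records, num_buckets, sample_size)
    buckets = [[] for _ in range(num_buckets)]
    for record in records:
        buckets[bucket_of(record, split_points)].append(record)
    # Step 4: sort each bucket on its own (in Hadoop, one reducer per bucket).
    # Step 5: concatenating the buckets in order gives the sorted data.
    output = []
    for bucket in buckets:
        output.extend(sorted(bucket))
    return output


records = ["2346246", "7245242", "8212345", "1324623", "4356234", "9323244"]
print(sample_sort(records))
```

The point of the sampling step is balance: cut points taken from a sample tend to give buckets of similar size, so no single bucket ends up doing most of the sorting. A poor sample still gives a correct result, just an uneven split.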

 