
Sorting a huge text file using Hadoop

Is it possible to sort a huge text file lexicographically using a MapReduce job that has only map tasks and zero reduce tasks?

The records of the text file are separated by the newline character, and the size of the file is around 1 terabyte.

It would be great if anyone could suggest a way to sort this huge file.

I used a TreeSet in the map method to hold all the data of the input split and persisted it. Finally, I got the sorted file!
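For reference, a minimal sketch of that approach (the class name and method bodies are my reconstruction, not the original code): each mapper buffers its input split in a TreeSet and emits the records in sorted order from cleanup. With zero reduce tasks each mapper's output file is sorted, but the result is only one globally sorted file if the whole input arrives as a single split.

import java.io.IOException;
import java.util.TreeSet;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of the TreeSet approach described above (names are hypothetical).
public class TreeSetSortMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    // TreeSet keeps records in lexicographic order. Caveat: it silently
    // drops duplicate lines; use a TreeMap<String, Long> with counts if
    // duplicates must be preserved.
    private final TreeSet<String> records = new TreeSet<>();

    @Override
    protected void map(LongWritable offset, Text line, Context context) {
        // Buffer every record of this input split in memory.
        records.add(line.toString());
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Emit the buffered records in sorted order once the split is consumed.
        for (String record : records) {
            context.write(new Text(record), NullWritable.get());
        }
    }
}

The driver just sets job.setNumReduceTasks(0), so the map output is written straight to HDFS. Note that the whole split must fit in the mapper's heap, which is the real constraint of this approach.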

There is, in fact, a sort example bundled with Hadoop. You can see how it works by examining the class org.apache.hadoop.examples.Sort. It works pretty well on its own, but if you want more flexibility with your sort, you can check this out.
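The bundled example is normally launched from the examples jar (hadoop jar hadoop-*examples*.jar sort <input> <output>). If you only need the core idea it builds on, here is a self-contained sketch of my own (not the bundled code): emit each line as the map output key and let the shuffle sort it; with a single reducer, the lone output file is globally sorted.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LineSort {

    // Emit each line as the key; the shuffle sorts keys before reduce.
    public static class LineMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            context.write(line, NullWritable.get());
        }
    }

    // Identity-style reducer: keys arrive already sorted; write one line
    // per value so duplicate input lines survive.
    public static class LineReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text line, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            for (NullWritable ignored : values) {
                context.write(line, NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "line sort");
        job.setJarByClass(LineSort.class);
        job.setMapperClass(LineMapper.class);
        job.setReducerClass(LineReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        job.setNumReduceTasks(1); // one reducer -> one globally sorted file
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

A single reducer is obviously a bottleneck for a terabyte of input; scaling out takes many reducers plus a partitioner that assigns key ranges to them, which is what the next answer covers.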

Sorting in Hadoop is done using a Partitioner: the shuffle already sorts keys within each reduce partition, so a custom partitioner that routes key ranges to reducers lets you sort according to your business-logic needs (a toy sketch follows below). Please see this link on writing a custom partitioner: http://jugnu-life.blogspot.com/2012/05/custom-partitioner-in-hadoop.html
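As a toy illustration of that idea (my own example, not the one from the linked post): a range partitioner can bucket keys by their first byte, so reducer 0 receives the lexicographically smallest range and the last reducer the largest. Since the shuffle sorts each reducer's input, concatenating the part files in order then gives one globally sorted result.

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical range partitioner: spreads keys over reducers by first byte.
public class FirstBytePartitioner extends Partitioner<Text, NullWritable> {
    @Override
    public int getPartition(Text key, NullWritable value, int numPartitions) {
        if (key.getLength() == 0) {
            return 0; // empty lines go to the first (smallest) partition
        }
        // Map the first byte (0..255) proportionally onto 0..numPartitions-1.
        int firstByte = key.getBytes()[0] & 0xFF;
        return firstByte * numPartitions / 256;
    }
}

Register it with job.setPartitionerClass(FirstBytePartitioner.class). Real data is rarely uniform in its first byte, which is why Hadoop also ships TotalOrderPartitioner: it samples the input to choose balanced split points instead of guessing.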

I do not advocate sorting terabytes of data with plain vanilla Linux sort commands; you would need to split the data into chunks that fit in memory to sort files this large: Parallel sort in Linux

It's better and more expedient to use Hadoop merge sort instead: Hadoop MergeSort

You can look at some Hadoop sorting benchmarks and analysis from the Yahoo Hadoop team (now Hortonworks) here: Hadoop Sort benchmarks
