
Internal working of sort/uniq command when input is piped or redirected to it

I have been trying to understand the execution model and the internal data structures and algorithms involved when the command below is run on Linux.

bzip2 -dc mybig.bz2 | cut -d ',' -f 1,2,4,5,9,10,12 | sort > output_file
  1. When you pipe the output of one program into another, where is the intermediate result stored?
    • When a command like sort or uniq is piped to accept input from the previous command, can sort work in parallel with bzip2? Can sort start sorting without having all of its input at once?
    • Since sort (GNU coreutils) does a merge sort internally, where are the intermediate results of the merge stored during execution? Say mybig.bz2 is 20 GB: how does sort manage all of those intermediate results on disk for such huge files?
  2. How do the number of I/O operations, the intermediate file sizes and the CPU usage compare between the following two shell scripts? (I'm looking for theoretical reasoning rather than a benchmark result.)

Using redirection and intermediate files.

bzip2 -dc mybig.bz2 > temp1
cut -d ',' -f 1,2,4,5,9,10,12 temp1 > temp2 
sort temp2 > output_file

Using pipes.

bzip2 -dc mybig.bz2 | cut -d ',' -f 1,2,4,5,9,10,12 | sort > output_file

Is there a better way to do this in the shell, where bzip2, cut and sort run in parallel (line buffered) and use the minimum of disk I/O and CPU cycles?

Any help is highly appreciated.

Using redirection and intermediate files

bzip2 -dc mybig.bz2 > temp1
cut -d ',' -f 1,2,4,5,9,10,12 temp1 > temp2 
sort temp2 > output_file

Let us assume mybig.bz2 is 1 GB and the uncompressed version is 10 GB. The above will then:

  • read 1 GB and write 10 GB (bzip2 -> temp1)
  • read 10 GB and write 10 GB (cut; we assume its output is essentially the same size as its input)
  • read 10 GB, write 10 GB, then read 10 GB and write 10 GB (sort uses temporary files on disk for big sorts).

In total disk I/O of 1+10+10+10+10+10+10+10 = 71 GB.
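
If you want to check these numbers empirically rather than theoretically, one rough option is GNU time's verbose output (assuming GNU time is installed as /usr/bin/time; the shell built-in time does not report these counters). It prints "File system inputs" and "File system outputs" in 512-byte blocks for each process:

/usr/bin/time -v bzip2 -dc mybig.bz2 > temp1
/usr/bin/time -v cut -d ',' -f 1,2,4,5,9,10,12 temp1 > temp2
/usr/bin/time -v sort temp2 > output_file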

Using pipes

bzip2 -dc mybig.bz2 | cut -d ',' -f 1,2,4,5,9,10,12 | sort > output_file

Here you:

  • Read 1 GB (bzip2 - the uncompressed data is never written to disk)
  • Read nothing from disk (cut streams the data straight through memory)
  • Write 10 GB, read 10 GB and write 10 GB (sort reads its input from the pipe, spills to temporary files on disk, reads those back for the merge and writes the output)

In total disk I/O of 1+10+10+10 = 31 GB.
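
If you want to see where sort keeps those intermediate merge files, here is a rough sketch (assuming GNU sort; -T, or the TMPDIR environment variable, selects the directory used for temporary files, and /mnt/scratch is just a placeholder path):

bzip2 -dc mybig.bz2 | cut -d ',' -f 1,2,4,5,9,10,12 | sort -T /mnt/scratch > output_file &
watch du -sh /mnt/scratch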

You waste nothing by using pipes. On the contrary: if bzip2 runs at the same speed as the sorting, you keep two CPUs busy in parallel. Newer versions of sort also support --parallel=N to distribute the sorting over multiple CPUs.
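
For example (a sketch, not a tuned command: the thread count and the -S memory buffer size are placeholders you would adjust to your machine):

bzip2 -dc mybig.bz2 | cut -d ',' -f 1,2,4,5,9,10,12 | sort --parallel=4 -S 2G > output_file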

If the sorted data compresses well, you can also use --compress-program=PROG to compress sort's temporary files. This is very useful if you have CPUs sitting idle anyway. Depending on how many idle CPUs you have, you can use pzstd, pigz, pbzip2 or pxz; they give different levels of compression (roughly from low to high in that order).

This way you may be able to lower the disk I/O from 31 GB to roughly 1+1+1+10 = 13 GB: the compressed input, one compressed round trip through the temporary files, and the uncompressed output.
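
A sketch of what that pipeline could look like (assuming pzstd is installed; sort runs the given program to compress each temporary file and runs it with -d to decompress it during the merge):

bzip2 -dc mybig.bz2 | cut -d ',' -f 1,2,4,5,9,10,12 | sort --compress-program=pzstd > output_file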

The intermediate result in a pipe is not stored anywhere. Instead it is read as soon as it is written. There is only a small buffer (typically on the order of 4-128 KB) between the two processes. When the buffer is full, the writing process blocks until the reading process has read data from the buffer. This technique makes it possible to process 1 TB of data on a system with 1 GB of RAM and 100 GB of disk, as long as the data is compressed when it is stored on disk.
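
You can see the blocking behaviour with a small experiment (assuming GNU coreutils; on Linux the pipe buffer is typically 64 KB). The writer below pushes 256 KB into the pipe, fills the buffer and blocks; "writer finished" only appears once the reader wakes up after the sleep and starts draining the pipe:

{ dd if=/dev/zero bs=1k count=256 status=none; echo 'writer finished' >&2; } | { sleep 3; wc -c; }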
