I have been trying to understand the execution, internal data structures, and algorithms involved when the command below is executed in Linux.
bzip2 -dc mybig.bz2 | cut -d ',' -f 1,2,4,5,9,10,12 | sort > output_file
How is sort able to sort without having all of the input at once? mybig.bz2 is 20 GB in size, so how does sort manage all of its intermediate results on disk for such a huge file?
Using redirection and intermediate files.
bzip2 -dc mybig.bz2 > temp1
cut -d ',' -f 1,2,4,5,9,10,12 temp1 > temp2
sort temp2 > output_file
Using pipes.
bzip2 -dc mybig.bz2 | cut -d ',' -f 1,2,4,5,9,10,12 | sort > output_file
Is there a better way to do this using the shell, where cat, cut, and sort run in parallel (line buffered) with a minimum of disk I/O and CPU cycles? Any help is highly appreciated.
Using redirection and intermediate files
bzip2 -dc mybig.bz2 > temp1
cut -d ',' -f 1,2,4,5,9,10,12 temp1 > temp2
sort temp2 > output_file
Let us assume mybig.bz2 is 1 GB and the uncompressed version is 10 GB. The above will then:
- read 1 GB (mybig.bz2)
- write 10 GB (temp1)
- read 10 GB (temp1)
- write 10 GB (temp2 - a bit less in practice, since cut drops columns, but let us keep the numbers simple)
- read 10 GB (temp2)
- write 10 GB (sort's temporary files)
- read 10 GB (sort's temporary files)
- write 10 GB (output_file)
In total, disk I/O of 1+10+10+10+10+10+10+10 = 71 GB.
Using pipes
bzip2 -dc mybig.bz2 | cut -d ',' -f 1,2,4,5,9,10,12 | sort > output_file
Here you:
- read 1 GB (mybig.bz2)
- write 10 GB (sort's temporary files)
- read 10 GB (sort's temporary files)
- write 10 GB (output_file)
In total, disk I/O of 1+10+10+10 = 31 GB.
You waste nothing by using pipes. On the contrary: if bzip2 decompresses at the same speed as sort sorts, you can keep two CPUs busy in parallel. Newer versions of sort also support --parallel=N to distribute the sorting over multiple CPUs.
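As a rough sketch (both --parallel=N and -S, the buffer-size option, are GNU coreutils sort options; 4 and 2G are example values to adjust to your machine):
bzip2 -dc mybig.bz2 | cut -d ',' -f 1,2,4,5,9,10,12 | sort --parallel=4 -S 2G > output_file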
If the sorted data compresses well, you can also use --compress-program=PROG to compress the temporary files. This is very useful if you have CPUs sitting idle anyway. Depending on how many CPUs you have idle, you can use pzstd, pigz, pbzip2, or pxz; they offer different levels of compression (from low to high).
This way you may be able to lower the disk I/O from 31 GB to 1+1+1+10 = 13 GB (1 GB read from mybig.bz2, roughly 1 GB written and 1 GB read as compressed temporary files, and 10 GB written to output_file).
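For example, a sketch assuming pzstd is installed (sort requires the compress program to compress stdin to stdout and to decompress when called with -d, which pzstd, pigz, pbzip2, and pxz all do):
bzip2 -dc mybig.bz2 | cut -d ',' -f 1,2,4,5,9,10,12 | sort --compress-program=pzstd > output_file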
The intermediate result in a pipe is not stored anywhere. Instead it is read as soon as it is written: there is only a small buffer (typically on the order of 4-128 KB) between the two processes. When the buffer is full, the writing process blocks until the reading process has read data from the buffer. This technique makes it possible to process 1 TB of data on a system with 1 GB of RAM and 100 GB of disk, as long as the data is compressed when stored on disk.
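If you want to watch sort's intermediate files appear, you can point them at a directory of your choice with -T (a GNU sort option; the directory name below is just an example) and list it from a second terminal while the pipeline runs:
mkdir -p /tmp/sorttmp
bzip2 -dc mybig.bz2 | cut -d ',' -f 1,2,4,5,9,10,12 | sort -T /tmp/sorttmp > output_file
ls -lh /tmp/sorttmp    # run in another terminal: shows the sorted runs being written and merged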