Merging sorted gzip-compressed files

Question

I have 40 files of 2GB each, stored on an NFS architecture. Each file contains two columns: a numeric id and a text field. Each file is already sorted and gzipped.

How can I merge all of these files so that the resulting output is also sorted?

I know sort -m -k 1 should do the trick for uncompressed files, but I don't know how to do it directly with the compressed ones.

PS: I don't want the simple solution of uncompressing the files to disk, merging them, and compressing again, as I don't have sufficient disk space for that.

Answer 1

This is a use case for process substitution. Say you have two sorted files to merge, sorta.gz and sortb.gz. You can feed the output of gunzip -c FILE.gz to sort for both of these files using the <(...) shell operator:

sort -m -k1 <(gunzip -c sorta.gz) <(gunzip -c sortb.gz) >sorted

Process substitution replaces a command with a file name that represents the output of that command, and is typically implemented with either a named pipe or a /dev/fd/... special file.
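
To make the named-pipe mechanism concrete, here is a minimal sketch that does by hand roughly what <(...) arranges: each gunzip writes into a FIFO, and sort reads the FIFOs as if they were ordinary files. The sample rows, the temporary directory and the pipe names are made up for illustration, and -n assumes the first column was sorted numerically.

```shell
# Sketch only: merge two sorted .gz files through named pipes (FIFOs).
# All file names and sample rows below are invented for the demonstration.
workdir=$(mktemp -d)

printf '1\tapple\n3\tcherry\n' | gzip -c > "$workdir/sorta.gz"
printf '2\tbanana\n4\tdate\n'  | gzip -c > "$workdir/sortb.gz"

mkfifo "$workdir/a.pipe" "$workdir/b.pipe"

# Each writer blocks until sort opens the matching pipe for reading,
# so the uncompressed data never lands on disk.
gunzip -c "$workdir/sorta.gz" > "$workdir/a.pipe" &
gunzip -c "$workdir/sortb.gz" > "$workdir/b.pipe" &

# -m merges already-sorted inputs; -n -k1,1 compares column 1 numerically
# (an assumption about how the inputs were sorted).
merged=$(sort -m -n -k1,1 "$workdir/a.pipe" "$workdir/b.pipe")
wait
rm -rf "$workdir"

printf '%s\n' "$merged"   # prints the four rows in id order: 1, 2, 3, 4
```

With process substitution the shell creates and cleans up the pipes for you, which is why the one-line form is usually preferable when bash is available.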

For 40 files, you will want to build the command with that many process substitutions dynamically, and use eval to execute it:

cmd="sort -m -k1 "
for input in file1.gz file2.gz file3.gz ...; do
    cmd="$cmd <(gunzip -c '$input')"
done
eval "$cmd" >sorted       # or eval "$cmd" | gzip -c > sorted.gz
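
One point worth checking with this many inputs: GNU sort merges at most 16 files at a time by default and writes intermediate results to temporary files beyond that, which matters when disk space is tight. The sketch below is one possible eval-free variant that uses a named pipe per input and raises that limit. The function name merge_sorted_gz is hypothetical, --batch-size assumes GNU sort, and -n assumes the ids were sorted numerically.

```shell
# Hypothetical helper (not from the original answer): merge any number of
# sorted .gz files to stdout, one named pipe per input, without eval.
# Assumes GNU sort (for --batch-size) and a numerically sorted first column.
merge_sorted_gz() {
    local tmpdir n f batch status
    tmpdir=$(mktemp -d) || return 1
    n=0
    for f in "$@"; do
        n=$((n + 1))
        mkfifo "$tmpdir/in.$n" || return 1
        # Decompress in the background; the data only ever lives in the pipe.
        gunzip -c "$f" > "$tmpdir/in.$n" &
    done

    # GNU sort merges 16 inputs at once by default and spills to temporary
    # files beyond that; --batch-size raises the limit (minimum value is 2).
    batch=$n
    [ "$batch" -lt 2 ] && batch=2

    sort -m -n -k1,1 --batch-size="$batch" "$tmpdir"/in.*
    status=$?
    wait
    rm -rf "$tmpdir"
    return "$status"
}
```

A call such as merge_sorted_gz file*.gz | gzip -c > sorted.gz would then stream the merged result straight back into a compressed file. The same --batch-size option can also be added to the eval-based command.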

Answer 2

    #!/bin/bash

    FILES=file*.gz               # list of your 40 gzip files
                                 # (e.g. file1.gz ... file40.gz)

    WORK1="merged.gz"            # first temp file and the final file
    WORK2="tempfile.gz"          # second temp file

    > "$WORK1"                   # create empty final file
    > "$WORK2"                   # create empty temp file

    gzip -qc "$WORK2" > "$WORK1" # compress content of empty second
                                 # file to first temp file

    for I in $FILES; do
        echo current file: "$I"
        sort -k 1 -m <(gunzip -c "$I") <(gunzip -c "$WORK1") | gzip -c > "$WORK2"
        mv "$WORK2" "$WORK1"
    done

The easiest way to fill $FILES is with bash globbing (file*.gz) or with a whitespace-separated list of the 40 file names. The files listed in $FILES stay unchanged.

At the end, the 80 GB of data are stored, compressed, in $WORK1. While the script runs, no uncompressed data is written to disk.
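
After a long-running merge like this, it can be reassuring to confirm that the result really is sorted, again without writing uncompressed data to disk. The following is a small sketch using sort -c, which only checks order and produces no output file. The function name is_sorted_gz is hypothetical, and the default of a numeric comparison on column 1 is an assumption; pass the same sort options that were used for the merge if they differ.

```shell
# Hypothetical helper: exit status 0 only if the .gz file is already sorted.
# Usage: is_sorted_gz FILE.gz [sort options...]
is_sorted_gz() {
    local file=$1
    shift
    # Default to a numeric comparison on column 1 when no options are given.
    [ "$#" -eq 0 ] && set -- -n -k1,1
    # sort -c reads stdin, reports the first out-of-order line on stderr,
    # and returns a non-zero status if the input is not sorted.
    gunzip -c "$file" | sort -c "$@"
}
```

For example, is_sorted_gz merged.gz && echo "merged.gz is sorted" checks the script's output file, and is_sorted_gz merged.gz -k 1 checks it against the ordering used in the script.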

Answer 3

Adding a differently flavoured multi-file merge within a single pipeline: it takes all (pre-sorted) files in $OUT/uniques, sort-merges them and compresses the output. lz4 is used because of its speed:

find $OUT/uniques -name '*.lz4' |
  awk '{print "<( <" $0 " lz4cat )"}' |
  tr "\n" " " |
  (echo -n sort -m -k3b -k2 " "; cat -; echo) |
  bash |
  lz4 \
> $OUT/uniques-merged.tsv.lz4

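The pipeline above works by generating the text of a sort command and handing it to bash. Adapted to the gzip files from the question, the same idea might look like the sketch below, which prints the command first so it can be inspected before it is run. The function name build_merge_cmd is hypothetical, file names are assumed to contain no single quotes, and -n -k1,1 assumes a numerically sorted first column.

```shell
# Hypothetical helper: print (do not run) a merge command for the given
# .gz files, with one process substitution per file.
# Assumes file names contain no single quotes.
build_merge_cmd() {
    local f
    printf 'sort -m -n -k1,1'
    for f in "$@"; do
        printf " <(gunzip -c '%s')" "$f"
    done
    printf '\n'
}
```

Running build_merge_cmd file*.gz shows the generated command, and build_merge_cmd file*.gz | bash | gzip -c > merged.gz executes it and recompresses the result in one pipeline.
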
Answer 4

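There are indeed zgrep and other common utilities that work on compressed files, but in this case you need to sort/merge the uncompressed data and compress the result.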


Answer 1: score 16, accepted, 2014-07-04 22:12:17
Answer 2: score 2, 2014-07-03 21:01:04
Answer 3: score 1, 2016-10-04 11:40:23
Answer 4: score -2, 2014-07-03 20:50:38
