Sort across multiple files in Linux

I have multiple (many) files, each very large:

file0.txt
file1.txt
file2.txt

I do not want to join them into a single file because the resulting file would be 10+ GB. Each line in each file contains a 40-byte string. The strings are fairly well ordered right now (roughly 1 step in 10 is a decrease in value rather than an increase).

I would like the lines ordered (in place, if possible). This means some of the lines from the end of file0.txt will be moved to the beginning of file1.txt and vice versa.

I am working on Linux and am fairly new to it. I know about the sort command for a single file, but am wondering if there is a way to sort across multiple files. Or maybe there is a way to make a pseudo-file out of smaller files that Linux will treat as a single file.

What I know I can do: I can sort each file individually, then read into file1.txt to find the values larger than the largest in file0.txt (and similarly grab the lines from the end of file0.txt), join them and then sort. But this is a pain, and it assumes that no values from file2.txt belong in file0.txt (which is highly unlikely in my case anyway).

Edit

To be clear, if the files look like this:

f0.txt
DDD
XXX
AAA

f1.txt
BBB
FFF
CCC

f2.txt
EEE
YYY
ZZZ

I want this:

f0.txt
AAA
BBB
CCC

f1.txt
DDD
EEE
FFF

f2.txt
XXX
YYY
ZZZ

I don't know of a command that does in-place sorting, but I think a faster "merge sort" is possible:

for file in *.txt; do
    sort -o "$file" "$file"
done
sort -m *.txt | split -d -l 1000000 - output
  • The sort in the for loop makes sure the content of each input file is sorted. If you don't want to overwrite the original, simply change the value after the -o parameter. (If you expect the files to be sorted already, you could change the sort statement to "check-only": sort -c "$file" || exit 1)
  • The second sort merges the input files efficiently while keeping the output sorted.
  • This is piped to the split command, which then writes to suffixed output files. Notice the - character; it tells split to read from standard input (i.e. the pipe) instead of a file.
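
As a rough illustration, the same pipeline can be tried on the three-line sample files from the question. This is only a sketch: the scratch directory, the f?.txt glob and the chunk size of 3 are assumptions made for the demo, and split -d (numeric suffixes) assumes GNU coreutils.

```shell
#!/bin/sh
# Sketch: sort each file, merge them, split the result into 3-line chunks.
export LC_ALL=C                      # byte-wise ordering, so results are predictable
work=$(mktemp -d)                    # scratch directory for the demo

# Sample data taken from the question.
printf '%s\n' DDD XXX AAA > "$work/f0.txt"
printf '%s\n' BBB FFF CCC > "$work/f1.txt"
printf '%s\n' EEE YYY ZZZ > "$work/f2.txt"

# Sort every input file in place.
for file in "$work"/f?.txt; do
    sort -o "$file" "$file"
done

# Merge the sorted files and write output00, output01, output02.
sort -m "$work"/f?.txt | split -d -l 3 - "$work/output"
```

With real data, the chunk size would be whatever number of lines each output file should hold; the output files are new files, so the originals still have to be removed or replaced afterwards.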

Also, here's a short summary of how the merge sort works:

  1. sort reads a line from each file.
  2. It orders these lines and selects the one which should come first. This line gets sent to the output, and a new line is read from the file which contained this line.
  3. Repeat step 2 until there are no more lines in any file.
  4. At this point, the output should be a perfectly sorted file.
  5. Profit!
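
The steps above can be sketched by hand for the two-file case. This is purely illustrative and far slower than sort -m; the file names a.txt and b.txt and the have_a / have_b flags are made up for the example, and it assumes every line ends with a newline.

```shell
#!/bin/sh
# Sketch: hand-rolled two-way merge of two already-sorted files.
export LC_ALL=C
tmp=$(mktemp -d)
printf '%s\n' AAA DDD XXX > "$tmp/a.txt"
printf '%s\n' BBB CCC FFF > "$tmp/b.txt"

# Open both inputs on their own file descriptors.
exec 3< "$tmp/a.txt" 4< "$tmp/b.txt"

# Step 1: read one line from each file.
IFS= read -r a <&3 && have_a=1 || have_a=0
IFS= read -r b <&4 && have_b=1 || have_b=0

# Steps 2 and 3: emit the smaller pending line, refill from the file
# it came from, and repeat until both files are exhausted.
while [ "$have_a" = 1 ] || [ "$have_b" = 1 ]; do
    if [ "$have_b" = 0 ] || { [ "$have_a" = 1 ] && ! [ "$b" \< "$a" ]; }; then
        printf '%s\n' "$a"
        IFS= read -r a <&3 || have_a=0
    else
        printf '%s\n' "$b"
        IFS= read -r b <&4 || have_b=0
    fi
done > "$tmp/merged.txt"

exec 3<&- 4<&-
```

Because only one pending line per input is held at a time, memory use does not depend on file size, which is why merging suits files that are too large to load whole.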

It isn't exactly what you asked for, but the sort(1) utility can help a little, using the --merge option. Sort each file individually, then merge the resulting pile of files:

for f in file*.txt ; do sort -o "$f" "$f" ; done
sort --merge file*.txt | split -l 100000 - sorted_file

(That's 100,000 lines per output file. Perhaps that's still way too small.)
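
Rather than guessing a line count, one option is to derive it from the data so that the merge produces about as many output files as there were inputs. The following is a sketch under assumptions: the sample contents and the variable names (demo, count, total, per_file) are invented for illustration, and counting lines costs one extra read of every file.

```shell
#!/bin/sh
# Sketch: choose split's -l value so the number of output files
# matches the number of input files.
export LC_ALL=C
demo=$(mktemp -d)                    # scratch directory with sample data
printf '%s\n' DDD XXX AAA GGG > "$demo/file0.txt"
printf '%s\n' BBB FFF CCC > "$demo/file1.txt"
printf '%s\n' EEE YYY ZZZ > "$demo/file2.txt"

# Sort each file in place first.
for f in "$demo"/file*.txt; do sort -o "$f" "$f"; done

# Note: set -- replaces the script's positional parameters.
set -- "$demo"/file*.txt
count=$#                             # number of input files
total=$(cat "$@" | wc -l)            # total number of lines
per_file=$(( (total + count - 1) / count ))   # ceiling division

sort --merge "$@" | split -l "$per_file" - "$demo/sorted_file"
```

Here 10 lines across 3 files gives 4 lines per output file, so the last output file is shorter than the others.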

I believe that this is your best bet, using stock Linux utilities:

  • sort each file individually, e.g. for f in file*.txt; do sort "$f" > "sorted_$f"; done

  • sort everything using sort -m sorted_file*.txt | split -d -l <lines> - <prefix>, where <lines> is the number of lines per file and <prefix> is the filename prefix. (The -d tells split to use numeric suffixes.)

The -m option to sort lets it know the input files are already sorted, so it can merge them instead of sorting from scratch.
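
Since -m trusts that its inputs are sorted, it can be worth checking them with sort -c before merging. A minimal sketch, assuming GNU split for -d; the sample files, the deliberately unsorted second file, the part prefix and the unsorted counter are all made up for the demonstration.

```shell
#!/bin/sh
# Sketch: refuse to merge unless every input passes sort -c.
export LC_ALL=C
chk=$(mktemp -d)
printf '%s\n' AAA DDD XXX > "$chk/sorted_file0.txt"
printf '%s\n' BBB FFF CCC > "$chk/sorted_file1.txt"   # deliberately out of order

unsorted=0
for f in "$chk"/sorted_file*.txt; do
    # sort -c exits non-zero at the first out-of-order line.
    if ! sort -c "$f" 2>/dev/null; then
        echo "not sorted: $f" >&2
        unsorted=$((unsorted + 1))
    fi
done

# Merge only when every input passed the check.
if [ "$unsorted" -eq 0 ]; then
    sort -m "$chk"/sorted_file*.txt | split -d -l 3 - "$chk/part"
fi
```

In this demo the second file fails the check, so the merge is skipped and no part files are written.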

mmap() the 3 files; since all lines are 40 bytes long, you can easily sort them in place (SIP :-). Don't forget the msync at the end.

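If the files are individually sorted, you can use sort -m file*.txt to merge them together: it reads the first line of each file, outputs the smallest one, and repeats.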
