Sort across multiple files in Linux
I have many files, each very large:
file0.txt
file1.txt
file2.txt
I do not want to join them into a single file because the resulting file would be 10+ GB. Each line in each file contains a 40-byte string. The strings are already fairly well ordered (about 1 step in 10 is a decrease in value instead of an increase).
I would like the lines ordered (in place, if possible). This means some of the lines from the end of file0.txt will be moved to the beginning of file1.txt, and vice versa.
I am working on Linux and am fairly new to it. I know about the sort command for a single file, but am wondering if there is a way to sort across multiple files. Or maybe there is a way to build a pseudo-file out of smaller files that Linux will treat as a single file.
What I know I can do: I can sort each file individually, read into file1.txt to find the first value larger than the largest in file0.txt (and similarly grab the lines from the end of file0.txt), join them and then sort. But this is a pain, and it assumes that no values from file2.txt belong in file0.txt (which is, however, highly unlikely in my case).
To be clear, if the files look like this:
f0.txt
DDD
XXX
AAA
f1.txt
BBB
FFF
CCC
f2.txt
EEE
YYY
ZZZ
I want this:
f0.txt
AAA
BBB
CCC
f1.txt
DDD
EEE
FFF
f2.txt
XXX
YYY
ZZZ
I don't know of a command that does in-place sorting, but I think a faster "merge sort" is possible:
for file in *.txt; do
    sort -o "$file" "$file"
done
sort -m *.txt | split -d -l 1000000 - output
The sort in the for loop makes sure the content of each input file is sorted. If you don't want to overwrite the originals, simply change the value after the -o parameter. (If you expect the files to be sorted already, you could change the sort statement to "check-only": sort -c "$file" || exit 1)
The second sort (with -m) merges the input files efficiently, all while keeping the output sorted.
Its output is piped to the split command, which then writes to suffixed output files. Notice the - character; this tells split to read from standard input (i.e. the pipe) instead of a file.
Also, here's a short summary of how the merge sort works: sort reads a line from each file, outputs the smallest, reads a replacement line from the file that line came from, and repeats until every file is exhausted.
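As a sanity check, the two steps above can be tried on the three tiny files from the question. This is only a sketch: the temporary directory, the out_ prefix and the LC_ALL=C setting (plain byte-order comparison) are assumptions made for the demo, and split -d is a GNU extension.

```shell
# Toy run of "sort each file, then merge and split" on the question's example.
# Assumptions: GNU sort/split, byte-order collation, illustrative names.
export LC_ALL=C
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'DDD\nXXX\nAAA\n' > f0.txt
printf 'BBB\nFFF\nCCC\n' > f1.txt
printf 'EEE\nYYY\nZZZ\n' > f2.txt

# Step 1: sort every file on its own, overwriting it.
for f in f*.txt; do
    sort -o "$f" "$f"
done

# Step 2: merge the sorted files and cut the stream into 3-line pieces.
sort -m f*.txt | split -d -l 3 - out_

# out_00, out_01 and out_02 now hold the globally sorted lines.
cat out_00 out_01 out_02
```

With real data the line count given to split would be much larger, and the originals can be removed once the output pieces have been checked.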
It isn't exactly what you asked for, but the sort(1) utility can help, a little, using the --merge option. Sort each file individually, then sort the resulting pile of files:
for f in file*.txt ; do sort -o "$f" "$f" ; done
sort --merge file*.txt | split -l 100000 - sorted_file
(That's 100,000 lines per output file. Perhaps that's still way too small.)
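If the goal is to end up with as many output files as there were inputs, the line count for split can be computed instead of guessed. The sketch below assumes the inputs are already sorted individually; the demo data, the variable names and LC_ALL=C are illustrative only.

```shell
# Pick the lines-per-file value so that the number of output pieces
# matches the number of input files (hypothetical demo data).
export LC_ALL=C
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'AAA\nDDD\nXXX\n' > file0.txt
printf 'BBB\nCCC\nFFF\n' > file1.txt
printf 'EEE\nYYY\n'      > file2.txt

set -- file*.txt                      # positional parameters = input files
count=$#                              # number of input files
total=$(cat "$@" | wc -l)             # total number of lines across all inputs
per_file=$(( (total + count - 1) / count ))   # ceiling of total / count

sort --merge "$@" | split -d -l "$per_file" - sorted_file
ls sorted_file*
```

Counting lines with wc reads every file once; since each line here has a fixed length, dividing the total file size by the line length would be a cheaper way to get the same number.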
I believe that this is your best bet, using stock Linux utilities:
sort each file individually, e.g. for f in file*.txt; do sort "$f" > "sorted_$f"; done
sort everything using sort -m sorted_file*.txt | split -d -l <lines> - <prefix>, where <lines> is the number of lines per file, and <prefix> is the filename prefix. (The -d tells split to use numeric suffixes.)
The -m option to sort lets it know the input files are already sorted, so it can be smart.
mmap() the 3 files; as all lines are 40 bytes long, you can easily sort them in place (SIP :-). Don't forget the msync at the end.
If the files are sorted individually, you can use sort -m file*.txt to merge them together: it reads the first line of each file, outputs the smallest, then repeats.
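Because sort -m trusts its inputs, it may be worth confirming that every file really is sorted before merging. A minimal sketch using sort -c, which exits non-zero when a file is out of order; the demo files and the unsorted variable are made up for illustration.

```shell
# Check each input with sort -c before handing the set to sort -m.
# Assumptions: byte-order collation, illustrative demo files.
export LC_ALL=C
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'AAA\nDDD\nXXX\n' > file0.txt
printf 'BBB\nFFF\nCCC\n' > file1.txt   # deliberately out of order

unsorted=""
for f in file*.txt; do
    # sort -c only checks; its diagnostic on stderr is discarded here
    if ! sort -c "$f" 2>/dev/null; then
        unsorted="$unsorted $f"
    fi
done

if [ -n "$unsorted" ]; then
    echo "not sorted:$unsorted"        # these files need a full sort first
else
    sort -m file*.txt
fi
```

The check is a single linear pass per file, so it costs far less than sorting again.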