
Is it possible to sort a huge text file using the Linux sort command by a number at the end of each line?

I am trying to sort a text file where the lines are in the following format:

! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 6

and want to sort numerically descending by the number at the end (i.e. 6 in this example). The lines do not have a predictable number of columns when space is the delimiter, but with ||| as the delimiter there are always 5 columns, and the final column always has 3 space-delimited numbers, the last of which is the one to sort by. The text file is around 15 GB. I did write a Perl script to do it, but it only worked on my old laptop, which had 32 GB of RAM, because the script loads the whole file at once. Now I am stuck with 8 GB of RAM and it just churns the swap file for days. I have heard that the standard Linux sort command handles huge files more gracefully, but I can't find a way to make it use the number at the end.

It may be a bit tricky, but this combination of commands can do it:

awk '$1=$NF" "$1' file | sort -n | cut -d' ' -f2-

The main idea is to print each line with its last value prepended, sort on that value, and finally remove it from the output. Note that sort -n sorts in ascending order; sort -rn gives the descending order asked for in the question.

  • awk '$1=$NF" "$1' file: as the value you want to sort by is the last field of each line, print it in the first field as well.
  • sort -n: then pipe to sort -n, which sorts numerically.
  • cut -d' ' -f2-: finally, strip the value that was temporarily added to the front of each line.

Test

$ cat a
! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 6
! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 79
! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 19
! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 8
! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 89
$ awk '$1=$NF" "$1' a | sort -n | cut -d' ' -f2-
! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 6
! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 8
! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 19
! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 79
! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 89

Showing each step:

$ awk '$1=$NF" "$1' a 
6 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 6
79 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 79
19 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 19
8 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 8
89 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 89
$ awk '$1=$NF" "$1' a | sort -n
6 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 6
8 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 8
19 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 19
79 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 79
89 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 89
$ awk '$1=$NF" "$1' a | sort -n | cut -d' ' -f2-
! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 6
! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 8
! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 19
! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 79
! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 89
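
The pipeline above sorts ascending, and assigning to $1 makes awk rebuild each line with single spaces between fields. A minimal sketch of a descending variant that leaves the line text untouched is shown below. It assumes the data contains no tab characters, so a tab can separate the temporary key from the line; the sample lines and the temporary file are made up for illustration.

```shell
# Hypothetical sample data in a temporary file.
sample=$(mktemp)
printf '%s\n' \
  'a b ||| x ||| 0.1 ||| 0-0 ||| 1 2 6' \
  'c ||| y ||| 0.2 ||| 0-0 1-0 ||| 3 4 79' \
  'd e f ||| z ||| 0.3 ||| 0-0 ||| 5 6 19' > "$sample"

tab=$(printf '\t')

# Decorate: prepend the last field plus a tab, keeping $0 unchanged.
# Sort: numeric and reversed (descending) on the first tab-separated field only.
# Undecorate: drop the first tab-separated field (cut splits on tabs by default).
sorted=$(awk '{ print $NF "\t" $0 }' "$sample" \
  | sort -t "$tab" -k1,1nr \
  | cut -f2-)

printf '%s\n' "$sorted"
```

On a real file the result would be redirected to an output file rather than captured in a variable.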

It seems that you want to order the file according to the last number, right?

So you can duplicate the last field at the start of the line with awk:

awk '{ print $NF, $0 }' prova

then sort the file with

sort -n -k1

and finally remove the fake first field:

sed 's/^[0-9][0-9]* //'

Here is the script:

awk '{ print $NF, $0 }' prova | sort -n -k1 | sed 's/^[0-9][0-9]* //'
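
The sed pattern in the script only strips a key made of plain digits, so it may be worth checking that every line really ends in an integer before starting a long sort. The following is a rough sketch of such a check; the sample lines and the temporary file are invented for illustration, and the third line deliberately ends in a non-integer.

```shell
# Hypothetical sample data in a temporary file.
sample=$(mktemp)
printf '%s\n' \
  'a ||| b ||| 1 2 6' \
  'c ||| d ||| 3 4 79' \
  'e ||| f ||| 5 6 1.5e+03' > "$sample"

# Count the lines whose last space-separated field is NOT a plain integer.
# grep exits non-zero when the count is 0, hence the "|| true".
bad=$(grep -c -v -E ' [0-9]+$' "$sample" || true)

echo "lines without a trailing integer: $bad"
```

A count of 0 suggests the digit-only sed pattern is safe for the whole file.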

Since the problem is RAM, perhaps you can reduce the memory required by using Tie::File. It lets you refer to a line of the file by its index in an array. You can extract the numbers to sort by, use a Schwartzian transform to get a sorted list of indexes, and then simply reprint the file at the end.

use strict;
use warnings;
use Tie::File;

my $file = shift;                           # your filename argument
tie my @lines, 'Tie::File', $file or die $!;
my @list = map $_->[0],                     # restore line number
           sort { $b->[1] <=> $a->[1] }     # sort on captured number
           map { [ $_, $lines[$_] =~ /(\d+)$/ ] } 0 .. $#lines;
           # store an array ref [ ... ] containing line number and number to 
           # sort by
@lines = @lines[@list];

The last operation saves the file in sorted order. Note that this is a permanent change, so make backups. It is also probably an expensive operation, and Tie::File has had some performance issues. Another way to do it, which is probably less expensive, is to simply iterate over the sorted list of line indexes and print line by line to a new file:

open my $fh, ">", "output.csv" or die $!;
for my $num (@list) {
    print $fh $lines[$num], $/;
}

Printing directly to a file this way circumvents any shell caching required by redirecting output.

Assuming I'm allowed to ruin the original file (make a copy otherwise), you can use sort on the last column by passing through the file once and moving the last column to a predictable column number. I'm using the @ symbol as something that I assume will not appear in your data. Anything else can be substituted if that's a bad assumption.

sed -i 's/ /@/g; s/@\([^@]*\)$/ \1/;' in.txt
# the file now looks like "!@!@|||@whatever@||| 6"
sort --buffer-size=1G -nk 2 in.txt | sed 's/@/ /g' > sorted.txt
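
If the original file should stay untouched, the same substitution can be streamed through a pipe instead of applied with sed -i. The sketch below is one possible variant under the same assumption that @ never appears in the data. It also sorts descending with -k2,2nr, uses -S (the short form of --buffer-size) and -T to point sort's temporary files at a chosen directory. The sample data, file names, and the 64M buffer size are placeholders.

```shell
# Hypothetical input, output, and scratch locations.
in=$(mktemp)
out=$(mktemp)
tmpdir=$(mktemp -d)
printf '%s\n' \
  'a b ||| x ||| 1 2 6' \
  'c ||| y ||| 3 4 79' \
  'd e f ||| z ||| 5 6 19' > "$in"

# 1. Replace every space with @, then turn the last @ back into a space,
#    so the sort key is always the second whitespace-separated field.
# 2. Sort on that field, numerically and in reverse; sort spills to files
#    under $tmpdir when the data does not fit in the buffer.
# 3. Restore the remaining @ characters to spaces.
sed 's/ /@/g; s/@\([^@]*\)$/ \1/' "$in" \
  | LC_ALL=C sort -S 64M -T "$tmpdir" -k2,2nr \
  | sed 's/@/ /g' > "$out"

cat "$out"
```

For a 15 GB input, the directory given to -T needs enough free space to hold the temporary runs.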
