How to remove 1 instance of each (identical) line in a text file in Linux?
There is a file:
Mary
Mary
Mary
Mary
John
John
John
Lucy
Lucy
Mark
I need to get:
Mary
Mary
Mary
John
John
Lucy
I cannot get the lines ordered by how many times each line is repeated in the text, i.e. the most frequently occurring lines must be listed first.
If your file is already sorted (most-frequent words at top, repeated words only in consecutive lines) – your question makes it look like that's the case – you could reformulate your problem as: "Skip a word when it is encountered for the first time". Then a possible (and efficient) awk solution would be:
awk 'prev==$0{print}{prev=$0}'
or, if you prefer an approach that looks more familiar when coming from other programming languages:
awk '{if(prev==$0)print;prev=$0}'
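Run on the sample from the question, the one-liner drops exactly the first occurrence of each consecutive group (a sketch, feeding the data via printf instead of a file):

```shell
# Skip each line the first time it appears in a consecutive run;
# print it on every subsequent repetition.
printf '%s\n' Mary Mary Mary Mary John John John Lucy Lucy Mark \
  | awk 'prev==$0{print}{prev=$0}'
# prints: Mary Mary Mary John John Lucy (one name per line)
```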
Partially working solutions below. I'll keep them for reference; maybe they are helpful to somebody else.
If your file is not too big, you could use awk to count identical lines and then output each group the number of times it occurred, minus 1:
awk '
  { lines[$0]++ }
  END {
    for (line in lines) {
      for (i = 1; i < lines[line]; ++i) {
        print line
      }
    }
  }
'
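Unlike the consecutive-lines one-liner, this counting version also works on unsorted input, though `for (line in lines)` visits the groups in an unspecified order (a sketch):

```shell
# Count every identical line, then print each one count-1 times.
# Group order in the output is arbitrary (awk's for-in order).
printf '%s\n' John Mary John Mary Mary Lucy \
  | awk '{lines[$0]++} END {for (line in lines) for (i = 1; i < lines[line]; ++i) print line}'
# emits two copies of Mary and one of John, in some order
```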
Since you mentioned that the most frequent line must come first, you have to sort first:
sort | uniq -c | sort -nr | awk '{count=$1;for(i=1;i<count;++i){$1="";print}}' | cut -c2-
Note that the latter will reformat your lines (e.g. collapsing/squeezing repeated spaces). See: Is there a way to completely delete fields in awk, so that extra delimiters do not print?
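On the question's sample, the whole pipeline behaves like this (a sketch; `sort -nr` makes the most frequent group come out first):

```shell
# sort groups identical lines; uniq -c prefixes each with its count;
# sort -nr puts the most frequent group first; awk prints each line
# count-1 times with the count field blanked out; cut strips the
# leading output separator left behind by $1="".
printf '%s\n' Mary Mary Mary Mary John John John Lucy Lucy Mark \
  | sort | uniq -c | sort -nr \
  | awk '{count=$1; for (i = 1; i < count; ++i) {$1=""; print}}' \
  | cut -c2-
# prints: Mary Mary Mary John John Lucy (one name per line)
```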
Don't sort for no reason:

nawk '_[$-__]--'
gawk '__[$_]++'
mawk '__[$_]++'
Mary
Mary
Mary
John
John
Lucy
For 1 GB+ files, you can speed things up a bit by preventing FS from splitting unnecessary fields:
mawk2 '__[$_]++' FS='\n'
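With `FS='\n'` the whole record is a single field, so awk skips per-word splitting on wide lines; the trick is not mawk2-specific and works with any POSIX awk (a sketch):

```shell
# __[$_]++ amounts to __[$0]++: the tally is 0 (false) the first time
# a line is seen and >=1 (true, hence printed) on every later sighting.
# FS='\n' keeps each record as one field, avoiding split overhead.
printf '%s\n' Mary Mary John | awk '__[$_]++' FS='\n'
# prints: Mary
```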
For 100 GB inputs, one idea would be to use parallel to create, say, 10 instances of awk, piping the full 100 GB to each instance but assigning each of them a particular range to partition on their end (e.g. instance 4 handles lines beginning with FQ, etc.). But instead of outputting it all and THEN attempting to sort the monstrosity, one could simply have them tally up and print out only a frequency report of how many copies ("Nx") of each unique line ("Lx") have been recorded.
From there, one could sort a much smaller file along the column holding the Lx's, THEN pipe it to one more awk that would print out Nx copies of each line Lx.
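The frequency-report idea can be sketched on a small input: stage one emits compact "count, line" pairs, which would be the only data that needs sorting; stage two expands each line back into copies (a sketch of the idea, not the answerer's exact commands; it expands to count-1 copies to match the question's goal of dropping one instance of each line):

```shell
# Stage 1: tally each unique line into a "count<TAB>line" report.
# Stage 2: expand each reported line back to count-1 copies.
# Assumes the input lines themselves contain no tab characters.
printf '%s\n' Mary Mary Mary John John Lucy Mark \
  | awk '{n[$0]++} END {for (l in n) print n[l] "\t" l}' \
  | awk -F'\t' '{for (i = 1; i < $1; ++i) print $2}'
# emits Mary twice and John once, in arbitrary order
```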
This is probably a lot faster than trying to sort 100 GB.
I created a test scenario by cloning 71 shuffled copies of a raw file with these stats:
uniq rows = 8125950. | UTF8 chars = 160950688. | bytes = 160950688.
That is, 8.12 mn unique rows spanning 154 MB, resulting in a 10.6 GB test file:
in0: 10.6GiB 0:00:30 [ 354MiB/s] [ 354MiB/s] [============>] 100%
rows = 576942450. | UTF8 chars = 11427498848. | bytes = 11427498848.
Even when using just 1 single instance of awk, it finished filtering the 10.6 GB in ~13.25 mins, which is reasonable given that it's tracking 8.1 mn unique hash keys.
in0: 10.6GiB 0:13:12 [13.7MiB/s] [13.7MiB/s] [============>] 100%
out9: 10.5GiB 0:13:12 [13.6MiB/s] [13.6MiB/s] [<=> ]
( pvE 0.1 in0 < testfile.txt | mawk2 '__[$_]++' FS='\n' )
783.31s user 15.51s system 100% cpu 13:12.78 total
5e5f8bbee08c088c0c4a78384b3dd328 stdin