
How to remove 1 instance of each (identical) line in a text file in Linux?

There is a file:

Mary 
Mary 
Mary 
Mary 
John 
John 
John 
Lucy 
Lucy 
Mark

I need to get:

Mary 
Mary 
Mary 
John 
John 
Lucy

I cannot manage to get the lines ordered according to how many times each line is repeated in the text, i.e. the most frequently occurring lines must be listed first.

If your file is already sorted (most-frequent words at top, repeated words only in consecutive lines) – your question makes it look like that's the case – you could reformulate your problem as: "Skip a word when it is encountered for the first time". Then a possible (and efficient) awk solution would be:

awk 'prev==$0{print}{prev=$0}'

or, if you prefer an approach that looks more familiar if you are coming from other programming languages:

awk '{if(prev==$0)print;prev=$0}'
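For example, applied to the sample file above (assuming it is saved as file.txt; the filename is illustrative):

awk 'prev==$0{print}{prev=$0}' file.txt

prev always holds the previous line, and a line is printed only when it equals the one before it, so the first line of every consecutive run is skipped and each remaining occurrence is kept.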

Partially working solutions below. I'll keep them for reference; maybe they are helpful to somebody else.

If your file is not too big, you could use awk to count identical lines and then output each group the number of times it occurred, minus 1.

awk '
{ lines[$0]++ }                        # count how many times each distinct line occurs
END {
  for (line in lines) {                # note: for-in visits lines in unspecified order
    for (i = 1; i < lines[line]; ++i) {
      print line                       # emit each line one time fewer than it occurred
    }
  }
}
'

Since you mentioned that the most frequent line must come first, you have to sort first:

sort | uniq -c | sort -nr | awk '{count=$1;for(i=1;i<count;++i){$1="";print}}' | cut -c2-

Note that the latter will reformat your lines (e.g. collapsing/squeezing repeated spaces). See "Is there a way to completely delete fields in awk, so that extra delimiters do not print?"
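If you want to keep the lines byte-for-byte intact, a variant (a sketch, using POSIX awk's sub() to strip only the leading count that uniq -c prepends) is:

sort | uniq -c | sort -nr | awk '{n=$1; sub(/^[[:space:]]*[0-9]+ /, ""); for(i=1;i<n;++i) print}'

Here n captures the repeat count before sub() removes the count prefix from $0, so the original spacing of each line is preserved and no trailing cut is needed.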

don't sort for no reason:

 nawk '_[$-__]--'
 gawk '__[$_]++'
 mawk '__[$_]++'
Mary 
Mary 
Mary 
John 
John 
Lucy 
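All three are the same idiom under obfuscated variable names (_ and __ are uninitialized, so $_ and $-__ are both $0): the array entry is zero, hence false, the first time a line is seen and non-zero afterwards, so every line is printed from its second occurrence onward while the input order is left untouched. Spelled out plainly:

awk 'seen[$0]++' file.txt

Unlike the prev-based answer above, this also handles duplicates that are not on consecutive lines, at the cost of keeping one hash key per unique line.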

For 1 GB+ files, you can speed things up a bit by preventing FS from splitting off unnecessary fields:

mawk2 '__[$_]++' FS='\n'

For 100 GB inputs, one idea would be to use parallel to create, say, 10 instances of awk, piping the full 100 GB to each instance but assigning each of them a particular range to partition on at their end

(e.g. instance 4 handles lines beginning with FQ, etc.). But instead of outputting it all and THEN attempting to sort the monstrosity, one could simply have the instances tally up and print out only a frequency report of how many copies ("Nx") of each unique line ("Lx") have been recorded.

From there one could sort a much smaller file along the column holding the Lx's, THEN pipe it to one more awk that would print out Nx copies of each line Lx.
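A minimal sketch of that tally-then-expand idea with a single awk instance (no parallel partitioning; bigfile.txt and the tab-separated report format are illustrative assumptions, and lines are assumed to contain no tabs), printing count-1 copies to match the original question:

awk '{ n[$0]++ }                                      # tally copies of each unique line
     END { for (L in n) printf "%d\t%s\n", n[L], L }' bigfile.txt |
  sort -k2 |                                          # sort the much smaller report by the Lx column
  awk -F'\t' '{ for (i = 1; i < $1; ++i) print $2 }'  # re-expand: Nx-1 copies of each line Lx

Only the frequency report, a few hundred MB at most, ever hits sort; the raw data is read exactly once.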

Probably a lot faster than trying to sort 100 GB.

I created a test scenario by cloning 71 shuffled copies of a raw file with these stats:

 uniq rows = 8125950. | UTF8 chars = 160950688. | bytes = 160950688.

 — 8.12 mn unique rows spanning 154 MB

…resulting in a 10.6 GB test file:

  in0: 10.6GiB 0:00:30 [ 354MiB/s] [ 354MiB/s] [============>] 100%            
  rows = 576942450. | UTF8 chars = 11427498848. | bytes = 11427498848.

Even when using just a single instance of awk, it finished filtering the 10.6 GB in ~13.25 mins, which is reasonable given that it's tracking 8.1 mn unique hash keys.

  in0: 10.6GiB 0:13:12 [13.7MiB/s] [13.7MiB/s] [============>] 100%            
 out9: 10.5GiB 0:13:12 [13.6MiB/s] [13.6MiB/s] [<=> ]

 ( pvE 0.1 in0 < testfile.txt | mawk2 '__[$_]++' FS='\n' )

  783.31s user 15.51s system 100% cpu 13:12.78 total


  5e5f8bbee08c088c0c4a78384b3dd328  stdin
