
Find the most common line in a file in bash

I have a file of strings:

string-string-123
string-string-123
string-string-123
string-string-12345
string-string-12345
string-string-12345-123

How do I retrieve the most common line in bash (string-string-123)?

You can use sort with uniq:

sort file | uniq -c | sort -n -r
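To see what this pipeline produces, here is a small sketch run against a throwaway file. The sample data, the temporary file and the count_lines function name are assumptions made up for illustration, not part of the original answer.

```shell
# Hypothetical sample data: "apple" occurs 3 times, "pear" twice, "fig" once.
sample=$(mktemp)
printf '%s\n' apple pear apple fig pear apple > "$sample"

# Illustrative wrapper: count each distinct line, highest count first.
count_lines() {
  sort "$1" | uniq -c | sort -n -r
}

# Each output line is a count followed by the line text;
# "apple" has the highest count (3), so it comes first.
count_lines "$sample"
```

Every distinct line is listed, so the most common one is simply the first line of the output.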

You could use awk to do this:

awk '{++a[$0]}END{for(i in a)if(a[i]>max){max=a[i];k=i}print k}' file

The array a keeps a count of each line. Once the file has been read, we loop through it and find the line with the maximum count.
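For readability, the same logic can be spelled out over several lines. This is only a sketch: the most_common function name, the longer variable names and the sample data are made up for illustration.

```shell
# Hypothetical sample data: "apple" is the single most common line.
sample=$(mktemp)
printf '%s\n' apple pear apple fig pear apple > "$sample"

most_common() {
  awk '
    { ++count[$0] }                  # tally every input line, keyed by its full text
    END {
      for (line in count)            # visit each distinct line (order is unspecified)
        if (count[line] > max) {     # max starts out unset, which compares as 0
          max = count[line]
          best = line
        }
      print best
    }
  ' "$1"
}

most_common "$sample"
```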

Alternatively, you can skip the loop in the END block by assigning the line during the processing of the file:

awk 'max < ++c[$0] {max = c[$0]; line = $0} END {print line}' file

Thanks to glenn jackman for this useful suggestion.
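A quick way to see how this single-pass version behaves when two lines share the top count is to run it on a small tied input. The first_most_common function name and the sample data below are assumptions for illustration only.

```shell
# Hypothetical sample data: "apple" and "pear" both occur twice.
sample=$(mktemp)
printf '%s\n' apple pear pear apple > "$sample"

first_most_common() {
  awk 'max < ++c[$0] { max = c[$0]; line = $0 } END { print line }' "$1"
}

# Because the comparison is a strict "<", the line that reaches the
# top count first is kept: "pear" hits 2 before "apple" does.
first_most_common "$sample"
```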


It has rightly been pointed out that the two approaches above will only print out one of the most frequently occurring lines in the case of a tie. The following version will print out all of the most frequently occurring lines:

awk 'max<++c[$0] {max=c[$0]} END {for(i in c)if(c[i]==max)print i}' file
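As a rough check of the tie handling, the command can be wrapped in a function and run on input where two lines share the top count. The all_most_common name and the sample data are hypothetical; the extra sort is there only because awk's for (i in c) loop visits keys in an unspecified order.

```shell
# Hypothetical sample data: "apple" and "pear" tie with 2 occurrences each.
sample=$(mktemp)
printf '%s\n' apple pear apple pear fig > "$sample"

all_most_common() {
  awk 'max < ++c[$0] { max = c[$0] }
       END { for (i in c) if (c[i] == max) print i }' "$1"
}

# Sort the result so the output order is predictable.
all_most_common "$sample" | sort
```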
  • Tom Fenech's elegant awk answer works great [in the amended version that prints all most frequently occurring lines in the event of a tie].
    However, it may not be suitable for large files, because every distinct input line is stored in an in-memory associative array, which could be a problem if there are many distinct lines. That said, it's much faster than the approaches discussed below.

  • Grzegorz Żur's answer combines multiple utilities elegantly to implicitly produce the desired result, but:

    • all distinct lines are printed (highest-frequency count first)
    • output lines are prefixed by their occurrence count (which may actually be desirable).

While you can pipe Grzegorz Żur's answer to head to limit the number of lines shown, you can't assume a fixed number of lines in general.

Building on Grzegorz's answer, here's a generic solution that shows all of the most frequently occurring lines, however many there are, and nothing else:

sort file | uniq -c | sort -n -r | awk 'NR==1 {prev=$1} $1!=prev {exit} 1'

If you don't want the output lines prefixed with the occurrence count:

sort file | uniq -c | sort -n -r | awk 'NR==1 {prev=$1} $1!=prev {exit} 1' | 
  sed 's/^ *[0-9]\{1,\} //'

Explanation of Grzegorz Żur's answer:

  • sort file sorts the input so that identical lines end up adjacent, which uniq requires because it only collapses consecutive duplicates.
  • uniq -c outputs each distinct input line once, prefixed with its occurrence count (-c) and a single space; many implementations also left-pad the count with spaces.
  • sort -n -r then sorts the resulting lines numerically (-n), in descending order (-r), so that the most frequently occurring line(s) are at the top.
    • Note that sort, if -k is not specified, uses the entire input line as the sort key, but with -n only the leading numeric string (after any leading blanks) is compared as a number, which is exactly what's needed here.
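The numeric sort step can be tried in isolation on a few hand-written lines. The input below only imitates the shape of uniq -c output (a space-padded count, then the text) and is made up for illustration.

```shell
# Hypothetical uniq -c style lines, deliberately out of order.
counts=$(printf '%s\n' '      2 pear' '     10 apple' '      1 fig')

# -n compares the leading numbers (10 > 2 > 1), -r puts the largest first.
printf '%s\n' "$counts" | sort -n -r
```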

Explanation of my awk command:

  • NR==1 {prev=$1} stores the 1st whitespace-separated field ($1), i.e. the occurrence count, in variable prev for the first input line (NR==1); because the input is sorted in descending order, this is the highest count.
  • $1!=prev {exit} terminates processing as soon as a line's count differs from that highest count: a non-topmost line has been reached, and no more lines need printing.
  • 1 is shorthand for { print }, meaning that the input line at hand should be printed as is.
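The three rules above can also be written out with one rule per line. This is just a sketch: the keep_top function name, the top variable (standing in for prev) and the sample input are illustrative assumptions.

```shell
keep_top() {
  awk '
    NR == 1    { top = $1 }   # remember the highest count (input is sorted descending)
    $1 != top  { exit }       # first line with a lower count: stop reading
               { print }      # otherwise pass the line through unchanged
  '
}

# Hypothetical, already-sorted uniq -c style input with a two-way tie at the top.
printf '%s\n' '      3 apple' '      3 pear' '      1 fig' | keep_top
```

Only the two lines with a count of 3 are printed; the "fig" line triggers the exit.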

Explanation of my sed command:

  • ^ *[0-9]\{1,\} followed by a single space matches the numeric prefix of each output line (any leading spaces, the occurrence count, and the space after it), as (originally) produced by uniq -c
  • applying s/...// means that the prefix is replaced with an empty string, i.e., effectively removed.
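To check the substitution on its own, it can be applied to hand-written lines in the shape of uniq -c output. The strip_count function name and the input lines are assumptions for illustration.

```shell
strip_count() {
  sed 's/^ *[0-9]\{1,\} //'
}

# The padded count and the space after it are removed.
printf '%s\n' '      3 string-string-123' | strip_count

# Only the first match at the start of the line is removed, so a line
# whose own text begins with digits keeps them.
printf '%s\n' '      2 42 is the answer' | strip_count
```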

