
Need to remove the count from the output when using "uniq -c" command

I am trying to read a file and sort it by the number of occurrences of a particular field. Suppose I want to find the most repeated date in a log file: I use the uniq -c option and sort the result in descending order, something like this:

uniq -c | sort -nr 

This produces output like this:

809 23/Dec/2008:19:20

The first field, which is actually the count, is the problem for me. I want to get only the date from the above output, but I am not able to. I tried the cut command and did this:

uniq -c | sort -nr | cut -d' ' -f2 

but this just prints blanks. Can someone please help me get only the date and chop off the count? I want only:

23/Dec/2008:19:20

Thanks

The count from uniq is preceded by spaces unless it has 7 or more digits, so you need to do something like:

uniq -c | sort -nr | cut -c 9-

to get columns (character positions) 9 onwards. Or you can use sed:

uniq -c | sort -nr | sed 's/^.\{8\}//'

or:

uniq -c | sort -nr | sed 's/^ *[0-9]* //'

This second sed option is robust in the face of a repeat count of 10,000,000 or more; if you think that might be a problem, it is probably better than the cut alternative. There are undoubtedly other options available too.
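To see why the original cut -d' ' -f2 attempt prints blanks, and where the fixed-width approach stops working, here is a small sketch. It simulates uniq -c lines with printf instead of running uniq, assuming the count is right-aligned in 7 columns as described above; the variable names and the 9-digit count are made up for illustration.

```shell
# Simulated `uniq -c` lines, assuming the count is right-aligned in 7 columns.
small=$(printf '%7d %s' 809 '23/Dec/2008:19:20')
large=$(printf '%7d %s' 123456789 '23/Dec/2008:19:20')  # too wide for any padding

# Each leading space is a field separator for cut, so field 2 is empty.
cut_field=$(printf '%s\n' "$small" | cut -d' ' -f2)

# Dropping the first 8 characters works while the count fits in 7 columns ...
cut_small=$(printf '%s\n' "$small" | cut -c 9-)
# ... but leaves part of a wider count behind.
cut_large=$(printf '%s\n' "$large" | cut -c 9-)

# Matching the digits themselves does not depend on the width.
sed_small=$(printf '%s\n' "$small" | sed 's/^ *[0-9]* //')
sed_large=$(printf '%s\n' "$large" | sed 's/^ *[0-9]* //')

printf '[%s]\n' "$cut_field" "$cut_small" "$cut_large" "$sed_small" "$sed_large"
```

Under these assumptions, only the sed variant returns the bare date for both lines.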


Caveat: the counts were determined by experimentation on Mac OS X 10.7.3 but using GNU uniq from coreutils 8.3. The BSD uniq -c produced 3 leading spaces before a single-digit count. The POSIX spec says the output from uniq -c shall be formatted as if with:

printf("%d %s", repeat_count, line);

which would not have any leading blanks. Given this possible variance in output formats, the sed script with the [0-9] regex is the most reliable way of dealing with the variability in observed and theoretical output from uniq -c:

uniq -c | sort -nr | sed 's/^ *[0-9]* //'
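As a rough check of that claim, the sketch below feeds the same sed expression three simulated layouts: the 7-column padding observed with GNU uniq, the 3 leading spaces reported for BSD uniq, and the unpadded POSIX format. The lines are generated with printf rather than by real uniq implementations, and strip_count is a hypothetical helper name.

```shell
# Hypothetical helper: remove the count that `uniq -c` puts in front of each line.
strip_count() {
    sed 's/^ *[0-9]* //'
}

# Three simulated layouts for a count of 3 (not real uniq output):
gnu=$(printf '%7d %s\n' 3 'GET /index.html' | strip_count)    # 6 leading spaces
bsd=$(printf '%4d %s\n' 3 'GET /index.html' | strip_count)    # 3 leading spaces
posix=$(printf '%d %s\n' 3 'GET /index.html' | strip_count)   # no leading spaces

printf '%s\n' "$gnu" "$bsd" "$posix"
```

All three results are the original line, including the space inside it.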

Instead of cut -d' ' -f2, try

awk '{$1="";print}'

Maybe you need to remove one more blank at the beginning:

awk '{$1="";print}' | sed 's/^.//'

or entirely with sed, preserving the original whitespace:

sed -r 's/^[^0-9]*[0-9]+//'
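The difference between these variants shows up when the line itself contains repeated spaces: assigning to $1 makes awk rebuild the record with single-space separators, while the sed expression leaves everything after the count untouched, including the separating space. A small sketch with a simulated uniq -c line (the sample text is made up, and -E is used here in place of -r on the assumption that your sed accepts it):

```shell
# Simulated `uniq -c` line whose payload contains two consecutive spaces.
line=$(printf '%7d %s' 2 'a  b')

# awk rebuilds the record: leading space kept, inner spaces squeezed.
awk_out=$(printf '%s\n' "$line" | awk '{$1=""; print}')

# Dropping the first character removes the leftover separator.
awk_trim=$(printf '%s\n' "$line" | awk '{$1=""; print}' | sed 's/^.//')

# sed removes padding and digits only; the separator and inner spaces remain.
sed_out=$(printf '%s\n' "$line" | sed -E 's/^[^0-9]*[0-9]+//')

printf '[%s]\n' "$awk_out" "$awk_trim" "$sed_out"
```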

The following awk may help you here.

awk '{a[$0]++} END{for(i in a){print a[i],i | "sort -k2"}}'  Input_file

Second solution: in case you want the order of the output to match the input rather than being sorted.

awk '!a[$0]++{b[++count]=$0} {c[$0]++} END{for(i=1;i<=count;i++){print c[b[i]],b[i]}}'  Input_file
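Both awk commands above still print the count in front of each line. If only the lines are wanted, ordered by frequency, one possible variation is to let awk write the count followed by a tab, so that cut can drop it without any padding concerns. This is a sketch, not part of the original answer; by_frequency is a hypothetical function name, and the order of lines with equal counts is left to sort.

```shell
# Hypothetical helper: print each distinct line once, most frequent first.
by_frequency() {
    # Count occurrences, emit "count<TAB>line", sort by count, drop the count.
    awk '{ count[$0]++ } END { for (line in count) printf "%d\t%s\n", count[line], line }' "$@" |
        sort -nr |
        cut -f2-
}

# Usage with made-up input: a occurs 3 times, b twice, c once.
printf '%s\n' b a c a b a | by_frequency
```

With this input the function prints a, b and c on separate lines.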

An alternative solution is this:

uniq -c | sort -nr | awk '{print $1, $2}'

You may also easily print a single field.

Use the following (since you use -f2 with cut in your question):

cat file |sort |uniq -c | awk '{ print $2; }'

If you want to work with the count field downstream, the following command will reformat it to a 'pipe friendly' tab-delimited format without the left padding:

 .. | sort | uniq -c | sed -r 's/^ +([0-9]+) /\1\t/'

For the original task it is a bit of overkill, but after reformatting, cut can be used to remove the field, as the OP intended:

 .. | sort | uniq -c | sed -r 's/^ +([0-9]+) /\1\t/' | cut -d $'\t' -f2-
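The -r flag and the \t escape in the replacement are GNU sed features. If that is a concern, a possible variant sticks to basic regular expressions and passes a literal tab through a shell variable. The sketch below is an assumption-laden illustration: count_to_tsv is a hypothetical name and the input line is simulated with printf.

```shell
# A literal tab character, kept in a variable so the sed script stays readable.
tab=$(printf '\t')

# Hypothetical helper: turn "   809 text" into "809<TAB>text".
count_to_tsv() {
    sed "s/^ *\([0-9][0-9]*\) /\1${tab}/"
}

# Simulated `uniq -c` line with a payload that contains a space.
line=$(printf '%7d %s' 809 'GET /index.html')
tsv=$(printf '%s\n' "$line" | count_to_tsv)

# Tab is cut's default delimiter, so no -d option is needed.
count=$(printf '%s\n' "$tsv" | cut -f1)
payload=$(printf '%s\n' "$tsv" | cut -f2-)

printf 'count=%s payload=%s\n' "$count" "$payload"
```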

Add tr -s to the pipe chain to "squeeze" multiple spaces into one space delimiter:

uniq -c | tr -s ' ' | cut -d ' ' -f3

tr is very useful in some obscure places. Unfortunately it doesn't get rid of the first leading space, hence the -f3.
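Two things are worth checking before relying on this: -f3 returns only the first word of a line that contains spaces, and the field number is only right when the count is padded. A short sketch with simulated uniq -c lines (made-up sample text, 7-column padding assumed for the padded case):

```shell
# Simulated padded line; the payload has two spaces between the words.
line=$(printf '%7d %s' 3 'hello  world')

# After squeezing: " 3 hello world" -> field 1 is empty, field 2 is the count.
first_word=$(printf '%s\n' "$line" | tr -s ' ' | cut -d ' ' -f3)

# -f3- keeps the rest of the line, though inner spaces were squeezed too.
whole_line=$(printf '%s\n' "$line" | tr -s ' ' | cut -d ' ' -f3-)

# Without padding the count is field 1, so -f3 picks the wrong word.
unpadded=$(printf '%d %s\n' 3 'hello  world' | tr -s ' ' | cut -d ' ' -f3)

printf '[%s]\n' "$first_word" "$whole_line" "$unpadded"
```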

You could make use of sed to strip both the leading spaces and the numbers printed by uniq -c:

sort file | uniq -c | sed 's/^ *[0-9]* //'

I will illustrate this with an example. Consider a file:

winebottles.mkv
winebottles.mov
winebottles.xges
winebottles.xges~
winebottles.mkv
winebottles.mov
winebottles.xges
winebottles.xges~

The command

sort file | uniq -c | sed 's/^ *[0-9]* //'

would return

winebottles.mkv
winebottles.mov
winebottles.xges
winebottles.xges~

First solution
Just use sort when the number of repetitions does not matter; sort has a unique option, -u:

  • sort -u file
  • sort -u < file

Example:

$ cat > file
a
b
c
a
a
g
d
d
$ sort -u file
a
b
c
d
g

Second solution
If sorting based on repetition is important:

  • sort txt | uniq -c | sort -k1 -nr | sed 's/^ \+[0-9]\+ //g'
  • sort txt | uniq -c | sort -k1 -nr | perl -lpe 's/^ +[\d]+ +//g'

which has this output:

a
d
g
c
b
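Since the question asks for the most repeated value, the same pipeline can be cut down to its first line with head. The sketch below wraps it in a hypothetical most_repeated function and uses the sed expression without \+, on the assumption that portability beyond GNU sed matters; which value wins a tie is left to sort.

```shell
# Hypothetical helper: print the most frequent line of the given files or stdin.
most_repeated() {
    sort "$@" | uniq -c | sort -nr | sed 's/^ *[0-9]* //' | head -n 1
}

# Same sample data as in the example above: a appears three times.
top=$(printf '%s\n' a b c a a g d d | most_repeated)
printf '%s\n' "$top"
```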
