
How can I list unique characters used in a text file using linux command line tools?

I would like to list the set of characters used in a text file using Linux command line tools. How can I achieve this?

The uniq utility works only on lines.

I'd use od

od -cvAnone -w1

This lists the characters one per line, showing backslash escapes for the non-printable ones. Other output formats are available.


Examples:

So, to list the uniques:

od -cvAnone -w1 | sort -bu
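The command above reads standard input, so it needs a file or a pipe to work on. A minimal sketch of how it could be wrapped, assuming GNU od (the -w option is a GNU extension); the function name list_unique_bytes is made up for illustration, and LC_ALL=C is added only to make the sort order predictable. Note that od -c works on bytes, so a multi-byte UTF-8 character shows up as several octal values rather than as one character.

```shell
# Hypothetical helper: list each distinct byte read from stdin.
# Assumes GNU od, because -w1 is not available in every od.
list_unique_bytes() {
    # -c: printable characters or backslash escapes
    # -v: do not collapse repeated lines into "*"
    # -Anone: no offset column
    # -w1: one byte per output line
    od -cvAnone -w1 | LC_ALL=C sort -bu
}

# Usage: four distinct bytes here (a, b, c and the trailing newline).
printf 'abca\n' | list_unique_bytes
```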

Or to produce a top-20 histogram:

od -cvAnone -w1 | sort -b | uniq -c | sort -rn | head -n 20
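The histogram pipeline can be parameterised the same way. This is only a sketch under the same GNU od assumption; byte_histogram and its optional count argument are hypothetical names, not part of any tool.

```shell
# Hypothetical helper: show the N most frequent bytes on stdin (default 20).
byte_histogram() {
    # sort groups identical lines, uniq -c counts each group,
    # sort -rn puts the largest counts first.
    od -cvAnone -w1 | LC_ALL=C sort -b | uniq -c | sort -rn | head -n "${1:-20}"
}

# Usage: the most frequent byte in "banana" is "a", seen 3 times.
printf 'banana\n' | byte_histogram 1
```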


I prefer this way:

awk 'BEGIN{FS=""} {for(i=1;i<=NF;i++){chars[$(i)]=$(i);}} END{for(c in chars){print c;} }'

This is an awk script. awk is useful for processing the output of all sorts of commands.

The script has three parts:

  • BEGIN, which runs once before processing
  • END, which runs once after processing
  • in the middle, a block with a loop that runs for every input line

1)

BEGIN{FS=""} 

From the gawk manual ( http://www.gnu.org/software/gawk/manual/html_node/Field-Splitting-Summary.html#Field-Splitting-Summary ):

FS == "" Each individual character in the record becomes a separate field. (This is a gawk extension; it is not specified by the POSIX standard.)
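A quick way to see the effect of the empty FS is to print NF, the number of fields, for each line. This is a small sketch that assumes an awk which splits on an empty FS, as gawk does; the function name count_chars_per_line is hypothetical.

```shell
# Hypothetical helper: print how many single-character fields each line has.
# Relies on FS = "" splitting a record into characters, which POSIX does not specify.
count_chars_per_line() {
    awk 'BEGIN { FS = "" } { print NF }'
}

# Usage: with gawk this prints 3 and then 5.
printf 'abc\nhello\n' | count_chars_per_line
```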

2)

{for(i=1;i<=NF;i++){chars[$(i)]=$(i);}}

chars is just a one-dimensional associative array ( http://www.gnu.org/software/gawk/manual/html_node/Array-Basics.html#Array-Basics ). Each character is used as an index into it while the line is processed, so duplicates collapse into a single entry.

3)

END{for(c in chars){print c;} }

The final section walks through the whole chars array and prints just its indexes ( http://www.gnu.org/software/gawk/manual/html_node/Scanning-an-Array.html#Scanning-an-Array ).
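Since FS="" is not specified by POSIX, the same associative-array idea can also be written with substr and length, which any POSIX awk should provide. This is a sketch of a variant, not the script above; the names unique_chars and seen are made up for illustration. Like the original, it never sees the newline itself, because awk strips the record separator.

```shell
# Hypothetical helper: print each distinct character of the input, sorted.
# Reads the files given as arguments, or stdin when there are none.
unique_chars() {
    awk '{
        n = length($0)
        # Use every character of the line as an array index;
        # repeated characters just overwrite the same entry.
        for (i = 1; i <= n; i++) seen[substr($0, i, 1)] = 1
    }
    END { for (c in seen) print c }' "$@" | LC_ALL=C sort
}

# Usage: prints d, e, h, l, o, r, w on separate lines.
printf 'hello\nworld\n' | unique_chars
```

The for (c in seen) loop visits indexes in an unspecified order, which is why the output is piped through sort.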

PS.

As for @sehe's way of processing: on a relatively big text file, using an associative array is more than six times faster:

>time od -cvAnone -w1 vector.html.big | sort -bu > /dev/null

real    0m1.597s
user    0m1.619s
sys     0m0.022s

>time awk 'BEGIN{FS=""} {for(i=1;i<=NF;i++){chars[$(i)]=$(i);}} END{for(c in chars){print c;} }' vector.html.big | sort >/dev/null

real    0m0.252s
user    0m0.251s
sys     0m0.002s
