How can I list unique characters used in a text file using linux command line tools?

Question

I would like to list the set of characters used in a text file using Linux command-line tools. How can I achieve this?

The uniq utility works only on lines.

Answer 1

I'd use od

od -cvAnone -w1

This lists the input one character (byte) per line, showing backslash escapes such as \n for non-printable characters. Other output formats are available.

For example, to list the unique characters:

od -cvAnone -w1 | sort -bu
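A minimal sketch of the same pipeline wrapped in a shell function, so it can take file names or standard input. The name unique_chars_od is hypothetical, and forcing LC_ALL=C for sort is an assumption added here so that ordering and uniqueness are decided by byte value rather than by locale collation. Note that -w is specific to GNU od.

```shell
# unique_chars_od is a hypothetical helper name, not part of the answer.
# Reads the named files (or stdin) and prints each distinct byte once.
unique_chars_od() {
    # -c: show bytes as characters or escapes
    # -v: do not collapse repeated output lines
    # -Anone: no address column
    # -w1: one byte per output line (GNU od)
    od -cvAnone -w1 "$@" | LC_ALL=C sort -bu
}

# Usage: the input holds four distinct bytes (a, b, c and a newline)
printf 'abca\n' | unique_chars_od
```

The newline shows up as the escape \n in the output, because od reports every byte of the input, including line terminators.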

Or to produce a top-20 histogram:

od -cvAnone -w1 | sort -b | uniq -c | sort -rn | head -n 20

See it Live On IdeOne

Answer 2

I prefer this way:

awk 'BEGIN{FS=""} {for(i=1;i<=NF;i++){chars[$(i)]=$(i);}} END{for(c in chars){print c;} }'

This is an awk script. awk is useful for processing the output of all sorts of commands.

This script has three parts:

BEGIN, which runs once before processing
END, which runs once after processing
in the middle, a loop that handles each line of input

1)

BEGIN{FS=""}

From the gawk manual ( http://www.gnu.org/software/gawk/manual/html_node/Field-Splitting-Summary.html#Field-Splitting-Summary ):

FS == "" Each individual character in the record becomes a separate field. (This is a gawk extension; it is not specified by the POSIX standard.)
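Since the empty FS is not specified by POSIX, here is a hedged sketch of an alternative that avoids it by walking each line with length() and substr(), which are standard awk functions. The function name unique_chars_posix and the array name seen are hypothetical. Whether a "character" means a byte or a multibyte character depends on the awk implementation and the locale.

```shell
# unique_chars_posix is a hypothetical helper name.
# Prints each distinct character of the input once, in unspecified order.
unique_chars_posix() {
    awk '{
        n = length($0)
        # Visit every character position in the current line
        for (i = 1; i <= n; i++)
            seen[substr($0, i, 1)] = 1
    }
    END {
        # The array indexes are the distinct characters
        for (c in seen) print c
    }' "$@"
}

# Usage: pipe through sort for a stable order
printf 'hello\nworld\n' | unique_chars_posix | LC_ALL=C sort
```

As with the original script, the newline itself is never reported, because awk strips the record separator before the line is examined.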

2)

{for(i=1;i<=NF;i++){chars[$(i)]=$(i);}}

chars is just a one-dimensional associative array ( http://www.gnu.org/software/gawk/manual/html_node/Array-Basics.html#Array-Basics ). Each character is stored in it as both index and value, so a repeated character simply overwrites its own entry.

3)

END{for(c in chars){print c;} }

The final section walks through the whole chars array and prints its indexes, in no particular order ( http://www.gnu.org/software/gawk/manual/html_node/Scanning-an-Array.html#Scanning-an-Array ).
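The same associative array can also hold counts instead of a copy of the character. The following is a sketch of that variation, not part of the original script: it increments an entry per character and prints the counts, most frequent first. The names char_histogram and count are hypothetical, and it relies on the same empty-FS behaviour described in part 1.

```shell
# char_histogram is a hypothetical helper name.
# Prints "<count> <character>" lines, sorted by descending count.
char_histogram() {
    awk 'BEGIN { FS = "" }
         # Each field is a single character; bump its counter
         { for (i = 1; i <= NF; i++) count[$i]++ }
         END { for (c in count) print count[c], c }' "$@" | sort -rn
}

# Usage: show the most frequent character of a small sample
printf 'aab\n' | char_histogram | head -n 1
```

Appending head -n 20 instead gives a top-20 list comparable to the od-based histogram in Answer 1.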

PS.

As for @sehe's approach: on a relatively big text file, using an associative array is more than six times faster:

>time od -cvAnone -w1 vector.html.big | sort -bu > /dev/null

real    0m1.597s
user    0m1.619s
sys     0m0.022s

>time awk 'BEGIN{FS=""} {for(i=1;i<=NF;i++){chars[$(i)]=$(i);}} END{for(c in chars){print c;} }' vector.html.big | sort >/dev/null

real    0m0.252s
user    0m0.251s
sys     0m0.002s
