I would like to list the set of characters used in a text file using Linux command-line tools. How can I achieve this? The uniq utility works only on lines.
I'd use od
od -cvAnone -w1
This lists the characters one per line, showing backslash escapes (such as \n) for non-printable ones. Other output formats are available.
So, to list the unique characters:
od -cvAnone -w1 | sort -bu
Or to produce a top-20 histogram:
od -cvAnone -w1 | sort -b | uniq -c | sort -rn | head -n 20
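For convenience, the two pipelines above could be wrapped in small shell functions. This is only a sketch: the names list_chars and top_chars are made up for illustration, and it assumes GNU od, since -w1 (one byte per output line) is a GNU option.

```shell
# Sketch: wrap the od pipelines in two helper functions (names are illustrative).
# Assumes GNU od, because -w1 (one byte per output line) is a GNU option.

# Print each distinct byte of stdin once, in od's escaped notation.
list_chars() {
    od -cvAnone -w1 | sort -bu
}

# Print the N most frequent bytes of stdin with their counts (default 20).
top_chars() {
    od -cvAnone -w1 | sort -b | uniq -c | sort -rn | head -n "${1:-20}"
}

# Usage: feed any text on stdin.
printf 'hello\n' | list_chars
printf 'hello\n' | top_chars 1
```

Note that od works on bytes, so a multibyte UTF-8 character shows up as several separate entries.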
I prefer this way:
awk 'BEGIN{FS=""} {for(i=1;i<=NF;i++){chars[$(i)]=$(i);}} END{for(c in chars){print c;} }'
This is an awk script; awk is useful for processing the output of all sorts of commands.
The script has three parts:
1)
BEGIN{FS=""}
When FS is set to the empty string (FS = ""), each individual character in the record becomes a separate field. (This is a gawk extension; it is not specified by the POSIX standard.)
2)
{for(i=1;i<=NF;i++){chars[$(i)]=$(i);}}
chars is just a one-dimensional associative array ( http://www.gnu.org/software/gawk/manual/html_node/Array-Basics.html#Array-Basics ). Each character is stored in it as both index and value, so duplicates collapse into a single entry.
3)
END{for(c in chars){print c;} }
The final section walks through the whole chars array and just prints its indexes ( http://www.gnu.org/software/gawk/manual/html_node/Scanning-an-Array.html#Scanning-an-Array ).
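Since the empty FS is not specified by POSIX, here is a sketch of the same three-part idea that walks each line with substr() instead, which is standard awk. The function name list_chars_awk is only illustrative, and whether substr() steps over characters or bytes depends on the awk implementation and the locale.

```shell
# Sketch: list distinct characters without relying on FS="" (function name is illustrative).
# substr() and length() are standard awk; newlines are record separators and are never reported.
list_chars_awk() {
    awk '{
        n = length($0)
        for (i = 1; i <= n; i++)
            seen[substr($0, i, 1)] = 1   # the array index is the character itself
    }
    END {
        for (c in seen) print c          # iteration order is unspecified
    }' "$@"
}

# Usage: read from stdin, or pass file names as arguments.
printf 'abca\nbd\n' | list_chars_awk | LC_ALL=C sort
```

The output order of "for (c in seen)" is unspecified, so pipe through sort when a stable order matters.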
PS.
As for @sehe's approach: try it on a relatively large text file. Using an associative array is more than six times faster:
>time od -cvAnone -w1 vector.html.big | sort -bu > /dev/null
real 0m1.597s
user 0m1.619s
sys 0m0.022s
>time awk 'BEGIN{FS=""} {for(i=1;i<=NF;i++){chars[$(i)]=$(i);}} END{for(c in chars){print c;} }' vector.html.big | sort >/dev/null
real 0m0.252s
user 0m0.251s
sys 0m0.002s