
unix - print distinct list of control characters in a file

For example, given an input file like the one below:

sid|storeNo|latitude|longitude
2|1|-28.03õ720000
9|2
10
jgn
352|1|-28.03¿720000
9|2|fd¿kjhn422-405
000¥0543210|gf¿djk39
gfd|f¥d||fd

Output (the characters below can appear in any order):

¿õ¥

Does anyone have a function (awk, bash, perl, etc.) that could scan each line and then output a distinct list of the control characters found, in octal, hex, or ASCII (any of these is fine)? For simplicity, treat any character above ASCII 126 as a control character.

Using perl v5.8.8.

To print the bytes in octal:

perl -ne'printf "%03o\n", ord for /[^\x09\x0A\x20-\x7E]/g' file  | sort -u

To print the bytes in hex:

perl -ne'printf "%02X\n", ord for /[^\x09\x0A\x20-\x7E]/g' file  | sort -u

To print the original bytes (using print with -l rather than say, which needs perl 5.10 or later):

perl -nle'print for /[^\x09\x0A\x20-\x7E]/g' file  | sort -u

This should catch everything with an ordinal value over 126, without having to explicitly weed out outliers:

#!/bin/bash

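# Read one character at a time; -r keeps backslashes literal.
while IFS= read -r -n1 c; do
  # The leading quote in "'$c" makes printf output the character's numeric value.
  if (( $(printf '%d' "'$c") > 126 )); then
    printf '%s\n' "$c"
  fi
done < ./infile | sort -u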

Output

¥
¿
õ

To delete everything except the control characters:

tr -d '\0-\176' < input > output

To test:

printf 'foobar\n\377' | tr -d '\0-\176' | od -t c

See the tr(1) man page for details.
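
The tr command on its own keeps every occurrence of the remaining bytes. To get the distinct list the question asks for, its output can be dumped as hex and de-duplicated. This is a rough sketch rather than part of the original answer: the function name distinct_high_bytes is hypothetical, and LC_ALL=C is set on the assumption that byte-wise handling is what is wanted.

```shell
# Hypothetical helper: list the distinct bytes above octal 176 found in
# file "$1", one two-digit hex value per line, sorted.
distinct_high_bytes() (
  export LC_ALL=C                 # treat input as raw bytes, not characters
  tr -d '\000-\176' < "$1" |      # drop everything up to and including "~"
    od -An -v -t x1 |             # dump the remaining bytes as hex
    tr -s ' ' '\n' |              # put one hex value on each line
    sed '/^$/d' |                 # remove the empty lines left by padding
    sort -u                       # keep one copy of each value
)
```

Usage: distinct_high_bytes input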

sed -e 's/[A-Za-z0-9,|.-]//g' -e 's/./&\n/g' file | sort -u

Delete everything you don't want, put each remaining character on its own line, then sort -u the whole kit.

The "\n" in the replacement works in GNU sed; with other seds, use a backslash followed by a literal newline instead.

Unix wins.

