For example, given an input file like the one below:
sid|storeNo|latitude|longitude
2|1|-28.03õ720000
9|2
10
jgn
352|1|-28.03¿720000
9|2|fd¿kjhn422-405
000¥0543210|gf¿djk39
gfd|f¥d||fd
Output (the characters below can appear in any order):
¿õ¥
Does anyone have a function (awk, bash, Perl, etc.) that could scan each line and then output (in octal, hex, or ASCII; any is fine) a distinct list of the control characters found? For simplicity, "control characters" here means those above ASCII character 126.
I'm using Perl v5.8.8.
To print the bytes in octal:
perl -ne'printf "%03o\n", ord for /[^\x09\x0A\x20-\x7E]/g' file | sort -u
To print the bytes in hex:
perl -ne'printf "%02X\n", ord for /[^\x09\x0A\x20-\x7E]/g' file | sort -u
To print the original bytes (using -l and print, since -E and say need Perl 5.10 or later and the question mentions 5.8.8):
perl -nle'print for /[^\x09\x0A\x20-\x7E]/g' file | sort -u
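This should catch everything over ordinal value 126 without having to explicitly weed out outliers:
#!/bin/bash
# Read the input one character at a time; IFS= and -r keep whitespace and backslashes intact.
while IFS= read -r -n1 c; do
    # printf "%d" "'X" prints the numeric value of the character X.
    if (( $(printf "%d" "'$c") > 126 )); then
        echo "$c"
    fi
done < ./infile | sort -u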
¥
¿
õ
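The loop above relies on a printf feature: when a numeric argument starts with a single quote, printf prints the numeric value of the character that follows. A minimal sketch of that step on its own, where `ord` is a hypothetical helper name and not part of the original answer:

```shell
# ord CHAR: print the numeric value of CHAR (hypothetical helper, for illustration only).
ord() {
    # The leading single quote asks printf for the character's numeric value.
    printf '%d\n' "'$1"
}

ord A      # prints 65
ord '~'    # prints 126, the threshold used in the loop
```

For non-ASCII input, the value printed depends on the shell and the current locale, so it is worth checking on a sample file first.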
To delete everything except the control characters:
tr -d '\0-\176' < input > output
To test:
printf 'foobar\n\377' | tr -d '\0-\176' | od -t c
See tr(1) man page for details.
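The tr command above keeps the high bytes but does not yet produce a distinct list. One possible way to get there is to dump the surviving bytes with od and de-duplicate them. This is a sketch only: `distinct_high_bytes` is a hypothetical name, and it reports individual bytes, so a multi-byte UTF-8 character shows up as several entries.

```shell
# distinct_high_bytes FILE: list each distinct byte above 126 in octal, one per line.
# (Hypothetical helper; not from the original answer.)
distinct_high_bytes() {
    # Drop bytes 0-126, dump the rest as octal, split to one value per line,
    # discard empty lines, then de-duplicate.
    LC_ALL=C tr -d '\0-\176' < "$1" |
        od -An -v -t o1 |
        tr -s ' ' '\n' |
        grep . |
        sort -u
}
```

Usage would be `distinct_high_bytes input`. Setting LC_ALL=C for tr is an assumption meant to keep it working byte by byte regardless of the locale.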
sed -e 's/[A-Za-z0-9,|]//g' -e 's/-//g' -e 's/./&^M/g' | sort -u
Delete everything you don't want, put everything else on its own line, then sort -u the whole kit.
The "&^M" is "&" followed by Ctrl-V followed by Ctrl-M in Bash.
Unix wins.
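If typing a literal control character into the sed script is awkward, the "put everything else on its own line" step can be handed to grep instead. A sketch of the same idea, assuming a grep that supports -o (a GNU and BSD extension, not POSIX); `leftover_chars` is a hypothetical name:

```shell
# leftover_chars [FILE...]: print each distinct character that survives the deletions.
# (Hypothetical helper; reads standard input when no file is given.)
leftover_chars() {
    # Delete the expected characters, emit each remaining character on its own line,
    # then de-duplicate.
    sed -e 's/[A-Za-z0-9,|]//g' -e 's/-//g' "$@" |
        grep -o . |
        sort -u
}
```

Usage would be `leftover_chars file`. Whether "." matches a whole multi-byte character or a single byte depends on the locale grep runs in.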