
Remove invalid non-ASCII characters in Bash

In Bash (on Ubuntu), is there a command which removes invalid multibyte (non-ASCII) characters?

I've tried perl -pe 's/[^[:print:]]//g' but it also removes all valid non-ASCII characters.

I can use sed, awk, or similar utilities if needed.

The problem is that Perl does not realize that your input is UTF-8; it assumes it's operating on a stream of bytes. You can use the -CI flag to tell it to interpret the input as UTF-8. And, since you will then have multibyte characters in your output, you will also need to tell Perl to write UTF-8 to standard output, which you can do with the -CO flag. So:

perl -CIO -pe 's/[^[:print:]]//g'
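
As a quick sanity check, here is a minimal sketch of the difference the flags make (the sample string is only an illustration, spelled out with the UTF-8 bytes \xc3\xb6 for ö):

printf 'Mot\xc3\xb6rhead\n' | perl -pe 's/[^[:print:]]//g'        # bytes treated individually, ö's two bytes look unprintable -> 'Motrhead'
printf 'Mot\xc3\xb6rhead\n' | perl -CIO -pe 's/[^[:print:]]//g'   # stream decoded as UTF-8, ö is printable -> 'Motörhead'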

If you want a simpler alternative to Perl, try iconv as follows:

iconv -c <<<$'Mot\xf6rhead'  # -> 'Motrhead'
  • Both the input and output encodings default to UTF-8 but can be specified explicitly: the input encoding with -f (e.g., -f UTF8), the output encoding with -t (e.g., -t UTF8); run iconv -l to see all supported encodings (see the sketch after this list).
  • -c simply discards input characters that aren't valid in the input encoding; in the example, \xf6 is the single-byte LATIN1 (ISO8859-1) representation of ö, which is invalid in UTF-8 (where it's represented as \xc3\xb6).
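
If you need to be explicit about the encodings, or want to transcode rather than discard, the same tool covers both; this is only a sketch, and the file names are hypothetical:

iconv -f ISO8859-1 -t UTF-8 latin1.txt > utf8.txt   # convert Latin-1 input to UTF-8 instead of dropping characters
iconv -f UTF-8 -t UTF-8 -c dirty.txt > clean.txt    # the defaults from the example above, spelled out
iconv -l                                            # list every encoding this iconv build supports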

Note (after discovering a comment by the OP): If your output still contains garbled characters:

" (question mark) or ߻ (box with hex numbers in it)"

the implication is indeed that the cleaned-up string contains valid UTF-8 characters that the font being used doesn't support.
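
If you want to confirm that what is left really is valid UTF-8 rather than a rendering problem, one generic way (not part of the original answer) is to inspect the raw bytes, for example with od:

iconv -c <<<$'Mot\xf6rhead' | od -c    # the invalid byte is gone: M   o   t   r   h   e   a   d  \n
printf 'Motörhead\n' | od -c           # a valid ö shows up as the two-byte sequence 303 266 (0xc3 0xb6)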
