
Remove invalid non-ASCII characters in Bash

In Bash (on Ubuntu), is there a command which removes invalid multibyte (non-ASCII) characters?

I've tried perl -pe 's/[^[:print:]]//g' but it also removes all valid non-ASCII characters.

I can use sed, awk, or similar utilities if needed.

The problem is that Perl does not realize that your input is UTF-8; it assumes it's operating on a stream of bytes. You can use the -CI flag to tell it to interpret the input as UTF-8. And, since you will then have multibyte characters in your output, you will also need to tell Perl to write UTF-8 to standard output, which you can do with the -CO flag. So:

perl -CIO -pe 's/[^[:print:]]//g'
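
As a quick sanity check, here is a minimal sketch of the difference the flags make (the sample string is only an illustration, spelled out with the UTF-8 bytes \xc3\xb6 for ö):

printf 'Mot\xc3\xb6rhead\n' | perl -pe 's/[^[:print:]]//g'        # bytes treated individually, ö's two bytes look unprintable -> 'Motrhead'
printf 'Mot\xc3\xb6rhead\n' | perl -CIO -pe 's/[^[:print:]]//g'   # stream decoded as UTF-8, ö is printable -> 'Motörhead'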

If you want a simpler alternative to Perl, try iconv as follows:

iconv -c <<<$'Mot\xf6rhead'  # -> 'Motrhead'
  • Both the input and output encodings default to UTF-8 but can be specified explicitly: the input encoding with -f (e.g., -f UTF8), the output encoding with -t (e.g., -t UTF8); run iconv -l to see all supported encodings (see the sketch after this list).
  • -c simply discards input characters that aren't valid in the input encoding; in the example, \xf6 is the single-byte LATIN1 (ISO8859-1) representation of ö, which is invalid in UTF-8 (where it's represented as \xc3\xb6).
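
If you need to be explicit about the encodings, or want to transcode rather than discard, the same tool covers both; this is only a sketch, and the file names are hypothetical:

iconv -f ISO8859-1 -t UTF-8 latin1.txt > utf8.txt   # convert Latin-1 input to UTF-8 instead of dropping characters
iconv -f UTF-8 -t UTF-8 -c dirty.txt > clean.txt    # the defaults from the example above, spelled out
iconv -l                                            # list every encoding this iconv build supports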

Note (after discovering a comment by the OP): If your output still contains garbled characters:

" (question mark) or ߻ (box with hex numbers in it)"

the implication is indeed that the cleaned-up string contains valid UTF-8 characters that the font being used doesn't support.
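
If you want to confirm that what is left really is valid UTF-8 rather than a rendering problem, one generic way (not part of the original answer) is to inspect the raw bytes, for example with od:

iconv -c <<<$'Mot\xf6rhead' | od -c    # the invalid byte is gone: M   o   t   r   h   e   a   d  \n
printf 'Motörhead\n' | od -c           # a valid ö shows up as the two-byte sequence 303 266 (0xc3 0xb6)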
