In Bash (on Ubuntu), is there a command that removes invalid multibyte (non-ASCII) characters?

I've tried perl -pe 's/[^[:print:]]//g', but it also removes all valid non-ASCII characters.

I can use sed, awk, or similar utilities if needed.
The problem is that Perl doesn't realize your input is UTF-8; it assumes it is operating on a stream of bytes. You can use the -CI flag to tell it to interpret the input as UTF-8. And, since you will then have multibyte characters in your output, you also need to tell Perl to use UTF-8 when writing to standard output, which you can do with the -CO flag. So:

perl -CIO -pe 's/[^[:print:]]//g'
If you want a simpler alternative to Perl, try iconv, as follows:

iconv -c <<<$'Mot\xf6rhead' # -> 'Motrhead'

Specify the input encoding with -f (e.g., -f UTF8) and the output encoding with -t (e.g., -t UTF8); run iconv -l to see all supported encodings. -c simply discards input characters that aren't valid in the input encoding; in the example, \xf6 is the single-byte LATIN1 (ISO8859-1) representation of ö, which is invalid in UTF-8 (where it's represented as \xc3\xb6).

Note (after discovering a comment by the OP): If your output still contains garbled characters:
"?" (question mark) or a box with hex numbers in it

the implication is that the cleaned-up string contains valid UTF-8 characters that the font being used doesn't support.
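To see both behaviors of the iconv approach side by side (the test bytes are just illustrative), it helps to be explicit about the encodings:

```shell
# An invalid byte (\xf6 is Latin-1 ö, not valid UTF-8) is dropped by -c:
printf 'Mot\xf6rhead' | iconv -f UTF-8 -t UTF-8 -c      # -> Motrhead

# The valid UTF-8 encoding of ö (\xc3\xb6) passes through untouched:
printf 'Mot\xc3\xb6rhead' | iconv -f UTF-8 -t UTF-8 -c  # -> Motörhead
```

Converting UTF-8 to UTF-8 is effectively a validity filter: nothing is transcoded, but -c strips whatever can't be decoded.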