How to remove invalid characters from an xml file using sed or Perl

Question

I want to remove all invalid characters (for example, the character with hexadecimal value 0x1A) from an XML file using sed.
What are the regex and the command line?
EDIT
Added Perl tag hoping to get more responses. I prefer a one-liner solution.
EDIT
These are the valid XML characters:

#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

Answer 1

Assuming UTF-8 XML documents:

perl -CSDA -pe'
   s/[^\x9\xA\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+//g;
' file.xml > file_fixed.xml

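If you want to replace the bad characters with numeric character references instead (be aware that XML 1.0 parsers also reject references to characters outside the valid set, so this mainly helps with XML 1.1 or with lenient consumers):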

perl -CSDA -pe'
   s/([^\x9\xA\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}])/
      "&#".ord($1).";"
   /xeg;
' file.xml > file_fixed.xml

You can call it a few different ways:

perl -CSDA     -pe'...' file.xml > file_fixed.xml
perl -CSDA -i~ -pe'...' file.xml     # In place, with backup
perl -CSDA -i  -pe'...' file.xml     # In place, without backup

Answer 2

The tr command would be simpler. So, try something like:

tr -d '\032' < file.xml > file_fixed.xml

Note that the ASCII character 0x1A has the octal value 032, which is the form tr expects: tr accepts octal escapes (\NNN) but not hexadecimal ones. This removes only 0x1A, not every invalid character.
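
The same tr approach can be widened to every ASCII control character that XML 1.0 forbids. This is a sketch rather than a complete validator: the function name strip_xml_controls is hypothetical, and it only covers bytes below 0x20, not the invalid code points above U+D7FF. Deleting these bytes is safe for UTF-8 input because bytes below 0x80 never occur inside a multi-byte sequence.

```shell
# Hypothetical helper: delete every control byte below 0x20 that XML 1.0
# forbids, keeping tab (011), line feed (012) and carriage return (015).
strip_xml_controls() {
    tr -d '\000-\010\013\014\016-\037'
}

# Demo: 0x1A (octal 032) and 0x01 are dropped, the tab and newline survive.
printf 'a\032b\001c\td\n' | strip_xml_controls | od -c
```

Typical use would be strip_xml_controls < file.xml > file_fixed.xml.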

Answer 3

There is actually a way to do this with sed, like so:

cat input_file | LANG=C sed -E \
   -e 's/.*/& /g' \
   -e 's/(('\
'[\x9\xa\xd\x20-\x7f]|'\
'[\xc0-\xdf][\x80-\xbf]|'\
'[\xe0-\xec][\x80-\xbf][\x80-\xbf]|'\
'[\xed][\x80-\x9f][\x80-\xbf]|'\
'[\xee-\xef][\x80-\xbf][\x80-\xbf]|'\
'[\xf0][\x80-\x8f][\x80-\xbf][\x80-\xbf]'\
')*)./\1?/g' \
   -e 's/(.*)\?/\1/g' \
   -e 's|]]>|]]>]]<![CDATA[>|g' > output_file

This works in four steps:

1. Append a single space to the end of every line.
2. Replace every run of legal characters followed by any one character with the same run followed by a question mark (in place of that one character). In a line containing only legal characters, the '.' matches the last character of the line, which is why step 1 adds a space.
3. Remove the last question mark in the line, which we expect to be the final character.
4. Replace the string ']]>' with ']]>]]<![CDATA[>'.

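The LANG=C environment variable keeps sed from interpreting the input in a multibyte locale, so every byte is treated as a single character. If LC_ALL is set in your environment it overrides LANG, in which case use LC_ALL=C instead.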
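
The final substitution can be tried on its own. The sketch below assumes, which the answer does not state, that the cleaned text is destined for the inside of a CDATA section, where a literal ']]>' would end the section early. The function name escape_cdata_end is hypothetical.

```shell
# Hypothetical helper: apply only the last sed expression from the command above.
escape_cdata_end() {
    sed -e 's|]]>|]]>]]<![CDATA[>|g'
}

# Demo: the ']]>' in the input is split across two CDATA sections.
printf 'a]]>b\n' | escape_cdata_end   # prints a]]>]]<![CDATA[>b
```

Wrapped as <![CDATA[a]]>]]<![CDATA[>b]]>, a parser would read the CDATA text 'a', the plain text ']]', and the CDATA text '>b', which together give back 'a]]>b'.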

Answer 4

Try:

perl -pi -e 's/[^\x9\xA\xD\x20-\x{d7ff}\x{e000}-\x{fffd}\x{10000}-\x{10ffff}]//g' file.xml

4 answers

solution1
8 ACCEPTED 2011-10-14 23:10:22

solution2
2 2011-10-14 20:13:02

solution3
0 2018-09-28 13:49:21

solution4
0 2011-10-14 21:38:18
