
Bash/Perl: Unicode processing issue during string substitution in .htm files

I have a bash script that uses Perl's substitution operator to replace a string within all .htm files in a specified directory.

function ReplaceString {
    # In-place substitution; remove any leftover backup file afterwards.
    perl -pi -e 's/string1/string2/g' "$1"
    rm -f "$1.bak"
}

find "$files_dir" -name '*.htm' | while IFS= read -r line; do
    ReplaceString "$line"
done

The problem is that some of the files contain Unicode characters (e.g. ''). When any such character is present in a file, that file is not processed and no string replacement occurs. When I remove the Unicode characters from the file, the string replacement works.

I am looking for a way to make my program "Unicode aware" so that it can process any file whether it contains Unicode or not.

I've also tried using sed instead of Perl:

sed -i 's/string1/string2/g' "$1"

which gives me the same issue.

Non-working file example (trimmed down):

<html>
<head><meta http-equiv=Content-Type content="text/html; charset=unicode"></head>
<style>
     <!-- 
     /* Font definitions (generated by MS Word) */
     @list l0:level3
     {mso-level-text:;}
      -->
</style>
<body>
     <p>string1</p>
</body>
</html>

As ikegami and nm pointed out, the .htm files (which were generated by Microsoft Word) were encoded in UTF-16LE, and the Perl substitution was not handling that encoding.
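
A quick way to confirm the encoding from the shell (not part of my original script; the exact wording of the output varies by platform) is to run the file utility on the affected files:

    # Report the detected encoding of each .htm file; UTF-16LE files
    # typically show up as "Little-endian UTF-16 Unicode text".
    find "$files_dir" -name '*.htm' -exec file {} +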

I solved the problem by using MS Word to save the non-working files with UTF-8 encoding.
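
If re-saving every file from Word is not practical, a minimal sketch of a scripted alternative (assuming iconv is installed; I have not used this variant in my own script) is to convert each UTF-16LE file to UTF-8 before running the substitution:

    function ReplaceStringUtf16 {
        # Hypothetical variant of ReplaceString for UTF-16LE input:
        # convert to UTF-8, substitute, then overwrite the original file.
        local tmp
        tmp=$(mktemp) || return 1
        iconv -f UTF-16LE -t UTF-8 "$1" > "$tmp" &&
            perl -pi -e 's/string1/string2/g' "$tmp" &&
            mv "$tmp" "$1"
    }

Note that this permanently changes the files' encoding to UTF-8, which is also what re-saving from Word does.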
