I have a bash script that uses Perl's substitution operator to replace a string within all .htm files in a specified directory.
ReplaceString () {
    # -pi.bak edits the file in place and keeps a .bak backup, which is then removed
    perl -pi.bak -e 's/string1/string2/g' "$1"
    rm -f "$1.bak"
}

find "$files_dir" -name '*.htm' | while IFS= read -r line; do
    ReplaceString "$line"
done
The problem is that some of the files contain Unicode characters (e.g. ''). When any such character is present in a file, that file is not processed and no string replacement occurs. When I remove the Unicode characters from the file, the string replacement works.
I am looking for a way to make my program "Unicode aware" so that it can process any file whether it contains Unicode or not.
I've also tried using sed instead of Perl:
sed -i 's/string1/string2/g' "$1"
which gives me the same issue.
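As an aside, one way to confirm what encoding a problem file actually uses is the `file` utility. This is a hedged sketch: `check.htm` is just an illustrative name, and the sample file is built with `iconv` to simulate a Word-generated one.

```shell
# Create a UTF-16 sample (iconv's UTF-16 output includes a BOM) to
# simulate a file saved by MS Word, then ask `file` what it detected.
printf '<p>string1</p>\n' | iconv -f UTF-8 -t UTF-16 > check.htm
file check.htm
```

If the output mentions UTF-16, the file will need conversion (or a Unicode-aware tool) before a byte-oriented substitution can match `string1`.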
Non-working file example (trimmed down):
<html>
<head><meta http-equiv=Content-Type content="text/html; charset=unicode"></head>
<style>
<!--
/* Font definitions (generated by MS Word) */
@list l0:level3
{mso-level-text:;}
-->
</style>
<body>
<p>string1</p>
</body>
</html>
As ikegami and nm pointed out, the .htm files (which were generated by Microsoft Word) were encoded in UTF-16LE. The Perl substitution was not handling this encoding.
I solved the problem by using MS Word to save the non-working files with UTF-8 encoding.