How can I use Perl to convert a batch of files from ISO-8859-1 to UTF-8?

Question

I have several documents I need to convert from ISO-8859-1 to UTF-8 (without the BOM, of course). The problem is that I have so many of these documents (actually a mix, some UTF-8 and some ISO-8859-1) that I need an automated way of converting them. Unfortunately, I only have ActivePerl installed and don't know much about encoding in that language. I may be able to install PHP, but I am not sure, as this is not my personal computer.

Just so you know, I use SciTE or Notepad++, but neither converts correctly. For example, if I open a Czech document that contains the character "ž" and choose the "Convert to UTF-8" option in Notepad++, it turns the character into something unreadable.

There is a way I CAN convert them, but it is tedious. If I open the document with the special characters, copy its contents to the Windows clipboard, paste them into a UTF-8 document and save it, the result is okay. Opening every file and copying/pasting into a new document is too tedious for the number of documents I have.

Any ideas? Thanks!!!

Answer 1

If the character 'ž' is included then the encoding is definitely not ISO-8859-1 ("Latin 1") but is probably CP1252 ("Win Latin 1"). Dealing with a mix of UTF-8, ISO-8859-1 and CP1252 (possibly even in the same file) is exactly what the Encoding::FixLatin Perl module is designed for.

You can install the module from CPAN by running this command:

perl -MCPAN -e "install 'Encoding::FixLatin'"

You could then write a short Perl script that uses the Encoding::FixLatin module, but there's an even easier way. The module comes with a command called fix_latin, which reads mixed encodings on standard input and writes UTF-8 on standard output. So you could use a command line like this to convert one file:

fix_latin <input-file.txt >output-file.txt

If you're running Windows then the fix_latin command might not be in your path and might not have been run through pl2bat, in which case you'd need to do something like:

perl C:\perl\bin\fix_latin.pl <input-file.txt >output-file.txt

The exact paths and filenames would need to be adjusted for your system.

Running fix_latin across a whole bunch of files would be trivial on a Linux system, but on Windows you'd probably need to use PowerShell or similar.
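For reference, the Linux-style loop could look roughly like the sketch below (it should also work in a bash from cygwin or similar). The function name fix_latin_all, the .utf8 suffix and the FIX_LATIN_CMD variable are made up for this illustration; FIX_LATIN_CMD only exists so that the command can be swapped for something like the "perl C:/perl/bin/fix_latin.pl" form if fix_latin is not in your path.

```shell
# Run fix_latin over every file passed as an argument and write the
# UTF-8 result next to the original as <name>.utf8.
# FIX_LATIN_CMD is a hypothetical override for how the command is invoked.
fix_latin_all() {
    for f in "$@"; do
        [ -f "$f" ] || continue    # skip directories and unmatched globs
        # Left unquoted on purpose so "perl /path/to/fix_latin" splits into words.
        ${FIX_LATIN_CMD:-fix_latin} < "$f" > "$f.utf8" || echo "failed: $f" >&2
    done
}

# Example: fix_latin_all *.txt
```

Writing to a new file instead of overwriting the input keeps the originals around until you have checked the results.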

Answer 2

I'm not sure if this is a valid answer to your particular question, but have you looked at the GNU iconv tool? It's fairly widely available.
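As a rough illustration, converting a single file with iconv might look like the sketch below. The helper name to_utf8 is invented here, and the CP1252 default is only an assumption based on the "ž" in the question (in CP1252 that character is the single byte 0x9E, which becomes the two bytes C5 BE in UTF-8). iconv does not add a BOM when writing UTF-8.

```shell
# Convert one file to UTF-8 from a named source encoding.
# Usage: to_utf8 <input> <output> [source-encoding]
# to_utf8 is a hypothetical helper; CP1252 as the default is an assumption.
to_utf8() {
    src=$1
    dst=$2
    from=${3:-CP1252}
    iconv -f "$from" -t UTF-8 "$src" > "$dst"
}

# Example: to_utf8 input-file.txt output-file.txt
```

Running "iconv -l" lists the encoding names your build of iconv accepts.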

Answer 3

If you have access to cygwin or are able to download a couple of common *nix tools (you'll need bash, grep, iconv and file, all of which are available for Windows via, say, gnuwin32), you might be able to write a rather simple shell script that does the job.

The script would look approximately as follows:

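for f in *
do
   # Quote "$f" so filenames containing spaces are handled correctly
   if file "$f" | grep -q 'ISO-8859'
   then
      # Write the converted copy next to the original
      iconv -f iso-8859-1 -t utf-8 "$f" > "$f.converted"
   else
      echo "Not converting $f"
   fi
done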

You'll need to test the steps though; e.g., I'm not sure exactly what "file" would report for an ISO-8859 document.
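If the output of "file" turns out to be hard to rely on, one alternative (a different technique from the script above, not a refinement of it) is to let iconv itself decide: a file that converts cleanly from UTF-8 to UTF-8 is already valid UTF-8, and anything else is assumed to be ISO-8859-1. The function names is_utf8 and convert_if_needed are invented for this sketch.

```shell
# Succeeds if the file is already valid UTF-8 (plain ASCII counts as valid).
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Convert only the files that are not valid UTF-8.
# Assumption: every such file is ISO-8859-1; adjust -f if yours are CP1252.
convert_if_needed() {
    for f in "$@"; do
        [ -f "$f" ] || continue
        if is_utf8 "$f"; then
            echo "Not converting $f"
        else
            iconv -f ISO-8859-1 -t UTF-8 "$f" > "$f.converted"
        fi
    done
}

# Example: convert_if_needed *.txt
```

This check cannot tell ISO-8859-1 apart from CP1252 or other single-byte encodings, so the source encoding still has to be chosen by hand.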

3 answers

Answer 1: 5, accepted, 2010-04-18 02:20:55
Answer 2: 1, 2010-04-17 00:17:40
Answer 3: 1, 2010-04-17 00:21:18