How can I convert a bunch of files from ISO-8859-1 to UTF-8 using Perl?

I have several documents I need to convert from ISO-8859-1 to UTF-8 (without the BOM, of course). The issue is that there are so many of them (it is actually a mix, some UTF-8 and some ISO-8859-1) that I need an automated way of converting them. Unfortunately I only have ActivePerl installed and don't know much about encoding in that language. I may be able to install PHP, but I am not sure, as this is not my personal computer.

Just so you know, I use SciTE or Notepad++, but neither converts correctly. For example, if I open a document in Czech that contains the character "ž" and choose the "Convert to UTF-8" option in Notepad++, it incorrectly converts it to an unreadable character.

There is a way I CAN convert them, but it is tedious. If I open the document with the special characters, copy its contents to the Windows clipboard, then paste them into a UTF-8 document and save it, the result is okay. Opening every file and copying/pasting into a new document is too tedious for the number of documents I have.

Any ideas? Thanks!!!

If the character 'ž' is included then the encoding is definitely not ISO-8859-1 ("Latin 1") but is probably CP1252 ("Win Latin 1"). Dealing with a mix of UTF-8, ISO-8859-1 and CP1252 (possibly even in the same file) is exactly what the Encoding::FixLatin Perl module is designed for.
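To make the point about 'ž' concrete, here is a small sketch. Python is used purely for illustration and is not part of the original answer, and the helper name `can_encode` is made up. It checks which of the two encodings can represent the character and shows what the CP1252 byte turns into when it is misread as ISO-8859-1, which is one possible explanation for the unreadable character described in the question.

```python
# Illustration only: which legacy encodings can represent 'ž' (U+017E)?
def can_encode(ch: str, encoding: str) -> bool:
    """Return True if `ch` has a byte representation in `encoding`."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

z = "\u017e"  # the character ž

print(can_encode(z, "iso-8859-1"))  # ISO-8859-1 has no code point for it
print(can_encode(z, "cp1252"))      # CP1252 stores it as the single byte 0x9E

raw = z.encode("cp1252")
# Reading that byte as ISO-8859-1 produces an invisible C1 control character
# instead of ž.
misread = raw.decode("iso-8859-1")
print(repr(misread))

# Decoding with the right source encoding gives the correct UTF-8 bytes.
print(raw.decode("cp1252").encode("utf-8"))
```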

You can install the module from CPAN by running this command:

perl -MCPAN -e "install 'Encoding::FixLatin'"

You could then write a short Perl script that uses the Encoding::FixLatin module, but there's an even easier way. The module comes with a command called fix_latin which takes mixed encoding on standard input and writes UTF-8 on standard output. So you could use a command line like this to convert one file:

fix_latin <input-file.txt >output-file.txt

If you're running Windows then the fix_latin command might not be in your path and might not have been run through pl2bat, in which case you'd need to do something like:

perl C:\perl\bin\fix_latin.pl <input-file.txt >output-file.txt

The exact paths and filenames would need to be adjusted for your system.
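The idea of turning mixed input into UTF-8 can be sketched in a few lines. The Python below is not the Encoding::FixLatin algorithm. It is a simpler line-by-line approximation that assumes every line is either valid UTF-8 or CP1252, and the function names `line_to_utf8` and `mixed_to_utf8` are made up for this example.

```python
def line_to_utf8(line: bytes) -> bytes:
    """Return `line` as UTF-8, assuming it is either UTF-8 already or CP1252."""
    try:
        line.decode("utf-8")  # valid UTF-8 (including plain ASCII): keep as-is
        return line
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is CP1252.
        # errors="replace" covers the few byte values CP1252 leaves undefined.
        return line.decode("cp1252", errors="replace").encode("utf-8")


def mixed_to_utf8(data: bytes) -> bytes:
    """Convert a byte string whose lines may use different encodings."""
    return b"".join(line_to_utf8(l) for l in data.splitlines(keepends=True))


# One UTF-8 line followed by one CP1252 line.
sample = "caf\u00e9\n".encode("utf-8") + "\u017eena\n".encode("cp1252")
print(mixed_to_utf8(sample).decode("utf-8"))
```

Because lines that are already valid UTF-8 pass through unchanged, running the function twice gives the same result as running it once. The heuristic can be fooled by a CP1252 line whose bytes happen to form valid UTF-8, so the output is worth spot-checking.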

Running fix_latin across a whole bunch of files would be trivial on a Linux system, but on Windows you'd probably need to use PowerShell or similar.

I'm not sure if this is a valid answer to your particular question, but have you looked at the GNU iconv tool? It's fairly generally available.

If you have access to Cygwin or are able to download a couple of common *nix tools (you'll need bash, grep, iconv and file, all of which are available for Windows via, say, gnuwin32), you might be able to write a rather simple shell script that does the job.

The script would approximately look as follows:

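for f in *; do
   # Skip directories and anything else that is not a regular file
   [ -f "$f" ] || continue
   # Convert only files that "file" identifies as ISO-8859 text
   if file "$f" | grep -q 'ISO-8859'; then
      iconv -f iso-8859-1 -t utf-8 < "$f" > "$f.converted"
   else
      echo "Not converting $f"
   fi
done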

You'll need to test the steps though, e.g. I'm not sure exactly what "file" would say for an ISO-8859 document.
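If bash, file and iconv are not available, the same loop can be approximated in Python. This is a minimal sketch under stated assumptions, not a drop-in replacement: it treats any file that is not valid UTF-8 as needing conversion, it assumes CP1252 as the source encoding by default (CP1252 agrees with ISO-8859-1 everywhere except the byte range 0x80 to 0x9F), and the names `needs_conversion` and `convert_directory` are hypothetical.

```python
from pathlib import Path


def needs_conversion(data: bytes) -> bool:
    """True if `data` is not valid UTF-8 (pure ASCII counts as valid UTF-8)."""
    try:
        data.decode("utf-8")
        return False
    except UnicodeDecodeError:
        return True


def convert_directory(directory: str, source_encoding: str = "cp1252") -> list:
    """Write a UTF-8 copy (no BOM) of every non-UTF-8 file in `directory`.

    Returns the names of the files that were converted. Originals are left
    untouched; copies get a ".converted" suffix, as in the shell script.
    """
    converted = []
    for path in sorted(Path(directory).iterdir()):
        # Skip directories and the output of a previous run.
        if not path.is_file() or path.name.endswith(".converted"):
            continue
        data = path.read_bytes()
        if needs_conversion(data):
            text = data.decode(source_encoding, errors="replace")
            target = path.with_name(path.name + ".converted")
            target.write_bytes(text.encode("utf-8"))  # plain UTF-8, no BOM
            converted.append(path.name)
        else:
            print(f"Not converting {path.name}")
    return converted
```

Calling `convert_directory(".")` would process the current directory. The check cannot tell text from binary data, so it is safest to run it on a folder that contains only the documents in question.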

