
How do I find the character encoding for a file?

I have an XML file that does not declare its encoding (charset / character encoding / character set / character map / codeset / code page). This is an example of a declaration that does:

<?xml version="1.0" encoding="UTF-8"?>

The XML is generated by a Perl script; the following is an excerpt:

$fileName = $exportDirectory . $fileName;
open FILE, ">$fileName" or die;

The questions:

  1. In this case, is there an easy way to find the encoding of the generated XML?
  2. The script queries other sources of information (such as an Oracle database) and appends the data to the XML file. Is the charset encoding dictated by the source of information, or by the open file operation?
  3. In general, is there an easy way to find the encoding of an arbitrary file?

I tried to use LibXML:

perl -MXML::LibXML -e 'XML::LibXML->load_xml(location => "2.xml")'
2.xml:1364531: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xBF 0x30 0x39 0x20
female presented in spring 09 due t
^

I hope I have supplied sufficient information. Please let me know if more is needed.

You can use enca or chardet.

You may have to compile enca yourself. As for chardet, there's a chance your distribution's package repository contains a packaged script.

Enca works only for European languages and requires you to tell it which language the file is in. Chardet does worse at differentiating European languages encoded with 8-bit encodings, but performs better with non-European text.
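As a rough illustration of how such a guess can be made, here is a minimal Python sketch. It is not how enca or chardet work internally. The function name guess_encoding and the order of checks are assumptions made for this example: it looks for a byte-order mark, then tries strict UTF-8 decoding, and only then hands the bytes to chardet (a third-party package, assumed to be installed; without it the function simply reports where UTF-8 decoding failed).

```python
import codecs

# Byte-order marks, longest first so UTF-32 LE is not mistaken for UTF-16 LE.
# The codec names "utf-16", "utf-32" and "utf-8-sig" consume the BOM on decode.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(data: bytes):
    """Return (encoding, note); encoding is None when no guess is possible."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, "byte-order mark"
    try:
        data.decode("utf-8")  # strict mode: any invalid sequence raises
        return "utf-8", "decodes cleanly as UTF-8"
    except UnicodeDecodeError as err:
        bad_offset = err.start  # offset of the first invalid byte
    try:
        import chardet  # third-party; assumed installed (pip install chardet)
    except ImportError:
        return None, f"invalid UTF-8 at byte offset {bad_offset}"
    result = chardet.detect(data)  # dict with 'encoding' and 'confidence'
    return result["encoding"], f"chardet confidence {result['confidence']}"


# Plain ASCII is also valid UTF-8, so this is reported as UTF-8.
print(guess_encoding(b"<?xml version='1.0'?><a/>"))

# The four bytes from the parser error: 0xBF cannot start a UTF-8 sequence.
print(guess_encoding(b"\xbf09 "))
```

Any answer from the last step is a statistical guess, not a fact about the file. For what it is worth, the byte 0xBF is the character "¿" in both ISO-8859-1 and Windows-1252, which may be a hint about the data in the question, but the sketch cannot confirm that.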
