How to detect latin1 and UTF-8?

Question

I am extracting strings from an XML file, and even though it should be pure UTF-8, it is not. My idea was to

#!/usr/bin/perl
use warnings;
use strict;
use Encode qw(decode encode);
use Data::Dumper;

my $x = "m\x{e6}gtig";
my $y = "m\x{c3}\x{a6}gtig";

my $a = encode('UTF-8', $x);
my $b = encode('UTF-8', $y);

print Dumper $x;
print Dumper $y;
print Dumper $a;
print Dumper $b;

if ($x eq $y) { print "1\n"; }
if ($x eq $a) { print "2\n"; }
if ($a eq $y) { print "3\n"; }
if ($a eq $b) { print "4\n"; }
if ($x eq $b) { print "5\n"; }
if ($y eq $b) { print "6\n"; }

outputs

$VAR1 = 'm�gtig';
$VAR1 = 'mægtig';
$VAR1 = 'mægtig';
$VAR1 = 'mÃ¦gtig';
3

My reasoning was that only a latin1 string would get longer when encoded, but encoding a string that is already UTF-8 also makes it longer. So I can't detect latin1 vs UTF-8 that way.

Question

I would like to always end up with a UTF-8 string, but how can I detect whether a string is latin1 or UTF-8, so that I only convert the latin1 strings?

Being able to get a yes/no on whether a string is UTF-8 would be just as useful.

Answer 1

Due to some properties of UTF-8, it's very unlikely that text encoded using iso-8859-1 would be valid UTF-8 unless it decodes identically using both encodings [1].

As such, the solution is to try decoding it using UTF-8. If it fails, decode it using iso-8859-1 instead. Since decoding using iso-8859-1 is a no-op, I'll be skipping that step.

utf8:: implementation:

 my $decoded_text = $utf8_or_latin1;
 utf8::decode($decoded_text);

Encode:: implementation:

 use Encode qw( decode_utf8 );
 my $decoded_text =
    eval { decode_utf8($utf8_or_latin1, Encode::FB_CROAK|Encode::LEAVE_SRC) }
    // $utf8_or_latin1;
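For the yes/no check asked about in the question, the same idea can be wrapped in a small predicate. This is only a sketch: `looks_like_utf8` is a hypothetical helper name, not part of any module, and it relies on `utf8::decode` returning false when its argument is not well-formed UTF-8.

```perl
use strict;
use warnings;

# Hypothetical helper (name is an assumption): true if the byte string
# is well-formed UTF-8, false otherwise.
sub looks_like_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;    # utf8::decode works in place, so test a copy
    return utf8::decode($copy) ? 1 : 0;
}

print looks_like_utf8("m\xC3\xA6gtig") ? "yes\n" : "no\n";   # UTF-8 bytes of "mægtig"
print looks_like_utf8("m\xE6gtig")     ? "yes\n" : "no\n";   # latin1 bytes of "mægtig"
```

Pure ASCII input also counts as valid UTF-8 here, which is harmless because ASCII text is byte-for-byte identical in both encodings.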

Now, you say you want UTF-8. UTF-8 is obtained by encoding decoded text.

utf8:: implementation:

 my $utf8 = $decoded_text;
 utf8::encode($utf8);

Encode:: implementation:

 use Encode qw( encode_utf8 );
 my $utf8 = encode_utf8($decoded_text);
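Putting the decode and encode steps together, a minimal end-to-end sketch might look like the following. The helper name `to_utf8_bytes` is an assumption made for illustration, and the sketch spells out the iso-8859-1 fallback explicitly rather than skipping it.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Hypothetical helper (name is an assumption): accepts bytes that are
# either UTF-8 or iso-8859-1 and returns UTF-8 encoded bytes.
sub to_utf8_bytes {
    my ($bytes) = @_;
    # Try strict UTF-8 first. If decoding croaks, eval returns undef and
    # we fall back to iso-8859-1, where each byte maps to one code point.
    my $text = eval { decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC) }
            // decode('iso-8859-1', $bytes);
    return encode('UTF-8', $text);
}

my $from_latin1 = to_utf8_bytes("m\xE6gtig");       # latin1 input
my $from_utf8   = to_utf8_bytes("m\xC3\xA6gtig");   # UTF-8 input
print $from_latin1 eq $from_utf8 ? "same\n" : "different\n";   # prints "same"
```

The `//` operator requires Perl 5.10 or later.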

Notes

[1] Assuming the text is either valid UTF-8 or valid iso-8859-1, my solution would only guess wrong if all of the following are true:
- The text is encoded using iso-8859-1 (as opposed to UTF-8),
- At least one of [
  <80><81><82><83><84><85><86><87><88><89><8A><8B><8C><8D><8E><8F>
  <90><91><92><93><94><95><96><97><98><99><9A><9B><9C><9D><9E><9F>
  <NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿
  ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß
  àáâãäåæçèéêëìíîïðñòóôõö÷
  ] is present,
- All instances of [ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß] are followed by one of [
  <80><81><82><83><84><85><86><87><88><89><8A><8B><8C><8D><8E><8F>
  <90><91><92><93><94><95><96><97><98><99><9A><9B><9C><9D><9E><9F>
  <NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],
- All instances of [àáâãäåæçèéêëìíîï] are followed by two of [
  <80><81><82><83><84><85><86><87><88><89><8A><8B><8C><8D><8E><8F>
  <90><91><92><93><94><95><96><97><98><99><9A><9B><9C><9D><9E><9F>
  <NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],
- All instances of [ðñòóôõö÷] are followed by three of [
  <80><81><82><83><84><85><86><87><88><89><8A><8B><8C><8D><8E><8F>
  <90><91><92><93><94><95><96><97><98><99><9A><9B><9C><9D><9E><9F>
  <NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],
- None of [øùúûüýþÿ] are present, and
- None of [
  <80><81><82><83><84><85><86><87><88><89><8A><8B><8C><8D><8E><8F>
  <90><91><92><93><94><95><96><97><98><99><9A><9B><9C><9D><9E><9F>
  <NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿
  ] are present except where previously mentioned.
(<80>..<9F> are unassigned in ISO/IEC 8859-1 itself and are the C1 control characters in the IANA iso-8859-1 charset; either way they are unprintable.)
In other words, that code is very reliable.
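To make the failure case concrete, here is a small sketch of an input that satisfies the conditions above. It assumes a hypothetical latin1 text consisting of just the two characters "Ã" and "©"; the same two bytes are also the UTF-8 encoding of "é", so the UTF-8 attempt succeeds and the text is guessed to be UTF-8.

```perl
use strict;
use warnings;

# Latin1 bytes for the two characters "Ã" (C3) and "©" (A9).
# The same two bytes are also the UTF-8 encoding of "é" (U+00E9).
my $ambiguous = "\xC3\xA9";

my $copy = $ambiguous;           # utf8::decode works in place
my $ok   = utf8::decode($copy);  # true: the bytes are well-formed UTF-8

printf "valid UTF-8: %s, characters after decoding: %d, code point: U+%04X\n",
    ($ok ? "yes" : "no"), length($copy), ord($copy);
# prints: valid UTF-8: yes, characters after decoding: 1, code point: U+00E9
```

Real latin1 text rarely pairs an accented capital with one of these symbols, which is why such collisions are uncommon in practice.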

Answer 1: 8, accepted, 2014-04-04 17:02:51