I am extracting strings from an XML file, and even though it should be pure UTF-8, it is not. My idea was to test it like this:
#!/usr/bin/perl
use warnings;
use strict;
use Encode qw(decode encode);
use Data::Dumper;
my $x = "m\x{e6}gtig";
my $y = "m\x{c3}\x{a6}gtig";
my $a = encode('UTF-8', $x);
my $b = encode('UTF-8', $y);
print Dumper $x;
print Dumper $y;
print Dumper $a;
print Dumper $b;
if ($x eq $y) { print "1\n"; }
if ($x eq $a) { print "2\n"; }
if ($a eq $y) { print "3\n"; }
if ($a eq $b) { print "4\n"; }
if ($x eq $b) { print "5\n"; }
if ($y eq $b) { print "6\n"; }
outputs
$VAR1 = 'm�gtig';
$VAR1 = 'mægtig';
$VAR1 = 'mægtig';
$VAR1 = 'mægtig';
3
I was working under the assumption that only a latin1 string would get longer when encoded, but encoding a string that is already UTF-8 also makes it longer. So I can't tell latin1 from UTF-8 that way.
Question
I would like to always end up with a UTF-8 string, but how can I detect whether a string is latin1 or UTF-8, so that I only convert the latin1 strings?
Being able to get a yes/no answer on whether a string is UTF-8 would be just as useful.
Due to some properties of UTF-8, it's very unlikely that text encoded using iso-8859-1 would be valid UTF-8 unless it decodes identically using both encodings (see Notes).
As such, the solution is to try decoding it using UTF-8. If it fails, decode it using iso-8859-1 instead. Since decoding using iso-8859-1 is a no-op, I'll be skipping that step.
utf8:: implementation:
my $decoded_text = $utf8_or_latin1;
utf8::decode($decoded_text);
Encode:: implementation:
use Encode qw( decode_utf8 );
my $decoded_text =
    eval { decode_utf8($utf8_or_latin1, Encode::FB_CROAK|Encode::LEAVE_SRC) }
    // $utf8_or_latin1;
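As a minimal sketch of how this answers the yes/no part of the question, the utf8:: version can be wrapped in two small helpers and run against the two strings from the question. The names looks_like_utf8 and decode_utf8_or_latin1 are made up for illustration; the sketch relies only on utf8::decode returning false and leaving the string unchanged when the bytes are not well-formed UTF-8.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the byte string is well-formed UTF-8.
# It works on a copy, so the caller's string is left untouched.
sub looks_like_utf8 {
    my ($bytes) = @_;
    return utf8::decode($bytes) ? 1 : 0;
}

# Hypothetical helper: decode as UTF-8 when possible, otherwise treat the
# bytes as iso-8859-1, which needs no transformation in Perl.
sub decode_utf8_or_latin1 {
    my ($bytes) = @_;
    utf8::decode($bytes);    # leaves $bytes unchanged on failure
    return $bytes;
}

my $latin1 = "m\x{e6}gtig";          # "mægtig" as iso-8859-1 bytes
my $utf8   = "m\x{c3}\x{a6}gtig";    # "mægtig" as UTF-8 bytes

for my $input ($latin1, $utf8) {
    my $text = decode_utf8_or_latin1($input);
    printf "%s input, %d characters after decoding\n",
        (looks_like_utf8($input) ? "UTF-8" : "latin1"), length $text;
}
# prints:
# latin1 input, 6 characters after decoding
# UTF-8 input, 6 characters after decoding
```

Both inputs end up as the same six-character decoded string, which is the point of decoding before comparing.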
Now, you say you want UTF-8. UTF-8 is obtained from encoding decoded text.
utf8:: implementation:
my $utf8 = $decoded_text;
utf8::encode($utf8);
Encode:: implementation:
use Encode qw( encode_utf8 );
my $utf8 = encode_utf8($decoded_text);
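Putting the decode and encode steps together, a rough end-to-end sketch might look like the following. The helper name to_utf8_bytes is hypothetical, and the sketch assumes the input is a byte string that is either valid UTF-8 or iso-8859-1, as in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: take bytes that are either UTF-8 or iso-8859-1
# and return UTF-8 bytes.
sub to_utf8_bytes {
    my ($bytes) = @_;
    utf8::decode($bytes);    # step 1: decode (no change if not valid UTF-8)
    utf8::encode($bytes);    # step 2: encode the decoded text as UTF-8
    return $bytes;
}

my $x = "m\x{e6}gtig";          # latin1 bytes from the question
my $y = "m\x{c3}\x{a6}gtig";    # UTF-8 bytes from the question

print "same\n" if to_utf8_bytes($x) eq to_utf8_bytes($y);

# Show the resulting bytes in hex.
print join(' ', map { sprintf '%02X', ord } split //, to_utf8_bytes($x)), "\n";
# prints:
# same
# 6D C3 A6 67 74 69 67
```

Unlike the encode-only attempt in the question, the string that was already UTF-8 is not encoded a second time, because it is decoded first.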
Notes
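Assuming the text is either valid UTF-8 or valid iso-8859-1, my solution would only guess wrong if all of the following are true: the text is encoded using iso-8859-1 (as opposed to UTF-8); it contains at least one byte in <80>..<FF>; and every such byte happens to form part of a well-formed UTF-8 sequence, meaning a lead byte followed by exactly the right number of continuation bytes in <80>..<BF>.
(<80>..<9F> are unassigned or unprintable control characters in iso-8859-1, not sure which, so they are unlikely to appear in real text.)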
In other words, that code is very reliable.
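To make the rare failure case concrete, here is a small sketch with a hypothetical guess_encoding helper. The two bytes C3 A6 are the iso-8859-1 text "Ã¦" but also the UTF-8 encoding of "æ", so they are guessed to be UTF-8, whereas an ordinary latin1 word such as "café" is not valid UTF-8 and is guessed correctly.

```perl
use strict;
use warnings;

# Hypothetical helper: report which encoding the approach above would assume.
sub guess_encoding {
    my ($bytes) = @_;
    return utf8::decode($bytes) ? 'UTF-8' : 'iso-8859-1';
}

my $ambiguous = "\x{c3}\x{a6}";    # iso-8859-1 "Ã¦", and also UTF-8 "æ"
my $typical   = "caf\x{e9}";       # iso-8859-1 "café"; E9 is not followed by continuation bytes

printf "%s\n", guess_encoding($ambiguous);    # prints: UTF-8
printf "%s\n", guess_encoding($typical);      # prints: iso-8859-1
```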