How to detect malformed UTF-8 at the end of the file?

I am trying to print a warning when a file that is supposed to contain valid UTF-8 turns out to contain invalid UTF-8. However, if the invalid data is at the end of the file, no warning is emitted. The following MVCE creates a file containing invalid UTF-8 data (creating the file is not relevant to the general question; it is only there to make the example self-contained):

use feature qw(say);
use strict;
use warnings;

binmode STDOUT, ':utf8';
binmode STDERR, ':utf8';

my $bytes = "\x{61}\x{E5}\x{61}";  # 3 bytes in iso 8859-1: aåa
test_read_invalid( $bytes );
$bytes = "\x{61}\x{E5}";  # 2 bytes in iso 8859-1: aå
test_read_invalid( $bytes );

sub test_read_invalid {
    my ( $bytes ) = @_;
    say "Running test case..";
    my $fn = 'test.txt';
    open ( my $fh, '>:raw', $fn ) or die "Could not open file '$fn': $!";
    print $fh $bytes;
    close $fh;
    my $str = '';
    open ( $fh, "<:encoding(utf-8)", $fn ) or die "Could not open file '$fn': $!";
    $str = do { local $/; <$fh> };
    close $fh;
    say "Read string: '$str'\n";
}

The output is:

Running test case..
utf8 "\xE5" does not map to Unicode at ./p.pl line 22.
Read string: 'a\xE5a'

Running test case..
Read string: 'a'

In the last test case, the invalid byte at the end of the file seems to be silently ignored by the PerlIO layer :encoding(utf-8).

Essentially, what you're seeing is the PerlIO system dealing with a block read that ends in the middle of a UTF-8 sequence. The raw byte buffer still holds the invalid byte you are looking for, but the decoded buffer does not: the byte does not decode on its own yet, and the layer is hoping that more input will complete the character. You can check for this by popping the encoding layer off, doing another read, and checking the length of what comes back.

binmode $fh, ':pop';
my $remainder = do { local $/; <$fh>};
die "Unread Characters" if length $remainder;

You may want to have your open layers start with :raw, or do binmode $fh, ':raw' instead of ':pop'; I'm not sure, as I've never paid much attention to the layers themselves since it usually just works. I do know that this code block works for your test case :)
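The idea that the decoder stops at a sequence it cannot finish, leaving the offending bytes behind, can be illustrated without relying on PerlIO buffering. The sketch below is not the layer-popping approach above: it swaps in Encode::decode with the FB_QUIET check on an in-memory byte string, which returns whatever decoded cleanly and overwrites its argument with the unprocessed bytes. The helper name split_at_malformed is made up for this example.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not from the answer above): decode as much of $bytes
# as is valid UTF-8 and return the decoded text plus the undecoded remainder.
sub split_at_malformed {
    my ($bytes) = @_;
    my $buf  = $bytes;    # work on a copy; decode() with FB_QUIET rewrites its argument
    my $text = Encode::decode('UTF-8', $buf, Encode::FB_QUIET);
    return ( $text, $buf );    # $buf now holds only the unprocessed bytes
}

# The two byte strings from the question's test cases.
for my $bytes ( "\x{61}\x{E5}\x{61}", "\x{61}\x{E5}" ) {
    my ( $text, $rest ) = split_at_malformed($bytes);
    printf "decoded %d character(s), %d byte(s) left undecoded\n",
        length $text, length $rest;
}
```

A non-empty remainder means the input was malformed at that point, whether the bad byte sits in the middle of the data or at the very end.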

I'm not sure what you are asking. To detect encoding errors in a string, you can simply attempt to decode the string. As for getting an error from writing to the file, maybe close returns an error, or you can use chomp($_); print($fh "$_\n"); (seeing as Unix text files should always end with a newline anyway).
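A minimal sketch of the "attempt to decode the string" suggestion, assuming it is acceptable to slurp the whole file as raw bytes first. It uses Encode::decode with the FB_CROAK check, which dies on malformed input, including a truncated sequence at the end of the data. The helper name read_utf8_strict is hypothetical, and test.txt is the file name from the question.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: slurp a file as raw bytes, then decode strictly.
# Dies if the bytes are not valid UTF-8, including a truncated final sequence.
sub read_utf8_strict {
    my ($fn) = @_;
    open( my $fh, '<:raw', $fn ) or die "Could not open file '$fn': $!";
    my $bytes = do { local $/; <$fh> };
    close $fh;
    # FB_CROAK makes decode() die instead of substituting or skipping bad bytes.
    return Encode::decode( 'UTF-8', $bytes, Encode::FB_CROAK );
}

# Usage: trap the exception and turn it into a warning.
my $str = eval { read_utf8_strict('test.txt') };
warn "Could not read test.txt as UTF-8: $@" unless defined $str;
```

Because the decoding happens in one call on the complete byte string, there is no buffered partial character for a layer to hold back.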

open ( my $fh, '>:raw', $fn ) or die "Could not open file '$fn': $!";
# The file needs a single trailing space for the invalid UTF-8 byte to be detected.
print $fh "$bytes ";

Output:

Running test case..
utf8 "\xE5" does not map to Unicode at ent.pl line 23.
Read string: 'a\xE5a '

Running test case..
utf8 "\xE5" does not map to Unicode at ent.pl line 23.
Read string: 'a\xE5a '
