Reading a UTF-8 encoded file after a seek, as in
open(FILE, '<:utf8', $file) or die; seek(FILE, $readFrom, 0); read(FILE, $_, $size);
sometimes "breaks up" a Unicode character, so the beginning of the string that was read is not valid UTF-8.
If you then do, e.g., s{^([^\n]*\r?\n)}{}i
to strip the incomplete first line, you get "Malformed UTF-8 character (fatal)" errors.
How to fix this?
One solution, listed in "How do I sanitize invalid UTF-8 in Perl?", is to remove all invalid UTF-8 chars:
tr[\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}][]cd;
However, searching the entire string seems like overkill, since only the first byte(s) of the read string can be broken.
Can anyone suggest a way to strip only an initial invalid char (or make the above substitution not die on malformed UTF-8)?
Read the stream as bytes, strip out partial characters at the start, determine where the last full character ends, then decode what's left.
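use Encode qw( STOP_AT_PARTIAL );
use Fcntl qw( SEEK_SET );
my $encoding = Encode::find_encoding('UTF-8');
# Open in raw mode so that seek/read work on bytes, not characters.
open(my $FILE, '<:raw', $file) or die $!;
seek($FILE, $readFrom, SEEK_SET) or die $!;
my $bytes_read = read($FILE, my $buf, $size);
defined($bytes_read) or die $!;
# Drop leading continuation bytes, i.e. the tail of a character cut by the seek.
$buf =~ s/^[\x80-\xBF]+//;
# Decode the complete characters; an incomplete trailing sequence is left in $buf.
my $str = $encoding->decode($buf, STOP_AT_PARTIAL);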
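To see what the two key lines do without touching a file, here is a minimal in-memory sketch. The sample string and the window offsets are made up for illustration; the window deliberately starts inside one character and ends inside another.

```perl
use strict;
use warnings;
use Encode qw( encode STOP_AT_PARTIAL );

my $encoding = Encode::find_encoding('UTF-8');

# Illustrative sample: U+00E9, U+20AC, U+00E9 encodes to the bytes
# C3 A9 | E2 82 AC | C3 A9
my $bytes = encode('UTF-8', "\x{E9}\x{20AC}\x{E9}");

# Simulate a read window that starts mid-character and ends mid-character:
# bytes A9 E2 82 AC C3
my $buf = substr($bytes, 1, 5);

# Remove leading continuation bytes (here the stray A9).
$buf =~ s/^[\x80-\xBF]+//;

# Decode the complete characters; the lone lead byte C3 stays behind in $buf.
my $str = $encoding->decode($buf, STOP_AT_PARTIAL);

printf "decoded %d character(s), %d byte(s) left over\n",
    length($str), length($buf);
```

The bytes left in $buf are what the next read should be appended to.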
If you want to read more, use the 4-arg form of read, and don't skip anything at the start this time.
$bytes_read = read($FILE, $buf, $size, length($buf));
defined($bytes_read) or die $!;
$str .= $encoding->decode($buf, STOP_AT_PARTIAL);
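Putting the pieces together, the approach might be wrapped up roughly as follows. This is only a sketch: the helper name read_utf8_from and its parameters are hypothetical, and it assumes the file itself is valid UTF-8, so the only broken sequences are the ones created by the seek offset and the chunk boundaries.

```perl
use strict;
use warnings;
use Encode qw( STOP_AT_PARTIAL );
use Fcntl qw( SEEK_SET );

# Hypothetical helper: decode everything from byte offset $offset to EOF,
# reading $chunk_size bytes at a time.
sub read_utf8_from {
    my ($path, $offset, $chunk_size) = @_;
    my $encoding = Encode::find_encoding('UTF-8');

    open(my $fh, '<:raw', $path) or die "Can't open $path: $!";
    seek($fh, $offset, SEEK_SET) or die "Can't seek in $path: $!";

    my $buf      = '';
    my $str      = '';
    my $skipping = 1;    # still at the (possibly broken) start of the data
    while (1) {
        # 4-arg read: append after bytes left undecoded by the previous pass.
        my $bytes_read = read($fh, $buf, $chunk_size, length($buf));
        defined($bytes_read) or die "Can't read $path: $!";
        last if !$bytes_read;

        if ($skipping) {
            # Drop the tail of a character cut in half by the seek.
            $buf =~ s/^[\x80-\xBF]+//;
            # If the whole chunk was continuation bytes, keep skipping.
            $skipping = 0 if length $buf;
        }

        # Decode complete characters; a partial trailing one stays in $buf.
        $str .= $encoding->decode($buf, STOP_AT_PARTIAL);
    }
    close($fh);

    # Bytes still in $buf here would mean the file ends mid-character;
    # this sketch silently ignores them.
    return $str;
}
```

A call such as read_utf8_from($file, $readFrom, 4096) would then return a decoded string that starts at the first complete character at or after $readFrom.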
Related reading: Convert UTF-8 byte stream to Unicode