Perl substr based on bytes

Question

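I am using SimpleDB for my application. Everything works well, except that a single attribute is limited to 1024 bytes. So for a long string, I have to cut the string into chunks and save them separately.

My problem is that my strings sometimes contain Unicode characters (Chinese, Japanese, Greek), and the substr() function counts characters, not bytes.

I tried `use bytes` to get byte semantics, and later substr(encode_utf8($str), $start, $length), but neither helped at all.

Any help would be appreciated.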

Answer 1

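UTF-8 was designed so that character boundaries are easy to detect: every continuation byte falls in the range 0x80-0xBF. To split the string into chunks of valid UTF-8, you can simply use the following:

use Encode qw( encode_utf8 decode_utf8 );

# Take up to 1024 bytes, but never stop right before a continuation byte.
my $utf8 = encode_utf8($text);
my @utf8_chunks = $utf8 =~ /\G(.{1,1024})(?![\x80-\xBF])/sg;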

Then either

# The saving code expects bytes.
store($_) for @utf8_chunks;

or

# The saving code expects decoded text.
store(decode_utf8($_)) for @utf8_chunks;

Demonstration:

$ perl -e'
    use Encode qw( encode_utf8 );

    # This character encodes to three bytes using UTF-8.
    my $text = "\N{U+2660}" x 342;

    my $utf8 = encode_utf8($text);
    my @utf8_chunks = $utf8 =~ /\G(.{1,1024})(?![\x80-\xBF])/sg;

    CORE::say(length($_)) for @utf8_chunks;
'
1023
3
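The same idea can be wrapped in a reusable routine and checked with a round trip. This is only a sketch: the name utf8_chunks, the 16-byte limit, and the sample text are made up for illustration, and it assumes the limit is at least 4 bytes (the longest UTF-8 sequence for a single code point), since otherwise the pattern cannot match and the result would be silently cut short.

```perl
use strict;
use warnings;
use Encode qw( encode_utf8 decode_utf8 );

# Hypothetical helper (not part of Encode): split $text into UTF-8 byte
# strings of at most $max_bytes bytes each, never cutting a character.
# Assumes $max_bytes >= 4.
sub utf8_chunks {
    my ($text, $max_bytes) = @_;
    my $utf8 = encode_utf8($text);
    # Same pattern as above, with the limit interpolated.
    my @chunks = $utf8 =~ /\G(.{1,$max_bytes})(?![\x80-\xBF])/sg;
    return @chunks;
}

# Mixed 1-, 2- and 3-byte characters (11 bytes per repetition).
my $text   = "abc\N{U+03B1}\N{U+4E2D}\N{U+2660}" x 50;
my @chunks = utf8_chunks($text, 16);

for my $chunk (@chunks) {
    die "chunk too long" if length($chunk) > 16;
    # FB_CROAK dies on a malformed or truncated sequence; it also
    # consumes its input, so decode a copy.
    my $copy = $chunk;
    decode_utf8($copy, Encode::FB_CROAK);
}

# Joining the decoded chunks should give back the original text.
my $rebuilt = join '', map { decode_utf8($_) } @chunks;
print $rebuilt eq $text ? "round trip ok\n" : "round trip FAILED\n";
```

Because every chunk is valid UTF-8 on its own, the chunks can be stored independently and decoded in any order before being joined.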

Answer 2

substr operates on 1-byte characters unless the string has the UTF-8 flag on. So this gives you the first 1024 bytes of the UTF-8 encoding of a decoded string:

substr encode_utf8($str), 0, 1024;

However, this does not necessarily split the string on a character boundary. To discard any partial character at the end, you can use:

$str = decode_utf8($str, Encode::FB_QUIET);
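Put together, the two steps can drive a loop that chunks a whole string. The following is a sketch under assumptions, not the answerer's code: the name chunk_text_by_bytes is hypothetical, and it relies on the documented Encode behaviour that, with FB_QUIET, decoding stops at the first bad sequence and the source variable is left holding the unprocessed bytes.

```perl
use strict;
use warnings;
use Encode qw( encode_utf8 decode_utf8 );

# Hypothetical helper: split decoded $text into decoded chunks whose
# UTF-8 encoding is at most $max_bytes bytes each.
sub chunk_text_by_bytes {
    my ($text, $max_bytes) = @_;
    my $bytes = encode_utf8($text);
    my @chunks;
    while (length $bytes) {
        # Cut by bytes; this may split the last character.
        my $piece = substr($bytes, 0, $max_bytes);
        my $taken = length $piece;
        # FB_QUIET returns what decoded cleanly and leaves the
        # undecoded tail (the partial character) in $piece.
        my $decoded = decode_utf8($piece, Encode::FB_QUIET);
        my $used    = $taken - length $piece;
        die "limit too small for one character" if $used == 0;
        push @chunks, $decoded;
        # Drop only the bytes that were decoded; the partial
        # character starts the next chunk.
        substr($bytes, 0, $used, '');
    }
    return @chunks;
}

my @chunks = chunk_text_by_bytes("\N{U+4E2D}\N{U+6587}" x 400, 1024);
print scalar(@chunks), " chunks\n";
```

Unlike the regex approach, this returns decoded text, so each chunk has to be encoded again if the storage layer expects bytes.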

Answer 1: score 5, accepted, 2012-04-24 17:42:52

Answer 2: score 1, 2012-04-24 17:24:31
