Perl & MongoDB binary data

From the MongoDB manual:

By default, all database strings are UTF8. To save images, binaries, and other non-UTF8 data, you can pass the string as a reference to the database.

I'm fetching pages and want to store the content for later processing.

  • I cannot rely on the meta charset, because many pages have UTF-8 content but wrongly declare iso-8859-1 or similar
  • so I can't use Encode (I don't know the original charset)
  • therefore, I want to store the content simply as a flow of bytes (binary data) for later processing

Fragment of my code:

sub save {
    my ($self, $ok, $url, $fetchtime, $request ) = @_;

    my $rawhead = $request->headers_as_string;
    my $rawbody = $request->content;

    $self->db->content->insert(
        { "url" => $url, "rhead" => \$rawhead, "rbody" => \$rawbody } ) #using references here
      if $ok;

    $self->db->links->update(
        { "url" => $url },
        {
            '$set' => {
                'status'       => $request->code,
                'valid'        => $ok,
                'last_checked' => time(),
                'fetchtime'    => $fetchtime,
            }
        }
    );
}

But I get this error:

Wide character in subroutine entry at /opt/local/lib/perl5/site_perl/5.14.2/darwin-multi-2level/MongoDB/Collection.pm line 296.

This is the only place where I store data.

The question: is the only way to store binary data in MongoDB to encode it, e.g. with base64?

It looks like another sad story about the _utf8_ flag...

I may be wrong, but it seems that the headers_as_string and content methods of HTTP::Message return their strings as a sequence of characters. But the MongoDB driver expects strings explicitly passed to it as 'binaries' to be a sequence of octets, hence the warning drama.

A rather ugly fix is to take down the utf8 flag on $rawhead and $rawbody in your code (I wonder whether this shouldn't really be done by the MongoDB driver itself), with something like this:

use Encode qw(_utf8_off);

_utf8_off $rawhead;
_utf8_off $rawbody; # ugh
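To see what dropping the flag actually does, here is a minimal sketch using only core modules. The sample string is made up for illustration. Encode's own documentation describes _utf8_off as an internal function; utf8::encode is shown next to it as the documented in-place way to end up with the same octets.

```perl
use strict;
use warnings;
use Encode ();

# Made-up sample: a character string with a code point above 255.
my $chars = "caf\x{e9} \x{263a}";

my $flagged = $chars;
Encode::_utf8_off($flagged);   # drops the flag, keeps the internal UTF-8 bytes

my $encoded = $chars;
utf8::encode($encoded);        # documented in-place conversion to UTF-8 octets

printf "characters before: %d\n", length $chars;
printf "octets after:      %d\n", length $flagged;
printf "utf8 flag now:     %s\n", utf8::is_utf8($flagged) ? "on" : "off";
printf "same as encode:    %s\n", $flagged eq $encoded ? "yes" : "no";
```

Either way the result is a byte string, which should be what the driver wants when it is handed a reference.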

The alternative is to use encode('utf8', $rawhead), but then you should use decode when extracting values from the DB, and I doubt that is any prettier.
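A rough sketch of that encode/decode round trip, assuming the stored value comes back unchanged. The %fake_db hash is a hypothetical stand-in for the collection and the header value is invented. The strict 'UTF-8' encoding name is used here rather than the lax 'utf8' mentioned above.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my %fake_db;   # hypothetical stand-in for the MongoDB collection

my $rawhead = "X-Title: \x{263a}\n";               # invented character string

# Encode to octets before storing.
$fake_db{rhead} = encode('UTF-8', $rawhead);

# Decode back to characters after fetching.
my $restored = decode('UTF-8', $fake_db{rhead});

print "stored value is octets\n" unless utf8::is_utf8($fake_db{rhead});
print "round trip ok\n" if $restored eq $rawhead;
```

The cost is that every reader of the collection has to remember the decode step.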

Your data is characters, not octets. Your assumption seems to be that you are just passing things through as octets, but you must have violated that assumption somewhere earlier by decoding the incoming text data, perhaps without even noticing.

So simply do not decode: the data stays octets, and storing it in the database won't fail.
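One way to track down where the decoding crept in is to check the strings just before the insert. This is only a sketch: is_octets is a hypothetical helper, not part of any module, and the two sample strings are invented.

```perl
use strict;
use warnings;

# Hypothetical helper: true if every character fits in one byte,
# so the string can safely be treated as raw octets.
sub is_octets {
    my ($str) = @_;
    return $str !~ /[^\x00-\xFF]/;
}

my $bytes = "\xE2\x98\xBA";   # raw UTF-8 octets, as they arrive off the wire
my $chars = "\x{263a}";       # the same data after it has been decoded

print "bytes: ", (is_octets($bytes) ? "octets" : "wide characters"), "\n";
print "chars: ", (is_octets($chars) ? "octets" : "wide characters"), "\n";
```

A string that fails this check has been decoded somewhere upstream, and is the kind that triggers "Wide character in subroutine entry".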
