
Perl: How to handle a stream of XML Objects without a root node

I need to parse a huge file with Perl, so I'll be using a streaming parser. The file contains multiple XML documents (objects), but no root node. This causes the XML parser to abort after the first object, as it should. The answer is probably to prefix and suffix the stream with a fake root node:

<FAKE_ROOT_TAG>Original Stream</FAKE_ROOT_TAG>

Since the file is huge (>1 GByte), I don't want to copy or rewrite it. I would rather use a class/module that "merges" or "concatenates" multiple streams, transparently for the XML parser:

stream1 : <FAKE_ROOT_TAG>                 \
stream2 : Original Stream from file        >   merged stream
stream3 : </FAKE_ROOT_TAG>                / 

Can you point me to such a module or sample code for this problem?

Here's a simple example of how you might do it by passing a fake filehandle to your XML parser. This object overloads the readline operator ( <> ) to return your fake root tags with the lines from the file in between.

package FakeFile;

use strict;
use warnings;

# Make <$obj> call my_readline, so the object can stand in for a filehandle.
use overload '<>' => \&my_readline;

sub new {
    my $class    = shift;
    my $filename = shift;

    open my $fh, '<', $filename or die "open $filename: $!";

    return bless { fh => $fh }, $class;
}

# Returns the opening tag first, then the file line by line,
# then the closing tag, and nothing (undef) once everything is consumed.
sub my_readline {
    my $self = shift;
    return if $self->{done};

    if ( not $self->{started} ) {
        $self->{started} = 1;
        return '<fake_root_tag>';
    }

    if ( eof $self->{fh} ) {
        $self->{done} = 1;
        return '</fake_root_tag>';
    }

    return readline $self->{fh};
}

1;

This won't work if your parser expects a genuine filehandle (e.g. one it reads with something like sysread ), but perhaps you'll find it inspirational.

Example usage:

echo "one
two
three" > myfile
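# -I. is needed on Perl 5.26 and later, which no longer search the current directory for modules
perl -I. -MFakeFile -E 'my $f = FakeFile->new( "myfile" ); print while <$f>'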
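
If the parser does insist on a real filehandle, one option is to let a child process emit the wrapped stream and read it back through a pipe. The following is only a minimal sketch, assuming a Unix-like system where Perl's forking open ( '-|' ) is available. The name open_wrapped, the default tag name and the 64 KiB block size are illustrative choices, not part of the answer above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helper: returns a genuine (pipe) filehandle that yields
#   <tag> + the contents of $filename + </tag>
sub open_wrapped {
    my ( $filename, $tag ) = @_;
    $tag //= 'fake_root_tag';

    # Forking open: the parent reads whatever the child prints to STDOUT.
    my $pid = open( my $fh, '-|' ) // die "fork: $!";
    return $fh if $pid;    # parent gets the read end of the pipe

    # Child: copy the file in fixed-size blocks between the two tags.
    open my $in, '<:raw', $filename
        or do { warn "open $filename: $!\n"; POSIX::_exit(1) };
    binmode STDOUT;
    print "<$tag>";
    local $/ = \65536;     # read 64 KiB records instead of lines
    print while <$in>;
    print "</$tag>";
    close STDOUT;          # flush before leaving
    POSIX::_exit(0);       # skip END blocks inherited from the parent
}

# Usage: print the wrapped stream of the file named on the command line.
if (@ARGV) {
    my $fh = open_wrapped( $ARGV[0] );
    print while <$fh>;
    close $fh;
}
```

The file is never copied to disk; the child streams it block by block, so memory use stays small even for very large inputs. The returned handle can then be given to any parser that accepts an open filehandle.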

Here's a trick pulled from PerlMonks:

#!/usr/bin/perl

use strict;
use warnings;

use XML::Parser;
use XML::LibXML;

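# Path of the rootless XML file, taken from the command line.
my $doc_file = shift @ARGV;

# A small wrapper document: the real file is pulled in as an external
# entity, so the parser sees it inside <doc>...</doc> without it being copied.
my $xml = qq{
     <!DOCTYPE doc 
           [<!ENTITY real_doc SYSTEM "$doc_file">]
     >
     <doc>
         &real_doc;
     </doc>
};

# XML::Parser is event based; with no handler subs defined, the Stream
# style prints a copy of the parsed document.
{ print "XML::Parser:\n";
  my $t = XML::Parser->new( Style => 'Stream' )->parse($xml);
}

# XML::LibXML's parse_string builds the whole document in memory, which
# matters for a very large file. Whether the external entity is expanded
# depends on the parser options (see expand_entities and load_ext_dtd).
{ print "XML::LibXML:\n";
  my $parser = XML::LibXML->new();
  my $doc = $parser->parse_string($xml);
  print $doc->toString;
}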
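
With XML::Parser specifically, the fake root can also be fed to the parser by hand through its non-blocking interface ( parse_start , parse_more , parse_done ), so neither a wrapper object nor an entity declaration is needed. This is a rough sketch rather than a tested recipe for any particular file: count_objects is a hypothetical helper that just counts the elements sitting directly below the fake root, and the chunk size is arbitrary. Like the other approaches here, it assumes the individual objects do not each carry their own <?xml ...?> declaration, since that is only allowed at the very start of a document.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use XML::Parser;

# Hypothetical helper: parse $filename as if it were wrapped in
# <fake_root_tag>...</fake_root_tag> and count its top-level objects.
sub count_objects {
    my ($filename) = @_;
    my $count = 0;

    my $parser = XML::Parser->new(
        Handlers => {
            Start => sub {
                my ($expat) = @_;
                # Inside a Start handler the context holds only the
                # ancestors, so depth 1 means "child of the fake root".
                $count++ if $expat->depth == 1;
            },
        },
    );

    my $nb = $parser->parse_start;    # an XML::Parser::ExpatNB object
    $nb->parse_more('<fake_root_tag>');

    # Feed the file in raw chunks; the parser copes with chunk
    # boundaries that fall in the middle of a tag or character.
    open my $fh, '<:raw', $filename or die "open $filename: $!";
    while ( read $fh, my $chunk, 65536 ) {
        $nb->parse_more($chunk);
    }
    close $fh;

    $nb->parse_more('</fake_root_tag>');
    $nb->parse_done;

    return $count;
}

# Usage: perl count_objects.pl huge_file.xml
print count_objects( $ARGV[0] ), "\n" if @ARGV;
```

In real use the Start handler would be replaced by whatever Start, End and Char handlers the application needs; the wrapping logic around them stays the same.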
