
How can I parse multiple xml-files into one DOM object using perl XML::LibXML?

I want to parse multiple XML files into one DOM object, using the Perl module XML::LibXML.

I have an XML file containing the filenames of other XML files to parse. If at all possible, I would like to parse those other XML files into a single DOM object. I am already able to load each XML file into its own DOM object, one by one. Previously I used the module XML::Simple (which does not support DOM) and could easily merge the arrays it produced from multiple XML files, but I have no idea how to do this with DOM. The exact content of the XML files is not relevant to my question.

It might be possible to do what you're asking with XInclude directives. For example, here's an XML document, called libxml-xinclude.xml, that references two other XML documents:

<wrapper xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:include href="libxml-xinclude-inc1.xml"/>
  <xi:include href="libxml-xinclude-inc2.xml"/>
</wrapper>

The first referenced document, libxml-xinclude-inc1.xml, looks like this:

<doc>
  <title>This is document one</title>
</doc>

And the second referenced document, libxml-xinclude-inc2.xml, looks like this:

<doc>
  <title>This is document two</title>
</doc>

XInclude directives will generally just be considered normal elements (with a namespace), but you can tell some XML parsers to process those directives and replace the elements with the contents of the referenced files. Here's an example using XML::LibXML:

#!/usr/bin/perl

use 5.010;
use strict;
use warnings;

use XML::LibXML;

my $filename = 'libxml-xinclude.xml';

my $parser = XML::LibXML->new();

my $dom = $parser->load_xml(location => $filename);

$parser->process_xincludes( $dom );

say $dom->toString();

Which will produce this output:

<?xml version="1.0"?>
<wrapper xmlns:xi="http://www.w3.org/2001/XInclude">
  <doc>
  <title>This is document one</title>
</doc>
  <doc>
  <title>This is document two</title>
</doc>
</wrapper>

Note that the final document includes the <wrapper> element from the original source as well as all the included elements from the referenced documents. You can now extract the bits you're interested in using XPath expressions.
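As a rough illustration of that last step, here is a minimal sketch that pulls the titles out of the merged document with findnodes(). To keep it self-contained it parses a string equivalent to the output above instead of reading the files; the XPath /wrapper/doc/title is just what fits this example's structure, so adjust it for your own data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use XML::LibXML;

# Stand-in for the document produced by process_xincludes() above.
my $xml = <<'XML';
<wrapper xmlns:xi="http://www.w3.org/2001/XInclude">
  <doc><title>This is document one</title></doc>
  <doc><title>This is document two</title></doc>
</wrapper>
XML

my $dom = XML::LibXML->load_xml(string => $xml);

# <wrapper>, <doc> and <title> are in no namespace, so a plain XPath works.
my @titles = map { $_->textContent } $dom->findnodes('/wrapper/doc/title');

print "$_\n" for @titles;
```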

There are potential security implications with using XInclude. The href attribute can be a URL, so processing it could make HTTP requests from the host where your code runs, or pull in arbitrary files from your system (e.g. href="/etc/passwd"). So you almost certainly wouldn't want to use this on untrusted input, for example in an internet-facing web application.
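If you do need XInclude on input you don't fully trust, one possible mitigation is to inspect the href values before calling process_xincludes(). The sketch below is only an assumption about what a reasonable policy might look like: the allowlist pattern (bare .xml filenames, no directories or URLs) is made up for this example, and the no_network parser option is added as an extra guard against remote fetches.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use XML::LibXML;
use XML::LibXML::XPathContext;

# Example input with one acceptable include and one suspicious one.
my $xml = <<'XML';
<wrapper xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:include href="libxml-xinclude-inc1.xml"/>
  <xi:include href="/etc/passwd"/>
</wrapper>
XML

# no_network asks libxml2 not to fetch anything over the network.
my $dom = XML::LibXML->load_xml(string => $xml, no_network => 1);

# xi:include lives in the XInclude namespace, so register a prefix for XPath.
my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs(xi => 'http://www.w3.org/2001/XInclude');

# Hypothetical policy: only bare filenames ending in .xml are allowed.
my @rejected;
for my $inc ($xpc->findnodes('//xi:include')) {
    my $href = $inc->getAttribute('href') // '';
    push @rejected, $href unless $href =~ /\A[\w.-]+\.xml\z/;
}

if (@rejected) {
    print "refusing to process include: $_\n" for @rejected;
}
else {
    # Safe to call process_xincludes() on $dom here.
    print "all includes look acceptable\n";
}
```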

If you want to parse all of the XML files in one import operation through some fashion of included documents, I don't think that is possible. If this is required then the easiest solution is to write a copypasta script to splice the files together before parsing.

However, I think that your method of reading them one by one is the correct solution. As you read each document, its nodes can be merged into the main document with methods like adoptNode(), which moves a node into another document; you then attach the adopted node somewhere in the tree, for example with appendChild(). http://metacpan.org/pod/distribution/XML-LibXML/lib/XML/LibXML/Document.pod#adoptNode
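Something along these lines might work as a starting point. It is a sketch, not a tested solution for your data: the <wrapper> root element and the inline strings are stand-ins chosen for this example, and in real code you would call load_xml(location => $file) for each filename taken from your index file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use XML::LibXML;

# Stand-ins for the files listed in your index document.
my @sources = (
    '<doc><title>This is document one</title></doc>',
    '<doc><title>This is document two</title></doc>',
);

# Create the main document with an (arbitrary) root element to hold everything.
my $merged = XML::LibXML::Document->new('1.0', 'UTF-8');
my $root   = $merged->createElement('wrapper');
$merged->setDocumentElement($root);

for my $xml (@sources) {
    my $part = XML::LibXML->load_xml(string => $xml);
    my $node = $part->documentElement;

    # adoptNode() moves the node into $merged without copying it;
    # it still has to be attached to the tree afterwards.
    $merged->adoptNode($node);
    $root->appendChild($node);
}

print $merged->toString(1);
```

If you need to keep the source documents intact, importNode() can be used in place of adoptNode(); it makes a copy instead of moving the node.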

HTH
