Extract data from an XML file with XML::LibXML

I have an XML file like this one, containing thousands of entries:

<mediawiki>
  <page>
    <title>page1</title>
    <revision>
      <id>2621</id>
      <parentid>6</parentid>
      <timestamp>2005-10-09T01:00:18Z</timestamp>
      <contributor>
        <username>Chaos</username>
        <id>2</id>
      </contributor>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve">text1</text>
    </revision>
  </page>
  <page>
    <title>page2</title>
    <ns>8</ns>
    <id>7</id>
    <revision>
      <id>2619</id>
      <parentid>2618</parentid>
      <timestamp>2005-10-09T00:56:39Z</timestamp>
      <contributor>
        <username>Chaos</username>
        <id>2</id>
      </contributor>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve">text2</text>
    </revision>
  </page>
  <page>
    <title>page3</title>
    <ns>8</ns>
    <id>6</id>
    <revision>
      <id>2621</id>
      <parentid>6</parentid>
      <timestamp>2005-10-09T01:00:18Z</timestamp>
      <contributor>
        <username>Chaos</username>
        <id>2</id>
      </contributor>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve">text3</text>
    </revision>
  </page>
</mediawiki>

My script must write each page to its own text file, named after the content of the <title> tag and containing the content of <text xml:space="preserve"></text>.

My code:

my $filename = "pages.xml";
my $parser   = XML::LibXML->new();
my $xmldoc   = $parser->parse_file( $filename );
my $file;

foreach my $page ( $xmldoc->findnodes( '/mediawiki/page' ) ) {

    foreach my $title ( $page->findnodes( '/mediawiki/page/title' ) ) {

        foreach my $rev ( $page->findnodes( '/mediawiki/page/revision' ) ) {

            foreach my $text ( $rev->findnodes( 'text/text()' ) ) {

                $file = $title->to_literal();
                my $newfile = "$file.txt";

                open( my $out, '>:utf8', $newfile )
                        or die "Unable to open '$newfile' for write: $!";
                my $texte = $text->data;
                print $out "$text\n";
                close $out;
            }
        }
    }
}

The problem is that every file created contains the content of the last <text xml:space="preserve"></text> tag.

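Your mistake is nesting all those for loops and not using relative XPath expressions. A path that starts with / is evaluated from the document root no matter which node findnodes is called on, so for every page the inner loops visit every title and every revision in the file, and each output file is overwritten until it holds the last text.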

This should do what you want:

use utf8;
use strict;
use warnings 'all';
use feature 'say';

STDOUT->autoflush;

use XML::LibXML;

my $filename = "pages.xml";
my $doc      = XML::LibXML->load_xml( location => $filename );

# Visit each <page> element in turn
for my $page ( $doc->findnodes('/mediawiki/page') ) {

    # Relative XPath: only the <title> child of this page
    my ($title) = $page->findnodes('title');
    my $file = $title->textContent;

    # Relative XPath: the <text> inside this page's <revision>
    my ($rev_text) = $page->findnodes('revision/text');
    my $text = $rev_text->textContent;

    open my $fh, '>:utf8', $file
        or die qq{Unable to open "$file" for output: $!};

    print $fh "$text\n";

    close $fh;

    say qq{File "$file" written with "$text"};
}

Output:

File "page1" written with "text1"
File "page2" written with "text2"
File "page3" written with "text3"
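
To see why the relative paths matter, here is a minimal sketch that runs both kinds of expression against the same <page> node. The two-page document in the heredoc is a cut-down stand-in invented for illustration, not the real pages.xml, and the variable names are hypothetical.

```perl
use strict;
use warnings;
use XML::LibXML;

# Made-up sample with the same shape as the file in the question
my $xml = <<'XML';
<mediawiki>
  <page><title>page1</title><revision><text>text1</text></revision></page>
  <page><title>page2</title><revision><text>text2</text></revision></page>
</mediawiki>
XML

my $doc = XML::LibXML->load_xml( string => $xml );

# Take the first <page> as the context node for both queries
my ($first_page) = $doc->findnodes('/mediawiki/page');

# Absolute path: starts at the document root, so the context node is ignored
my @absolute = map { $_->textContent }
    $first_page->findnodes('/mediawiki/page/title');

# Relative path: only the <title> children of this particular <page>
my @relative = map { $_->textContent } $first_page->findnodes('title');

print "absolute: @absolute\n";    # absolute: page1 page2
print "relative: @relative\n";    # relative: page1
```

The absolute query returns every title in the document even though it was called on a single page, which is what made the nested loops in the question overwrite each file with the last text.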
