
Perl decoding XML into hash

I need to decode a complex XML structure. The XML looks like this:

<?xml version="1.0" encoding="ISO-8859-1"?>
    <MainNode comment="foo">
      <FirstMainBranch>
        <Struct>
          <String name="aStringValueUnderMainBranch" comment="Child node under first main branch"/>
          <String name="anotherStringValueUnderMainBranch" comment="Child node under first main branch"/>
          <Integer name="anIntegerValueUnderMainBranch" comment="Child node under first main branch"/>
          <List name="aList" comment="According to me this node should be an array, it could contain one or more child elements">
            <Struct comment="The node name means that, the child nodes are grouped, I think that the most appropriate structure here is hash. 
        The node itself doesn't have name attribute, which means that it only shows the type of the element">
          <String name="first" comment="
            Default Value: 0 
                        "/>
          <Long name="second" comment="
            Default Value: 0 

                          "/>
          <Long name="third" comment="
            Default Value: 0 

                        "/>
        </Struct>
      </List>
      <List name="secondList" comment="According to me this node should be array, it could contain one or more child elements">
        <Struct comment="The node name means that, the child nodes are grouped, I think that the most appropriate structure here is hash. 
        The node itself doesn't have name attribute, which means that it only shows the type of the element
                    ">
          <String name="first" comment="
            Default Value: 0 

                          "/>
          <Long name="second" comment="
            Default Value: 0 

                          "/>        
        </Struct>
      </List>
      <Struct name="namedStruct" comment="Here the struct element has a name, which means that it should be decoded
                    ">
        <List name="thirdList" comment="Again list, but now it is inside struct element, and it contains struct element
                ">
          <Struct comment="The node name means that, the child nodes are grouped, I think that the most appropriate structure here is hash.">
            <Integer name="first" comment="Child element of the struct"/>
          </Struct>
        </List>

      </Struct>

    </Struct>
  </FirstMainBranch>
  <SecondMainBranch>
    <Struct comment="">
      <Struct name="namedStructAgain" comment="
                ">
        <String name="First" comment="
                  "/>
        <String name="Second" comment=""/>

      </Struct>
    </Struct>
  </SecondMainBranch>
</MainNode>

I think the most appropriate container is a hash (if you think otherwise, please let me know). I'm finding it difficult to decode, because:

  1. Main nodes do not have a "name" attribute, but they should exist in the final structure.

  2. Child nodes should be read only if they have a "name" attribute, but their data type (structure) depends on a parent element that is itself not decoded.

  3. Some of these parent elements do have a "name" attribute; in that case they should exist in the final structure.

  4. I don't care about the Integer, Long, datetime etc. data types; they will all be read as strings. The main problem is the List and Struct types.

Here is my silly attempt at the task:

use XML::LibXML;
use Data::Dumper;
use strict;
use warnings;
my $parser=XML::LibXML->new();
my $file="c:\\joro\\Data.xml";
my $xmldoc=$parser->parse_file($file);

sub buildHash{
    my $mainParentNode=$_[0];
    my $mainHash=\%{$_[1]};
    my ($waitNextNode,$isArray,$arrayNode);
    $waitNextNode=0;
    $isArray=0;
    sub xmlStructure{
        my $parentNode=$_[0];
        my $href=\%{$_[1]};
        my ($name, %tmp);
        my $parentType=$parentNode->nodeName();
        $name=$parentNode->findnodes('@name');
        foreach my $currentNode($parentNode->findnodes('child::*')){
            my $type=$currentNode->nodeName();
            if ($type&&$type eq 'List'){
                $isArray=1;
            }
            elsif($type&&$type ne 'List'&&$parentType ne 'List'){
                $isArray=0;
                $arrayNode=undef;
            }
            if ($type&&!$currentNode->findnodes('@name')&&$type eq 'Struct'){
                $waitNextNode=1;
            }
            else{
                $waitNextNode=0;
            }
            if ($type&&$type ne 'List'&&$type ne 'Struct'&&!$currentNode->findnodes('@name')){
                #$href->{$currentNode->nodeName()}={};
                xmlStructure($currentNode,$href->{$currentNode->nodeName()});
            }
            # elsif ($type&&$type eq 'List'&&$currentNode->findnodes('@name')){
            # print "2\n";
            # $href->{$currentNode->findnodes('@name')}=[];
            # xmlStructure($currentNode,$href->{$currentNode->findnodes('@name')});
            # }
            elsif ($type&&$type ne 'List'&&$currentNode->findnodes('@name')&&$parentType eq 'List'){
                push(@{$href->{$currentNode->findnodes('@name')}},$currentNode->findnodes('@name'));
                xmlStructure($currentNode,$href->{$currentNode->findnodes('@name')});

            }
            # elsif ($type&&$type ne 'List'&&!$currentNode->findnodes('@name')&&$parentType eq 'List'){
            # print "4\n";
            # push(@{$$href->{$currentNode->findnodes('@name')}},{});
            ##print Dumper %{$arrayNode};
            # xmlStructure($currentNode,$href->{$currentNode->findnodes('@name')});
            # }
            else{
                xmlStructure($currentNode,$href->{$currentNode->findnodes('@name')});
            }
        }

    }
    xmlStructure($mainParentNode,$mainHash);
}
my %href;
buildHash($xmldoc->findnodes('*'),\%href);
print "Printing the real HASH\n";
print Dumper %href;

But there is a long way to go, because:

  1. There is a parasitic, probably undefined, element (the '' key in the output) between the key and the value.

  2. I cannot find a way to change the child's data type from hash to array where needed.

Here is the output:

$VAR1 = 'FirstMainBranch';
$VAR2 = {
          '' => {
                  'aList' => {
                             '' => {
                                     'third' => {},
                                     'second' => {},
                                     'first' => {}
                                   }
                           },
                  'namedStruct' => {
                                   'thirdList' => {
                                                  '' => {
                                                          'first' => {}
                                                        }
                                                }
                                 },
                  'anotherStringValueUnderMainBranch' => {},
                  'secondList' => {
                                  '' => {
                                          'second' => {},
                                          'first' => {}
                                        }
                                },
                  'aStringValueUnderMainBranch' => {},
                  'anIntegerValueUnderMainBranch' => {}
                }
        };
$VAR3 = 'SecondMainBranch';
$VAR4 = {
          '' => {
                  'namedStructAgain' => {
                                        'First' => {},
                                        'Second' => {}
                                      }
                }
        };

Any help will be appreciated. Thank you in advance.
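One way to read the four rules listed earlier is as a single recursive function over the DOM. The following is only a sketch under stated assumptions: the name `decode_struct` is hypothetical, the XML is a trimmed copy of the one in the question, an unnamed Struct directly under a List becomes one array element, any other unnamed Struct is merged into its parent, and leaf values are left as `undef` because the XML describes structure only.

```perl
use strict;
use warnings;
use XML::LibXML;
use Data::Dumper;

# Trimmed copy of the question's XML (comment attributes dropped for brevity).
my $xml = <<'XML';
<MainNode comment="foo">
  <FirstMainBranch>
    <Struct>
      <String name="aStringValueUnderMainBranch"/>
      <List name="aList">
        <Struct>
          <String name="first"/>
          <Long name="second"/>
        </Struct>
      </List>
      <Struct name="namedStruct">
        <List name="thirdList">
          <Struct><Integer name="first"/></Struct>
        </List>
      </Struct>
    </Struct>
  </FirstMainBranch>
  <SecondMainBranch>
    <Struct>
      <Struct name="namedStructAgain">
        <String name="First"/>
        <String name="Second"/>
      </Struct>
    </Struct>
  </SecondMainBranch>
</MainNode>
XML

# Decode the child elements of $node into a hash reference (hypothetical helper).
sub decode_struct {
    my ($node) = @_;
    my %hash;
    for my $child ($node->findnodes('./*')) {
        my $type = $child->nodeName;
        my $name = $child->getAttribute('name');    # undef when the attribute is absent

        if ($type eq 'List') {
            # A List becomes an array with one hash per Struct it contains.
            next unless defined $name;
            $hash{$name} = [ map decode_struct($_), $child->findnodes('./Struct') ];
        }
        elsif ($type eq 'Struct') {
            my $inner = decode_struct($child);
            if (defined $name) { $hash{$name} = $inner }             # rule 3: named parent is kept
            else               { %hash = (%hash, %$inner) }          # unnamed wrapper: merge upwards
        }
        elsif (defined $name) {
            $hash{$name} = undef;                                    # rule 4: scalar leaf, type ignored
        }
        else {
            $hash{$type} = decode_struct($child);                    # rule 1: unnamed main node
        }
    }
    return \%hash;
}

my $doc    = XML::LibXML->load_xml(string => $xml);
my $result = decode_struct($doc->documentElement);

$Data::Dumper::Sortkeys = 1;
print Dumper($result);
```

Because unnamed Struct wrappers are merged rather than stored, no empty-string key appears, and each List name maps to an array reference instead of a hash.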

Edit: In relation to Sobrique's comment about the XY problem:

Here is the example string I want to parse:

(1,2,"N/A",-1,"foo","bar",NULL,3,2016-03-18 08:12:00.000,2016-03-18 08:12:00.559,2016-03-18 08:12:00.520,0,0,NULL,"foo","123456789",{NULL,NULL,NULL,NULL,NULL,NULL,2016-04-17 11:59:59.999,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,null,NULL,NULL,NULL,NULL,3,0,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,T,0,NULL,NULL,NULL,"9876543210",NULL,"foo","0","bar","foo","a1820000264d979c","0,0",NULL,"foo","192.168.1.82","SOAP",NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL},{INPUT="bar"},{aStringValueUnderMainBranch="ET", aList[{"first", "second", "third"}, {"first", "second", "third"}], secondList[{"first", "second"}, {"first", "second"}],namedStruct{thirdList[{first},{first}]}},{namedStructAgain{"first", "second"}},NULL,NULL,NULL,NULL,NULL)

Somehow I should separate all the values and then identify this part:

{aStringValueUnderMainBranch="ET", aList[{"first", "second", "third"}, {"first", "second", "third"}], secondList[{"first", "second"}, {"first", "second"}],namedStruct{thirdList[{first},{first}]}}

as FirstMainBranch and parse the corresponding values as shown in the XML. After that I should identify:

{namedStructAgain{"first", "second"}}

as SecondMainBranch and get the respective values. There is an additional problem with the primary data separation: commas must not be treated as separators when they fall between parentheses.
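The primary separation described here can be sketched with a small depth-tracking splitter. This is an illustration, not a full parser: `split_top_level` is a hypothetical helper, the record is a shortened version of the one above, and it assumes that double quotes are never escaped inside a quoted value.

```perl
use strict;
use warnings;

# Split $text on commas that are not inside quotes, (), [] or {}.
sub split_top_level {
    my ($text) = @_;
    my @fields;
    my $current   = '';
    my $depth     = 0;    # current nesting level of brackets
    my $in_quotes = 0;    # true while inside a "..." value

    for my $char (split //, $text) {
        if ($char eq '"') {
            $in_quotes = !$in_quotes;
        }
        elsif (!$in_quotes) {
            if    ($char =~ /[(\[{]/) { $depth++ }
            elsif ($char =~ /[)\]}]/) { $depth-- }
            elsif ($char eq ',' && $depth == 0) {
                # Top-level comma: finish the current field.
                push @fields, $current;
                $current = '';
                next;
            }
        }
        $current .= $char;
    }
    push @fields, $current;

    s/^\s+|\s+$//g for @fields;    # trim surrounding whitespace
    return @fields;
}

# Shortened version of the record from the question.
my $record = '(1,"N/A",2016-03-18 08:12:00.000,{NULL,"0,0"},{INPUT="bar"},'
           . '{aStringValueUnderMainBranch="ET", aList[{"first", "second"}]},'
           . '{namedStructAgain{"first", "second"}},NULL)';

# Strip the outer parentheses, then split what is left.
(my $body = $record) =~ s/^\((.*)\)$/$1/s;
my @fields = split_top_level($body);

print scalar(@fields), " fields\n";
print "$_\n" for @fields;
```

Each brace-delimited group comes back as one field, so the FirstMainBranch and SecondMainBranch parts can then be picked out by position and split again with the same helper after removing their outer braces.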

I would use a different approach. Instead of converting the XML into a hash, I would map it to objects using XML::Rabbit. I wrote a small article about how to use it, with a complete working example.

XML::Rabbit has a series of advantages:

  • Work with simple Moose objects.
  • Define the objects to be obtained in a declarative way, using XPath.
  • Parse / define only what you want. No need to get all the information out of the XML.

If your XML files are small enough for XPath and a DOM, I've found this method very clean and easy to maintain.
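As a rough illustration of the declarative style, a mapping for the List elements of the XML in the question might look like the sketch below. The class names `Schema::List` and `Schema::Document` are hypothetical, the XML is trimmed, and the `xml => ...` constructor argument is written as I understand the XML::Rabbit documentation (the module's synopsis itself uses `file => ...`), so treat it as an assumption to verify.

```perl
use strict;
use warnings;

package Schema::List;          # hypothetical class: one <List name="..."> element
use XML::Rabbit;

has_xpath_value      'name'        => './@name';
has_xpath_value_list 'field_names' => './Struct/*/@name';

finalize_class();

package Schema::Document;      # hypothetical class: the whole document
use XML::Rabbit::Root;

has_xpath_value       'comment' => '/MainNode/@comment';
has_xpath_object_list 'lists'   => '//List[@name]' => 'Schema::List';

finalize_class();

package main;

# Trimmed copy of the question's XML.
my $xml = <<'XML';
<MainNode comment="foo">
  <FirstMainBranch>
    <Struct>
      <List name="aList">
        <Struct>
          <String name="first"/>
          <Long name="second"/>
          <Long name="third"/>
        </Struct>
      </List>
      <List name="secondList">
        <Struct>
          <String name="first"/>
          <Long name="second"/>
        </Struct>
      </List>
    </Struct>
  </FirstMainBranch>
</MainNode>
XML

my $doc = Schema::Document->new(xml => $xml);

# Only the declared parts of the document are extracted, on demand.
for my $list (@{ $doc->lists }) {
    printf "%s: %s\n", $list->name, join(', ', @{ $list->field_names });
}
```

Only the attributes declared with `has_xpath_*` are read, which is the "parse only what you want" point from the list of advantages.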
