Extract data from xml file
I have an XML file containing thousands of entries, like:
<gml:featureMember>
<Feature>
<featureType>JCSOutput</featureType>
<property name="gml2_coordsys"></property>
<gml:PointProperty>
<gml:Point>
<gml:coordinates>4048313.294966287,5374397.792158723 </gml:coordinates>
</gml:Point>
</gml:PointProperty>
<property name="BEZEICHNUN">Anton-Bosch-Gasse</property>
<property name="WL_NUMMER">68</property>
</Feature>
</gml:featureMember>
<gml:featureMember>
<Feature>
<featureType>JCSOutput</featureType>
<property name="gml2_coordsys"></property>
<gml:PointProperty>
<gml:Point>
<gml:coordinates>4044355.0231338665,5365146.95116724 </gml:coordinates>
</gml:Point>
</gml:PointProperty>
<property name="BEZEICHNUN">Anschützgasse</property>
<property name="WL_NUMMER">67</property>
</Feature>
</gml:featureMember>
The script should search for a name given in a list (for example Anton-Bosch-Gasse) and copy the whole paragraph starting with <gml:featureMember> to a new file.
What would you use for this purpose: awk, sed, perl?
Sed and awk are not the right tools to parse XML. Reach for Perl:
#!/usr/bin/perl
use warnings;
use strict;
use XML::LibXML;
my $search = 'Anton-Bosch-Gasse';
# Put your real values here!
my $file = '1.xml';
my $uri = 'http://1.2.3';
my $xpc = XML::LibXML::XPathContext->new;
$xpc->registerNs('gml', $uri);
my $xml = XML::LibXML->load_xml(location => $file);
# Evaluate the XPath through $xpc so the registered gml prefix is actually used.
my $r = $xpc->find("//property[.='$search']/ancestor::gml:featureMember", $xml);
print $_->serialize for @$r;
Or, if you find the above example too verbose, you can use xsh:
my $search = 'Anton-Bosch-Gasse' ;
register-namespace gml http://1.2.3 ; # Insert the real URI.
open 1.xml ; # Insert the real path.
ls //property[.=$search]/ancestor::gml:featureMember ;
Using xml_grep, which comes with XML::Twig, you can write
$ xml_grep --root 'gml:featureMember' \
    --cond 'property[string()="Anton-Bosch-Gasse"]' \
    to_grep.xml > extract.xml
Here is a solution like choroba's but using the Mojolicious suite. Its module Mojo::DOM traverses the XML using CSS3 selectors rather than XPath.
Here I first find all of the gml:featureMember elements, then extract the first one that has a matching descendant.
#!/usr/bin/env perl
use strict;
use warnings;
use Mojo::DOM;
use Mojo::Util qw/slurp spurt/;
my $dom = Mojo::DOM->new->xml(1);
# read in from file
# $dom->parse( slurp 'myfile.xml' );
# but for the demo ...
$dom->parse(do{ local $/; <DATA> });
my $found =
$dom->find('gml\:featureMember')
->first(sub{
$_->find('property[name="BEZEICHNUN"]')
->first( qr/\QAnton-Bosch-Gasse/ )
});
spurt "$found", 'output.xml';
__DATA__
<gml:featureMember>
<Feature>
<featureType>JCSOutput</featureType>
<property name="gml2_coordsys"></property>
<gml:PointProperty>
<gml:Point>
<gml:coordinates>4048313.294966287,5374397.792158723 </gml:coordinates>
</gml:Point>
</gml:PointProperty>
<property name="BEZEICHNUN">Anton-Bosch-Gasse</property>
<property name="WL_NUMMER">68</property>
</Feature>
</gml:featureMember>
<gml:featureMember>
<Feature>
<featureType>JCSOutput</featureType>
<property name="gml2_coordsys"></property>
<gml:PointProperty>
<gml:Point>
<gml:coordinates>4044355.0231338665,5365146.95116724 </gml:coordinates>
</gml:Point>
</gml:PointProperty>
<property name="BEZEICHNUN">Anschützgasse</property>
<property name="WL_NUMMER">67</property>
</Feature>
</gml:featureMember>
For this example I grab the XML from the DATA section. You can use the commented-out code to parse from a file instead.
You can also be a little more efficient if you are sure that the property is consistently two levels deep in the structure.
my $found =
$dom->find('gml\:featureMember property[name="BEZEICHNUN"]')
->first( qr/\QAnton-Bosch-Gasse/ )
->parent
->parent;
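The question mentions a list of names, but each answer above looks up a single one. Here is a minimal sketch of the multi-name case. It uses Python's standard library rather than Perl, purely so that it runs without extra modules. The namespace URI, the gml:FeatureCollection wrapper element and the function name extract_members are assumptions of mine, not taken from the asker's file; substitute whatever your document actually declares.

```python
import xml.etree.ElementTree as ET

# Assumption: the file binds the gml prefix to this URI. Use your real one.
GML_NS = "http://www.opengis.net/gml"
# Keep the "gml" prefix when serializing instead of an auto-generated ns0.
ET.register_namespace("gml", GML_NS)


def extract_members(xml_text, names):
    """Return every gml:featureMember (as an XML string) whose
    BEZEICHNUN property is one of the given names."""
    wanted = set(names)
    root = ET.fromstring(xml_text)
    hits = []
    for member in root.iter(f"{{{GML_NS}}}featureMember"):
        labels = (
            (prop.text or "").strip()
            for prop in member.iter("property")
            if prop.get("name") == "BEZEICHNUN"
        )
        if any(label in wanted for label in labels):
            hits.append(ET.tostring(member, encoding="unicode").strip())
    return hits


# Hypothetical, shortened sample; the root element is an assumption.
SAMPLE = """<gml:FeatureCollection xmlns:gml="http://www.opengis.net/gml">
<gml:featureMember><Feature>
<property name="BEZEICHNUN">Anton-Bosch-Gasse</property>
<property name="WL_NUMMER">68</property>
</Feature></gml:featureMember>
<gml:featureMember><Feature>
<property name="BEZEICHNUN">Anschützgasse</property>
<property name="WL_NUMMER">67</property>
</Feature></gml:featureMember>
</gml:FeatureCollection>"""

for fragment in extract_members(SAMPLE, ["Anton-Bosch-Gasse"]):
    print(fragment)
```

For a real file you would read the text from disk (or use ET.parse) and write the returned fragments to the new file. Names are compared after stripping surrounding whitespace, so a label padded with spaces still matches.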