
How to extract data between a particular tag from a string in Perl?

For example, from the following string

<?xml version="1.0"?><root><point><message>hello world 1</message></point><point><data><message>hello world 2</message></data></point></root>

if I want to extract the contents of the <message> tags, the result should be

hello world 1
hello world 2

Is there an easy way to do this?

All I can think of is to first find the positions of <message> and </message> and then extract the substrings in a loop. Is there a better way?

Your data is not XML, so I guess you'll have to use a regular expression for that:

perl -n -E'say $1 while m{<message>(.*?)</message>}g' your_file_here.xml 
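Since the question is about a string rather than a file, the same pattern can also be applied inside a script. This is only a sketch: the variable names $xml and @messages are made up for illustration, and like any regex approach it assumes the tags carry no attributes, are not nested, and contain no CDATA sections or entities.

```perl
use strict;
use warnings;

# The sample string from the question.
my $xml = '<?xml version="1.0"?><root><point><message>hello world 1</message></point>'
        . '<point><data><message>hello world 2</message></data></point></root>';

# In list context, a /g match returns every captured group at once.
# /s lets "." match newlines, in case a message spans several lines.
my @messages = $xml =~ m{<message>(.*?)</message>}gs;

print "$_\n" for @messages;
```

The non-greedy (.*?) matters here: a greedy (.*) would run from the first <message> to the last </message> and return a single match.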

If your file were proper XML, then XML::Twig would work nicely. You could even use the xml_grep tool that comes with it to do just what you want.

Update: with valid XML you can then do

xml_grep --text_only message mes.xml 

or

xml_grep2 --text_only '//message' mes.xml # xml_grep2 is in App::xml_grep2

or

perl -MXML::Twig -E'XML::Twig->new( twig_handlers => 
                                      { message => sub { say $_->text; }, })
                             ->parsefile( "mes.xml")'

Use an XML parser. XML::Parser in Subs mode seems good enough.
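A rough sketch of what that might look like, assuming XML::Parser is installed. The package name MessageSubs and the variables @messages and $in_message are invented for this example. The Subs style only dispatches start and end tags (to subs named after the element, with an underscore appended for the end tag), so a separate Char handler is registered here to collect the text.

```perl
use strict;
use warnings;
use XML::Parser;

my $xml = '<?xml version="1.0"?><root><point><message>hello world 1</message></point>'
        . '<point><data><message>hello world 2</message></data></point></root>';

my @messages;          # collected text, one entry per <message> element
my $in_message = 0;    # true while the parser is inside a <message>

{
    # Hypothetical package holding the per-element callbacks.
    package MessageSubs;
    sub message  { $in_message = 1; push @messages, ''; }   # <message>
    sub message_ { $in_message = 0; }                       # </message>
}

my $parser = XML::Parser->new(
    Style    => 'Subs',
    Pkg      => 'MessageSubs',
    Handlers => {
        # Text may arrive in several chunks, so append rather than assign.
        Char => sub {
            my ($expat, $text) = @_;
            $messages[-1] .= $text if $in_message;
        },
    },
);

$parser->parse($xml);

print "$_\n" for @messages;
```

Elements without a matching sub (root, point and data here) are simply skipped by the Subs style.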

Use an XML parser. I like XML::LibXML.

use strict;
use warnings;
use feature qw( say );

use XML::LibXML qw( );

my $xml = <<'__EOI__';
<?xml version="1.0"?><root>
<point><message>hello world 1</message></point>
<point><data><message>hello world 2</message></data></point>
</root>
__EOI__

my $parser = XML::LibXML->new();
my $doc    = $parser->parse_string($xml);
my $root   = $doc->documentElement();

say $_->textContent() for $root->findnodes('//message');
