
Shell scripting - split xml into multiple files

I am trying to split a big XML file into multiple files, and have used the following code in an AWK script.

/<fileItem>/ {
        rfile="fileItem" count ".xml"
        print "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" > rfile
        print $0 > rfile
        getline
        while ($0 !~ "<\/fileItem>" ) {
                print > rfile
                getline
        }
        print $0 > rfile
        close(rfile)
        count++
}

The code above generates a list of XML files whose names read "fileItem_1", "fileItem_2", "fileItem_3", etc.

However, I would like the file name to be something like "item_XXXXX", where XXXXX is the value of a node inside the XML, as shown below:

<fileItem>
<id>12345</id>
<name>XXXXX</name>
</fileItem>

So, basically I want the "id" node to be the filename. Can anyone please help me with this?

I would not use getline. (I even read in an AWK book that it is not recommended to use it.) I think using global variables for state is even simpler. (Expressions with global variables may be used in patterns too.)

The script could look like this:

test-split-xml.awk:

/<fileItem>/ {
  collect = 1 ; buffer = "" ; file = "fileItem_"count".xml"
  ++count
}

collect > 0 {
  if (buffer != "") buffer = buffer"\n"
  buffer = buffer $0
}

collect > 0 && /<name>.+<\/name>/ {
  # cut "...<name>"
  i = index($0, "<name>") ; file = substr($0, i + 6)
  # cut "</name>..."
  i = index(file, "</name>") ; file = substr(file, 1, i - 1)
  file = file".xml"
}

/<\/fileItem>/ {
  collect = 0;
  print file
  print "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" >file
  print buffer >file
  # close the file so that large inputs do not exhaust open file handles
  close(file)
}

I prepared some sample data for a small test:

test-split-xml.xml:

<?xml version="1.0" encoding="UTF-8"?>
<top>
  <some>
    <fileItem>
      <id>1</id>
      <name>X1</name>
    </fileItem>
  </some>
  <fileItem>
    <id>2</id>
    <name>X2</name>
  </fileItem>
  <fileItem>
    <id>2</id>
    <!--name>X2</name-->
  </fileItem>
  <any> other input </any>
</top>

... and got the following output:

$ awk -f test-split-xml.awk test-split-xml.xml
X1.xml
X2.xml
fileItem_2.xml

$ more X1.xml 
<?xml version="1.0" encoding="UTF-8"?>
    <fileItem>
      <id>1</id>
      <name>X1</name>
    </fileItem>

$ more X2.xml
<?xml version="1.0" encoding="UTF-8"?>
  <fileItem>
    <id>2</id>
    <name>X2</name>
  </fileItem>

$ more fileItem_2.xml 
<?xml version="1.0" encoding="UTF-8"?>
  <fileItem>
    <id>2</id>
    <!--name>X2</name-->
  </fileItem>

$

The comment of tripleee is reasonable. Thus, such processing should be limited to personal usage, because different (and legal) formattings of XML files could cause errors in this script.
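To illustrate the kind of formatting difference meant here, the following sketch counts how many opening tags the pattern /<fileItem>/ recognises. The three input lines are made-up examples (the kind attribute is hypothetical); all of them are legal ways to open a fileItem element, but only the first one matches.

```shell
#!/bin/sh
# Three legal spellings of an opening tag; "kind" is a hypothetical attribute.
matched=$(printf '%s\n' '<fileItem>' '<fileItem kind="a">' '<fileItem >' |
  awk '/<fileItem>/ { n++ } END { print n + 0 }')

echo "$matched of 3 opening tags matched"
```

A line-based script would silently skip the other two items, which is the risk the comment points out.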

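As you will notice, there is no next in the whole script. This is intentional: every line has to be checked against all rules (for example, the line containing <fileItem> must also reach the collecting rule).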
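A small sketch of the effect, using a toy three-line input and a line counter instead of the buffer (both are assumptions for illustration, not part of the script above): without next the opening tag falls through to the collecting rule, with next it is lost.

```shell
#!/bin/sh
# Toy input: one item, one tag per line.
input='<fileItem>
<id>1</id>
</fileItem>'

# No next: the opening-tag line also reaches the collecting rule.
without_next=$(printf '%s\n' "$input" | awk '
/<fileItem>/ { collect = 1 }
collect      { n++ }
END          { print n + 0 }
')

# With next: awk skips the remaining rules for the opening-tag line.
with_next=$(printf '%s\n' "$input" | awk '
/<fileItem>/ { collect = 1; next }
collect      { n++ }
END          { print n + 0 }
')

echo "collected without next: $without_next"
echo "collected with next:    $with_next"
```

The first count is 3 and the second is 2, so a next in the first rule would drop the <fileItem> line from the output file.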

First and foremost: you need a parser for this.

XML is a contextual data format. Regular expressions are not. So you can never make a regular-expression-based processing system actually work properly.

It's just bad news.

But parsers do exist, and they're quite easy to work with. I can give you a better example with a better data input. But I would use XML::Twig and perl to do this:

#!/usr/bin/env perl
use strict;
use warnings;

use XML::Twig;


#subroutine to extract and process the item
sub save_item {
   my ( $twig, $item ) = @_;
   #retrieve the id
   my $id = $item -> first_child_text('id'); 
   print "Got ID of $id\n";

   #create a new XML document for output. 
   my $new_xml = XML::Twig -> new;
   $new_xml -> set_root (XML::Twig::Elt -> new ( 'root' ));

   #cut and paste the item from the 'old' doc into the 'new'  
   #note - "cut" applies to in memory, 
   #not the 'on disk' copy. 
   $item -> cut;
   $item -> paste ( $new_xml -> root );

   #set XML params (not strictly needed but good style)
   $new_xml -> set_encoding ('utf-8');
   $new_xml -> set_xml_version ('1.0');

   #set output formatting
   $new_xml -> set_pretty_print('indented_a');

   print "Generated new XML:\n";
   $new_xml -> print;

   #open a file for output
   open ( my $output, '>', "item_$id.xml" ) or warn $!;
   print {$output} $new_xml->sprint;
   close ( $output ); 
}

#create a parser. 
my $twig = XML::Twig -> new ( twig_handlers => { 'fileItem' => \&save_item } );
#run this parser on the __DATA__ filehandle below.
#you probably want parsefile('some_file.xml') instead. 
   $twig -> parse ( \*DATA );


__DATA__
<xml>
<fileItem>
<id>12345</id>
<name>XXXXX</name>
</fileItem>
</xml>

XML::Twig comes with xml_split, which may be suited to your needs.

If your XML is really that well formed and consistent, then all you need is:

awk -F'[<>]' '
/<fileItem>/ { header="<?xml version=\"1.0\" encoding=\"UTF-8\"?>" ORS $0; next }
/<id>/ { close(out); out="item_" $3; $0=header ORS $0 }
{ print > out }
' file

The above is untested, of course, since you didn't provide sample input/output for us to test a possible solution against.
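For anyone who wants to try the idea, here is a minimal, self-contained sketch. It makes several assumptions that are not in the answer: the two-item sample input is invented, a ".xml" suffix is appended to the output name, and an out != "" guard is added so that lines before the first <id> are not printed to an empty file name. It still relies on one tag per line.

```shell
#!/bin/sh
# Work in a scratch directory so nothing in the current directory is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample input: one tag per line, as in the question.
cat > file <<'EOF'
<fileItem>
<id>12345</id>
<name>XXXXX</name>
</fileItem>
<fileItem>
<id>67890</id>
<name>YYYYY</name>
</fileItem>
EOF

awk -F'[<>]' '
/<fileItem>/ { header = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" ORS $0; next }
/<id>/       { close(out); out = "item_" $3 ".xml"; $0 = header ORS $0 }
out != ""    { print > out }
' file

ls
cat item_12345.xml
```

Each output file starts with the XML declaration followed by the complete fileItem element, because the opening tag is held back in header until the id (and therefore the file name) is known.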
