How to extract content between <text></text> tags in Perl from a huge Wikipedia XML file (it can be any XML)?

How can I extract the content between <text></text> tags from a Wikipedia dump in Perl?

I want to process a huge UTF-8 file; loading it into memory as a whole is not possible. The file contains <text>.*?</text> for each page. The content of a single such element does fit into memory, and it should be loaded into a variable for further processing:

      <text xml:space="preserve">Some text without &lt; or &lt; ....
... more text ...
... more text ...</text>

Note that the text does not necessarily start at the beginning of a line or end at the end of one; the important content is whatever lies between <text> and </text>. I want to extract it and clean it up to generate a text file for NLP machine learning.

The file can be downloaded with:

wget http://dumps.wikimedia.org/plwiki/latest/plwiki-latest-pages-articles.xml.bz2

The file can be piped to stdin with:

bzip2 -c -d plwiki-latest-pages-articles.xml.bz2 | perl something > data.txt

I am not very good at Perl and cannot write good code. I do not know how to find the match position, how to build a small state machine, or how to use a moving window.

Any suggestions are welcome.

Something like this will do it:

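#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# The extracted text contains non-ASCII characters, so encode output as UTF-8.
binmode STDOUT, ':encoding(UTF-8)';

# Called once for each complete <text> element.
sub text_handler {
    my ( $twig, $text_elt ) = @_;
    print $text_elt->text, "\n";    # newline keeps pages from running together
    $twig->purge;                   # free everything parsed so far
}

my $twig = XML::Twig->new( twig_handlers => { 'text' => \&text_handler } );
$twig->parsefile('your_xml');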

Note that the trick here is the purge call, which discards the XML that has already been processed. You can probably set a purge on other elements too, if there is a lot of content in between the 'text' nodes.
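
As a rough sketch of that last suggestion, the variant below adds a second handler that purges at the end of each enclosing element. It assumes each article is wrapped in a <page> element, which is not shown in the question, and the helper name each_text and the in-memory sample document are made up for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: parse $source (an XML string or an open filehandle)
# and pass the content of every <text> element to $callback.
sub each_text {
    my ( $source, $callback ) = @_;
    my $twig = XML::Twig->new(
        twig_handlers => {
            # Fires when a <text> element has been read completely.
            text => sub {
                my ( $t, $elt ) = @_;
                $callback->( $elt->text );
                $t->purge;    # drop everything parsed so far
            },
            # Assumption: <page> wraps each article. Purging here as well
            # discards whatever followed </text> inside the page.
            page => sub { $_[0]->purge },
        },
    );
    $twig->parse($source);
    return;
}

# Tiny made-up document standing in for the real dump.
my $sample =
      '<mediawiki>'
    . '<page><title>A</title><revision>'
    . '<text xml:space="preserve">first &lt; text</text>'
    . '</revision></page>'
    . '<page><title>B</title><revision>'
    . '<text xml:space="preserve">second text</text>'
    . '</revision></page>'
    . '</mediawiki>';

my @seen;
each_text( $sample, sub { push @seen, $_[0] } );
print "$_\n" for @seen;
```

To use this with the bzip2 pipe from the question, it should be enough to pass \*STDIN instead of $sample and to print inside the callback, after setting binmode STDOUT, ':encoding(UTF-8)' so that non-ASCII characters are written cleanly.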
